
ateague's scratchpad

by ateague (Monk)
on Mar 10, 2009 at 19:57 UTC ( #749712=scratchpad )

At $WORK I use pdftohtml with the following command line:

    pdftohtml.exe -xml -stdout -zoom 1.4 [PDF FILE]

This rips all the text elements out into XML, with attributes for the font, the x/y position on the page, and the rendered width and height of the text. (-zoom 1.4 scales the positioning units to about 100 dpi; -stdout streams the output to STDOUT instead of writing it to a file.)

Here is an example of what I typically work with:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
    <pdf2xml>
    <page number="1" position="absolute" top="0" left="0" height="1100" width="850">
        <fontspec id="0" size="17" family="Times" color="#000000"/>
        <text top="103" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
        <text top="120" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
        <text top="186" left="115" width="103" height="18" font="0">ROUTE TO:</text>
        <text top="186" left="265" width="107" height="17" font="0">Audit Billing</text>
        <text top="220" left="115" width="128" height="18" font="0">SORT GROUP:</text>
        <text top="220" left="265" width="152" height="18" font="0">Invoice Sort Group</text>
        <text top="286" left="115" width="260" height="18" font="0">OH_GOD_IT_BURNS 2013-12-20</text>
        <text top="286" left="415" width="71" height="18" font="0">23:53:04</text>
        <text top="286" left="545" width="108" height="18" font="0">FOOBAR</text>
        <text top="320" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
        <text top="336" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
    </page>
    </pdf2xml>
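Because the coordinates come out at roughly 100 units per inch, an expected position can be turned into a small tolerance window with plain arithmetic. The following is only a sketch under that assumption; `padded_range` and `$UNITS_PER_INCH` are hypothetical names introduced here for illustration, not part of pdftohtml or any module.

```perl
use strict;
use warnings;

# Assumption taken from the text: with -zoom 1.4, one unit is about 1/100 inch.
my $UNITS_PER_INCH = 100;

# Hypothetical helper: return the (min, max) window around an expected
# coordinate, padded by $pad_inches on each side.
sub padded_range {
    my ($expected, $pad_inches) = @_;
    my $pad = int($pad_inches * $UNITS_PER_INCH + 0.5);    # round to whole units
    return ($expected - $pad, $expected + $pad);
}

# "Audit Billing" sits at top="186" left="265" in the sample above.
my ($top_min,  $top_max)  = padded_range(186, 0.05);
my ($left_min, $left_max) = padded_range(265, 0.05);
printf "top %d..%d, left %d..%d\n", $top_min, $top_max, $left_min, $left_max;
```

A window like this is more forgiving than matching on exact coordinates, since the reported positions can shift by a unit or two between fonts and renderings.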

I can then use XML::Twig with XPath expressions to pull the exact XML nodes I want:

    use XML::Twig;

    # $PDF_FILE and the RouteTo / InvoiceSort handler subs are defined elsewhere
    open(my $XML, "-|", "pdftohtml.exe -xml -zoom 1.4 -stdout $PDF_FILE")
        or die "$!\n$^E";

    # We are only interested in the text for the "ROUTE TO:" and "SORT GROUP:" sections
    # Set the twig_handlers to extract the <text> nodes of interest; all other nodes will be ignored
    # XPath queries provide an extra 1/20 inch padding on all sides to take font and rendering variations into account
    # (both values sit in the column at left="265" in the sample above)
    my $t = XML::Twig->new(
        twig_handlers => {
            '//text[(@top >= 180 and @top <= 190) and (@left >= 260 and @left <= 270)]' => \&RouteTo,
            '//text[(@top >= 215 and @top <= 225) and (@left >= 260 and @left <= 270)]' => \&InvoiceSort,
        },
        comments   => 'drop',      # remove any comments
        empty_tags => 'normal',    # empty tags = <tag/>
    );
    $t->parse($XML);
    $t->purge;
    close $XML;
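The handler subs themselves are not shown above. As a rough sketch of what they might look like: XML::Twig calls each handler with the twig and the matched element, so a handler can simply grab the element's text. The `%fields` hash, the handler bodies and the inline `$xml` string (a trimmed copy of the sample, so the script runs without pdftohtml) are all assumptions for illustration; the coordinate ranges are chosen to match the sample positions.

```perl
use strict;
use warnings;
use XML::Twig;

# Trimmed copy of the sample output (XML declaration and DOCTYPE omitted).
my $xml = <<'XML';
<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1100" width="850">
<text top="186" left="115" width="103" height="18" font="0">ROUTE TO:</text>
<text top="186" left="265" width="107" height="17" font="0">Audit Billing</text>
<text top="220" left="115" width="128" height="18" font="0">SORT GROUP:</text>
<text top="220" left="265" width="152" height="18" font="0">Invoice Sort Group</text>
</page>
</pdf2xml>
XML

my %fields;    # hypothetical place to collect the extracted values

# Each handler is called as handler($twig, $element) when its trigger matches.
sub RouteTo     { my ($twig, $elt) = @_; $fields{route_to}     = $elt->text; }
sub InvoiceSort { my ($twig, $elt) = @_; $fields{invoice_sort} = $elt->text; }

my $t = XML::Twig->new(
    twig_handlers => {
        # Windows around the value column (left="265") of each labelled row
        '//text[(@top >= 180 and @top <= 190) and (@left >= 260 and @left <= 270)]' => \&RouteTo,
        '//text[(@top >= 215 and @top <= 225) and (@left >= 260 and @left <= 270)]' => \&InvoiceSort,
    },
);
$t->parse($xml);

print "$_: $fields{$_}\n" for sort keys %fields;
```

Collecting into a hash keeps the handlers tiny; anything heavier (validation, writing to a database) can then happen after the parse has finished.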