Re: PDF content and visuals testing best practices
by ateague (Monk)
on Dec 20, 2013 at 17:48 UTC ( [id://1067966] )
I feel your pain. I have the (mis)fortune of dealing with this on a daily basis at $WORK.
The strategy of using pdftotext.exe to convert the PDF into text... *yuck*. If that works for you, more power to you. I always ended up with inconsistently spaced blobs of text when I first tried that route.

My personal preference is to use pdftohtml.exe. I use the one included in Calibre Portable, since it is actively updated. I use the following command line:

pdftohtml.exe -xml -zoom 1.4 [PDF FILE]

This rips all the text elements into an XML file with attributes for the font, the x/y position on the page, and the text length (-zoom 1.4 makes the positioning units 100 dpi). Here is an example I am currently working with:
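A minimal sketch of the kind of XML that pdftohtml -xml emits (the values here are illustrative, not the original example): each <page> carries its dimensions, and each <text> element records the font index, position, and extent of one run of text.

```xml
<pdf2xml>
  <page number="1" position="absolute" top="0" left="0" width="850" height="1100">
    <fontspec id="0" size="12" family="Helvetica" color="#000000"/>
    <text top="120" left="108" width="220" height="16" font="0">Invoice summary</text>
    <text top="152" left="108" width="180" height="16" font="0">Total: 42.00</text>
  </page>
</pdf2xml>
```

The attribute names (top, left, width, height, font) are what make this format practical for regression testing: both the content and the layout are captured in one comparable structure.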
I can then use XML::Simple to slurp each <page> element into a hash, and use Test::More's eq_hash to compare my extracted data with my reference XML hash.
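The slurp-and-compare step above can be sketched roughly as follows. This is a minimal illustration, not my production code: the XML here is an inline stand-in for the pdftohtml output, and in practice you would read the generated file and a saved reference file instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple qw(XMLin);
use Test::More;

# Inline stand-in for pdftohtml -xml output; hypothetical content.
# In practice, read the freshly generated XML and a reference copy.
my $got_xml = <<'XML';
<pdf2xml>
  <page number="1" width="850" height="1100">
    <text top="120" left="108" width="220" height="16" font="0">Invoice summary</text>
  </page>
</pdf2xml>
XML
my $ref_xml = $got_xml;    # pretend the reference matches

# ForceArray keeps <page>/<text> as arrays even when only one occurs;
# KeyAttr => [] stops XML::Simple from folding attributes into keys.
my %opts = ( ForceArray => [ 'page', 'text' ], KeyAttr => [] );
my $got = XMLin( $got_xml, %opts );
my $ref = XMLin( $ref_xml, %opts );

# Compare page by page with Test::More's eq_hash.
for my $n ( 0 .. $#{ $ref->{page} } ) {
    ok( eq_hash( $got->{page}[$n], $ref->{page}[$n] ),
        "page $n matches reference" );
}
done_testing();
```

Comparing page-by-page rather than whole-document keeps the failure diagnostics useful: a layout regression points you at the page that moved.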