PerlMonks  

Re: PDF content and visuals testing best practices

by ateague (Monk)
on Dec 20, 2013 at 17:48 UTC ( [id://1067966]=note )


in reply to PDF content and visuals testing best practices

I feel your pain. I have the (mis)fortune to have to deal with this on a daily basis at $WORK.
The strategy is to use pdftotext.exe to convert PDF into text

*yuck*

If that works, more power to you. I always ended up with inconsistently spaced blobs of text when I tried that route. My personal preference is to use pdftohtml.exe. I use the one included in Calibre Portable since it is actively updated.

I use the following command line: pdftohtml.exe -xml -zoom 1.4 [PDF FILE]

This will rip out all the text elements into an XML file with attributes for the font, x/y position on the page and text length. (-zoom 1.4 makes the positioning units 100 dpi).

Here is an example I am currently working with:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1100" width="850">
    <fontspec id="0" size="17" family="Times" color="#000000"/>
    <text top="103" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
    <text top="120" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
    <text top="186" left="115" width="103" height="18" font="0">ROUTE TO:</text>
    <text top="186" left="265" width="107" height="17" font="0">Audit Billing</text>
    <text top="220" left="115" width="128" height="18" font="0">SORT GROUP:</text>
    <text top="220" left="265" width="152" height="18" font="0">Invoice Sort Group</text>
    <text top="286" left="115" width="260" height="18" font="0">OH_GOD_IT_BURNS 2013-12-20</text>
    <text top="286" left="415" width="71" height="18" font="0">23:53:04</text>
    <text top="286" left="545" width="108" height="18" font="0">FOOBAR</text>
    <text top="320" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
    <text top="336" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
</page>
</pdf2xml>

I can then use XML::Simple to slurp each <page> element into a hash and then use Test::More's eq_hash to compare my extracted data with my reference XML hash.
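
For reference, here is a minimal sketch of that comparison. It is not my production code: the helper names pages_from and pages_match are made up for illustration, and the inline sample XML stands in for real pdftohtml output so the script runs without any files on disk. XMLin accepts either a file name or a string of XML, so the same helpers should work on real output files.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);
use Test::More;

# Parse pdftohtml -xml output (a file name or an XML string) and return
# an array ref holding one hash per <page> element.
# pages_from is a hypothetical helper name, not part of any module.
sub pages_from {
    my ($xml) = @_;
    my $doc = XMLin(
        $xml,
        ForceArray => [qw(page fontspec text)],  # array refs even for a single element
        KeyAttr    => [],                        # do not fold <fontspec id="..."> by id
    );
    return $doc->{page} || [];
}

# True if both documents have the same number of pages and every page
# hash is deeply equal to its counterpart (hypothetical helper).
sub pages_match {
    my ($reference_xml, $extracted_xml) = @_;
    my $want = pages_from($reference_xml);
    my $got  = pages_from($extracted_xml);
    return 0 unless @$want == @$got;
    for my $i (0 .. $#$want) {
        return 0 unless eq_hash($got->[$i], $want->[$i]);
    }
    return 1;
}

# Tiny inline sample standing in for a reference XML file.
my $reference = <<'XML';
<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1100" width="850">
<fontspec id="0" size="17" family="Times" color="#000000"/>
<text top="186" left="115" width="103" height="18" font="0">ROUTE TO:</text>
<text top="186" left="265" width="107" height="17" font="0">Audit Billing</text>
</page>
</pdf2xml>
XML

# Simulate a regression in the generated PDF by changing one text element.
(my $changed = $reference) =~ s/Audit Billing/Audit Shipping/;

ok(  pages_match($reference, $reference), 'identical output matches the reference' );
ok( !pages_match($reference, $changed),   'changed text is detected' );
done_testing();
```

Since eq_hash only returns true or false, Test::More's is_deeply may be the better choice when you want diagnostics showing where the two structures differ.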

Re^2: PDF content and visuals testing best practices
by ateague (Monk) on Dec 23, 2013 at 18:10 UTC
    Just reposting a PM from andreas1234567 for reference:
    I have trouble running pdftohtml.exe. It complains "freetype.dll" is missing (even though it *is* present in DLL dir)

    Depending on which version of pdftohtml.exe (dynamic vs. static) you run, you may need the following DLLs:

    • freetype.dll
    • jpeg.dll
    • libpng12.dll
    • zlib1.dll

    These DLLs are found in the DLLs/ directory under Calibre Portable/Calibre/. You can do one of two things:

    1. Copy those DLLs into the same directory as pdftohtml.exe
    2. (Temporarily) add the path to the DLL directory to $ENV{PATH} in your script:
      {
          # local restores the original PATH when this block exits
          local $ENV{PATH} = $ENV{PATH} . ";<PATH TO DLLs>;";
          system "pdftohtml.exe", "-xml", "<PDF FILE>";
      }

      This did the trick for me:
      set EXEPATH=C:\Users\%USERNAME%\Calibre Portable\Calibre
      set PATH=%PATH%;%EXEPATH%\DLLS

      Thanks!

      --
      No matter how great and destructive your problems may seem now, remember, you've probably only seen the tip of them.
