http://www.perlmonks.org?node_id=634794

BuddhaLovesPerl has asked for the wisdom of the Perl Monks concerning the following question:

Greetings to all monks,

 

I have the following problem and request your counsel. An external vendor generates over 100 PDF files every Sun. Due to historical problems, the on-call person (me, once every 4 weeks) has to log in at ~1am every Mon and open ~30 random files to make sure they "seem" correct. Once all are "validated", they are printed, and the printouts must arrive by 9am containing "correct" values.

Each file is a report generated by MicroStrategy and contains a number of complex tables of retail data (wtd, mtd, ytd, ly, etc.). I have Perl scripts that count the total number of files, compare current sizes to historical averages, etc., but many “bad” files continue to slip through to eventual Sev 2 ticketdom.

Based on historical posts here, converting PDFs with complex tables to either HTML or TXT looks neither pretty nor advisable (side note: these posts are from 2002-2004). To this novice, a good solution would be to parse the PDF file directly. Hence my questions:

1) Are there “open source” methods to run regexps against PDF files?

2) If not, do current PDF to TXT/HTML converters handle complex tables better?

3) If not, would converting PDF to DOC and then using DOC parsers be a practical and advisable solution?

 

Thank you,

--BLP

Replies are listed 'Best First'.
Re: How to parse PDF
by moritz (Cardinal) on Aug 24, 2007 at 07:36 UTC
    The simple answer is you have to try it.

    Run your PDF through the pdftotext tool (on Ubuntu, in the poppler-utils package) and see whether the output is parsable. That doesn't take very long; you can test it literally in two minutes.
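
    A minimal sketch of that test, not a finished validator: it assumes pdftotext is on the PATH, and that the rows of interest come out as a label followed by purely numeric columns separated by two or more spaces. The sub names (pdf_to_text, numeric_rows) and that row layout are assumptions for illustration; the real MicroStrategy output may look quite different.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run pdftotext with -layout (keeps table columns aligned) and read the
# text from its stdout; the trailing "-" tells pdftotext to write to stdout.
sub pdf_to_text {
    my ($pdf) = @_;
    open(my $fh, '-|', 'pdftotext', '-layout', $pdf, '-')
        or die "cannot run pdftotext: $!";
    local $/;    # slurp the whole output at once
    my $text = <$fh>;
    close($fh) or die "pdftotext failed on $pdf (exit status $?)";
    return $text;
}

# Hypothetical row format: "Label   1,234.50   98,765.00".
# Returns a hash ref of label => [numeric columns, commas removed].
sub numeric_rows {
    my ($text) = @_;
    my %rows;
    for my $line (split /\n/, $text) {
        $line =~ s/^\s+|\s+$//g;
        my ($label, @cols) = split /\s{2,}/, $line;
        next unless defined $label && @cols;
        # Skip the row unless every column looks like a plain number.
        next if grep { !/^-?\d[\d,]*(?:\.\d+)?$/ } @cols;
        tr/,//d for @cols;
        $rows{$label} = [@cols];
    }
    return \%rows;
}

# Usage: perl check_pdf.pl report.pdf
if (@ARGV) {
    my $rows = numeric_rows(pdf_to_text($ARGV[0]));
    for my $label (sort keys %$rows) {
        print "$label: @{ $rows->{$label} }\n";
    }
}
```

    If the printed rows line up with what you see in the PDF, the values can then be checked against whatever rules define a "correct" report.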

    Take a look at PDF::Parse and PDF and see if they help you.

    But in principle it is much easier to validate the data before it is put into a PDF. Have you tried asking the external vendor whether they could provide the same data in a more easily accessible format?

      Excellent utility, especially when using the -layout option. Well done!
      Great tip on pdftotext, thank you!
Re: How to parse PDF
by andreas1234567 (Vicar) on Aug 24, 2007 at 08:11 UTC
    An external vendor generates over 100 PDF files every Sun.
    The best long-term solution could be to convince the vendor to export the data in a format better suited to parsing than PDF, such as XML or CSV. A PDF file is often a combination of vector graphics, text, and raster graphics, which makes it significantly harder to parse than a markup language.
    --
    Andreas
      Thanks to all of the public and private suggestions. I will try each and report back on which ones worked for future leverage.
Re: How to parse PDF
by CountZero (Bishop) on Aug 24, 2007 at 06:07 UTC
    Did you try PDF::OCR::Thorough? It claims to be able to extract text from all types of PDF files. I have not tried it yet, as I do not have the Tesseract OCR engine installed.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James