http://www.perlmonks.org?node_id=630509

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I have a bunch of PDFs that were originally created in MS Word, printed, scanned and saved in PDF format. Now I need to run through those files, parse their text and single out all the files that match some regexp. What is the best way to do it? Thanks for your help.

Replies are listed 'Best First'.
Re: parse content of PDF file
by marto (Cardinal) on Aug 03, 2007 at 13:55 UTC
    Had they been converted to PDF via Acrobat (or suchlike) rather than scanned as images, I would have suggested looking at CAM::PDF. However, I think you are going to have to OCR each page of each document, since IIRC there won't be any (meaningful) text to parse within the PDF. You may want to start by looking at PDF::OCR (which IIRC uses Tesseract), or some other OCR module from CPAN.
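
For anyone who wants to see what "OCR each page" might look like in practice, here is a minimal sketch that shells out to external tools rather than using PDF::OCR. It assumes Poppler's pdftoppm and the tesseract command-line program are installed and on the PATH. The 300 dpi resolution and the helper names (rasterize_cmd, ocr_cmd, ocr_pdf) are illustrative choices, not anything taken from a CPAN module.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Command that renders every page of $pdf to PNG files named "$prefix-N.png".
# Assumption: pdftoppm (Poppler) is installed; 300 dpi is an arbitrary choice.
sub rasterize_cmd {
    my ($pdf, $prefix) = @_;
    return ('pdftoppm', '-r', '300', '-png', $pdf, $prefix);
}

# Command that OCRs one image; the output base "stdout" makes tesseract
# print the recognized text instead of writing a .txt file.
sub ocr_cmd {
    my ($image) = @_;
    return ('tesseract', $image, 'stdout');
}

# Rasterize a scanned PDF and return the OCR text of all pages, in page order.
sub ocr_pdf {
    my ($pdf) = @_;
    my $dir = tempdir(CLEANUP => 1);
    system(rasterize_cmd($pdf, "$dir/page")) == 0
        or die "pdftoppm failed for $pdf\n";

    my $text = '';
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for my $png (sort glob("$dir/page*.png")) {
        open my $fh, '-|', ocr_cmd($png)
            or die "cannot run tesseract: $!\n";
        my $page = do { local $/; <$fh> };
        $text .= $page if defined $page;
        close $fh or die "tesseract failed on $png\n";
    }
    return $text;
}

# Usage: perl ocr_pdf.pl scan1.pdf scan2.pdf ...
# Writes scan1.pdf.txt, scan2.pdf.txt, ... next to the inputs.
for my $pdf (@ARGV) {
    open my $out, '>', "$pdf.txt" or die "cannot write $pdf.txt: $!\n";
    print {$out} ocr_pdf($pdf);
    close $out;
}
```

Keeping the OCR text in one .txt file per PDF means the slow step runs only once, and any later regexp search works on plain text.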

    Check out the code.google page for tesseract-ocr

    Update: Added link to tesseract-ocr

    Hope this helps

    Martin
      Cool! There is some software out there for OCR! I'm going to check it out myself! :)
Re: parse content of PDF file
by archfool (Monk) on Aug 03, 2007 at 13:50 UTC
    If there were any reasonable way to do it, the software would cost a lot. The key word here is _scanned_. This means Optical Character Recognition (OCR), which is a very imperfect science at the moment. You will need OCR software, and there's very little free OCR software out there, let alone any Perl bindings to it.

    You'll need to convert the PDFs to text with some OCR software FIRST. THEN running Perl against the text will be easy.
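
The "easy" second step could look roughly like the sketch below. It assumes the OCR output has already been saved as one plain-text file per PDF. The function name files_matching and the command-line layout are made up for illustration.

```perl
use strict;
use warnings;

# Return the names of the files whose full content matches $re.
# Each file is slurped whole so a pattern can span line breaks.
sub files_matching {
    my ($re, @files) = @_;
    my @hits;
    for my $file (@files) {
        my $fh;
        unless (open $fh, '<', $file) {
            warn "skipping $file: $!\n";
            next;
        }
        my $text = do { local $/; <$fh> };
        close $fh;
        push @hits, $file if defined $text && $text =~ $re;
    }
    return @hits;
}

# Usage: perl grep_ocr.pl 'pattern' file1.txt file2.txt ...
if (@ARGV >= 2) {
    my ($pattern, @files) = @ARGV;
    print "$_\n" for files_matching(qr/$pattern/i, @files);
}
```

Since OCR output tends to be noisy, a forgiving pattern is probably wise, for example matching case-insensitively and using \s+ between words instead of a literal space.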