PerlMonks  

Extracting content text from PDFs

by pat_mc (Pilgrim)
on Apr 01, 2008 at 15:36 UTC

pat_mc has asked for the wisdom of the Perl Monks concerning the following question:

Hi All -

I am trying to extract content such as the document title or the body text from PDF files (ultimately hoping to search or categorise my collection of PDFs). So far, I have attempted to parse the PDF source file with regular expressions. While PDF section titles often come with the tag /Title, this is not always the case, so it is not a reliable basis for parsing the PDF.

Do you know of any reliable Perl approaches (e.g. suitable modules) for handling PDFs?

Thanks in advance for your help!

Cheers -

Pat

Replies are listed 'Best First'.
Re: Extracting content text from PDFs
by marto (Cardinal) on Apr 01, 2008 at 15:48 UTC
      marto -

      Thanks for your extremely helpful post ... and apologies for not having responded to it any earlier. My experience was exactly the one clinton describes in the thread you reference: modules like CAM-PDF only produce mildly helpful output. I am very grateful for the reference to the Linux tool pdftotext. With the option -htmlmeta it produces extremely useful, tagged output from a given PDF. This is precisely what I have been looking for for a long time. I will focus my efforts on this utility from now on.

      Thanks again!

      Pat
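A minimal sketch of how the pdftotext -htmlmeta route could be driven from Perl. It assumes that pdftotext is on the PATH and that its -htmlmeta output carries a <title> element plus <meta name="..." content="..."> lines in the head. The function names parse_htmlmeta and pdf_meta are made up for this illustration, and HTML entities in the values are left undecoded.

```perl
use strict;
use warnings;

# Pull the <title> and the <meta name=... content=...> pairs out of the
# HTML written by `pdftotext -htmlmeta`. Regexes are enough here only
# because that output is small and regular; this is not an HTML parser.
sub parse_htmlmeta {
    my ($html) = @_;
    my %meta;
    if ($html =~ m{<title>(.*?)</title>}is) {
        $meta{Title} = $1;
    }
    while ($html =~ m{<meta\s+name="([^"]+)"\s+content="([^"]*)"}ig) {
        $meta{$1} = $2;
    }
    return \%meta;
}

# Run pdftotext on one file and return its metadata as a hash reference.
# The trailing "-" asks pdftotext to write to standard output.
sub pdf_meta {
    my ($path) = @_;
    open(my $fh, '-|', 'pdftotext', '-htmlmeta', $path, '-')
        or die "cannot run pdftotext: $!";
    local $/;                      # slurp the whole output at once
    my $html = <$fh>;
    close $fh or die "pdftotext failed on $path";
    return parse_htmlmeta($html);
}

# Hypothetical usage: print whatever metadata each named PDF carries.
for my $file (@ARGV) {
    my $meta = pdf_meta($file);
    print "$file\n";
    print "  $_: $meta->{$_}\n" for sort keys %$meta;
}
```

Keeping the parsing separate from the call to pdftotext means the regexes can be checked against a saved sample of the output without needing a PDF at hand.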
Re: Extracting content text from PDFs
by leocharre (Priest) on Apr 01, 2008 at 16:34 UTC
    Funny, I was just updating PDF::OCR. Let me update the packages first: I have PDF::GetImage and Image::OCR::Tesseract to update, then PDF::OCR.

    It's pretty well tested, I use it a lot at work. If other people were to use it, I could get technical feedback to make it better.

    I also have an indexer that records all text content, plus an interface to search it. You can scan in a million documents and search their text; the search reports the file location, page, and line number of each hit. That part is a little more complex, because indexing has to run in parallel on multiple CPUs; otherwise it would take 30 days for 60k docs.

    update

    Make sure to read the README; it has further notes and help with installing tesseract. I suggest you check out the packages individually instead of using cpan.

    The whole thing works like a marvel. Take a look at the INSTALL help file, if you need help just email me per instructions.

Re: Extracting content text from PDFs
by alexm (Chaplain) on Apr 01, 2008 at 16:11 UTC
Re: Extracting content text from PDFs
by traveler (Parson) on Apr 01, 2008 at 18:52 UTC
    PDF::API2 has a nice little hash with the document info. That makes it easy to put into a database or use otherwise. I've used it with great success to get the info similar to what you are planning.

    HTH, --traveler

      Hi, traveler -

      Thanks for your suggestion. I have tried the module you suggest ... but unfortunately to no avail. Apart from the fact that it only extracted a fraction of the relevant document information, its main drawback was that the stringify method only produced a load of gibberish that flickered across my screen with plenty of beeps. Any idea why this is?

      I also wonder whether this module is limited by the way a PDF was generated. Can it only handle PDFs which were generated by a certain application or with certain parameters?

      Thanks for your help nonetheless and cheers from Hamburg -

      Pat
        If there are limits to which PDFs work and which don't, I have not run into them :)
        I have not seen stringify send garbage to the output unless I tried to display a picture. For real text, it seemed to work just fine. I have no idea about those problems, as it has worked for the uses to which I have put it.
        Sorry.
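
A minimal sketch of reading the document info hash that traveler mentions, assuming PDF::API2 is installed and that its info() accessor returns the info dictionary as a plain hash (check the documentation for your installed version). The helper name pdf_info is made up for this illustration.

```perl
use strict;
use warnings;
use PDF::API2;

# Return the document information dictionary of a PDF as a hash reference.
# Only the fields the PDF actually carries will be present; many files
# have no Title or Author set at all.
sub pdf_info {
    my ($path) = @_;
    my $pdf  = PDF::API2->open($path);
    my %info = $pdf->info();
    return \%info;
}

# Hypothetical usage: list the info fields of each PDF named on the
# command line, ready to be stored in a database or index.
for my $file (@ARGV) {
    my $info = pdf_info($file);
    print "$file\n";
    for my $key (sort keys %$info) {
        my $value = defined $info->{$key} ? $info->{$key} : '';
        print "  $key: $value\n";
    }
}
```

As far as the module's documentation goes, stringify returns the serialized PDF document itself rather than its extracted text, which would fit the report of binary-looking output and terminal beeps.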
