
Re: Extracting content text from PDFs

by traveler (Parson)
on Apr 01, 2008 at 18:52 UTC (#677806)

in reply to Extracting content text from PDFs

PDF::API2 has a nice little hash with the document info. That makes it easy to put into a database or use otherwise. I've used it with great success to get the info similar to what you are planning.
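A minimal sketch of reading that info hash, assuming PDF::API2 is installed and a PDF path is passed on the command line. The helper name format_info is made up for illustration. The info method used here matches the older PDF::API2 interface; newer releases document the same data under info_metadata, so treat the exact method name as an assumption to check against your installed version.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the info hash into "Key: value" lines,
# sorted by key so the output is stable.
sub format_info {
    my (%info) = @_;
    return map { "$_: " . ($info{$_} // '') } sort keys %info;
}

if (@ARGV) {
    # Loaded only when a file is actually given.
    require PDF::API2;

    my $pdf  = PDF::API2->open($ARGV[0]);
    # Assumption: info() returns the document info as a hash
    # (Title, Author, Producer, ...) in this version of the module.
    my %info = $pdf->info();

    print "$_\n" for format_info(%info);
}
```

The formatted lines could just as well be fed to a database insert instead of being printed.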

HTH, --traveler

Re^2: Extracting content text from PDFs
by pat_mc (Pilgrim) on Apr 04, 2008 at 10:06 UTC
    Hi, traveler -

    Thanks for your suggestion. I tried the module you suggested, but unfortunately to no avail. Apart from the fact that it extracted only a fraction of the relevant document information, its main drawback was that the stringify method produced nothing but a load of gibberish that flickered across my screen with plenty of beeps. Any idea why this is?

    I also wonder what limitations this module is subject to with regard to how a PDF was generated. Can it only handle PDFs that were generated by a certain application or with certain parameters?

    Thanks for your help nonetheless and cheers from Hamburg -

      If there are limits to what PDFs work and what don't I have not run into them :)
      I have not seen stringify send garbage to the output unless I tried to display a picture. For real text, it seemed to work just fine. I have no idea about those problems as it has worked for the uses to which I have put it.
        If there are limits to what PDFs work and what don't I have not run into them :)

        You've just been lucky so far :)

        Some PS/PDF tools use font subsetting/re-encoding techniques which (when done in a certain way) can make automatic text extraction very hard. (I've tried to explain the method in more detail in another thread.)

        To illustrate, here's a sample PDF which you can view with Adobe Reader, xpdf, Ghostscript, etc. without problems (you should see the standard "lorem ipsum" text). Any attempt to extract the textual content will likely fail, however, although the file is a perfectly valid PDF containing nothing but regular text content (no images, no encryption, no other tricks) with all characters being part of the ASCII set.

        Of course, I deliberately created the file in the above-mentioned way (as a "proof of concept"), but there are actually PDF creation tools out there which do produce such problematic PDFs.
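The re-encoding idea mentioned above can be illustrated with a toy model in plain Perl. This is only a sketch under stated assumptions: it is not real PDF font machinery and not necessarily the method used for the sample file. All sub names are hypothetical, and the numbering scheme (codes 1, 2, 3, ... in order of first appearance) is just one way a subset font might assign codes.

```perl
use strict;
use warnings;

# Toy "subset font": each distinct character gets a fresh code
# (1, 2, 3, ...) in order of first appearance in the text.
sub build_subset_encoding {
    my ($text) = @_;
    my (%code_for, @glyph_for);
    for my $ch (split //, $text) {
        next if exists $code_for{$ch};
        push @glyph_for, $ch;
        $code_for{$ch} = scalar @glyph_for;    # codes start at 1
    }
    return (\%code_for, \@glyph_for);
}

# What ends up in the page content: a string of re-assigned codes.
sub encode_text {
    my ($text, $code_for) = @_;
    return join '', map { chr($code_for->{$_}) } split //, $text;
}

# A viewer looks each code up in the embedded font and draws its glyph.
sub render {
    my ($stream, $glyph_for) = @_;
    return join '', map { $glyph_for->[ ord($_) - 1 ] } split //, $stream;
}

# A naive extractor assumes the codes are ASCII and passes them through.
sub naive_extract {
    my ($stream) = @_;
    return $stream;
}

my $text = 'lorem ipsum';
my ($code_for, $glyph_for) = build_subset_encoding($text);
my $stream = encode_text($text, $code_for);

# prints "rendered:  lorem ipsum"
print 'rendered:  ', render($stream, $glyph_for), "\n";

# prints "extracted: 01 02 03 04 05 06 07 08 09 0A 05"
print 'extracted: ',
    join(' ', map { sprintf('%02X', ord($_)) } split //, naive_extract($stream)),
    "\n";
```

The viewer path still shows the right text because it goes through the font's glyph table, while the extractor sees only low control codes. Unless the file also carries a mapping from codes back to characters, an extraction tool has nothing to recover the original text from.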
