PerlMonks  

Re^3: Extracting content text from PDFs

by traveler (Parson)
on Apr 15, 2008 at 16:40 UTC (#680571)


in reply to Re^2: Extracting content text from PDFs
in thread Extracting content text from PDFs

If there are limits to what PDFs work and what don't I have not run into them :)
I have not seen stringify produce garbage output unless I tried to display a picture. For real text, it seemed to work just fine. I have no idea what causes those problems, as it has worked for everything I have used it for.
Sorry.


Re^4: Extracting content text from PDFs
by almut (Canon) on Apr 16, 2008 at 03:32 UTC
    If there are limits to what PDFs work and what don't I have not run into them :)

    You've just been lucky so far :)

    Some PS/PDF tools are using font subsetting/re-encoding techniques which (when done in a certain way) can make automatic text extraction very hard. (I've tried to explain the method in more detail in another thread.)

    To illustrate, here's a sample PDF which you can view with Adobe Reader, xpdf, Ghostscript, etc. without problems (you should see the standard "lorem ipsum" text). Any attempt to extract the textual content will likely fail, however, although the file is a perfectly valid PDF containing nothing but regular text content (no images, no encryption, no other tricks) with all characters being part of the ASCII set.

    Of course, I deliberately created the file in the above-mentioned way (as a "proof of concept"), but there are actually PDF creation tools out there which do produce such problematic PDFs.
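
To make the failure mode concrete, here is a toy simulation in Python. It does not parse real PDFs and is not the method from the other thread; it only assumes that the problem looks roughly like this: the embedded subset font assigns arbitrary byte codes to the glyphs it contains, the viewer looks glyphs up by code and so displays the right text, and an extractor that has no usable code-to-Unicode mapping (in a PDF that would be a ToUnicode CMap) falls back to treating the codes as a standard encoding. All function names are hypothetical.

```python
def build_subset_font(text, first_code=0x21):
    """Give each distinct character an arbitrary one-byte code.

    Assumption: codes are handed out in order of first appearance,
    starting at 0x21. A real tool may pick codes differently.
    """
    char_to_code = {}
    for ch in text:
        if ch not in char_to_code:
            char_to_code[ch] = first_code + len(char_to_code)
    return char_to_code


def encode(text, char_to_code):
    """Write the text the way the content stream would store it: as codes."""
    return bytes(char_to_code[ch] for ch in text)


def render(content, code_to_glyph):
    """What a viewer does: look up each code in the embedded font."""
    return "".join(code_to_glyph[code] for code in content)


def naive_extract(content):
    """What an extractor does without a code-to-Unicode map:
    assume the codes are ordinary Latin-1/ASCII characters."""
    return content.decode("latin-1")


if __name__ == "__main__":
    text = "Lorem ipsum dolor sit amet"
    font = build_subset_font(text)
    content = encode(text, font)
    glyphs = {code: ch for ch, code in font.items()}

    print("viewer shows:  ", render(content, glyphs))
    print("extractor gets:", naive_extract(content))
```

The viewer path recovers the original text because it uses the font's own code-to-glyph table. The extractor path returns a string of the same length made of punctuation characters, since codes 0x21 and up were never meant to be read as ASCII. An extractor could only undo this with a mapping supplied in the file, or by recognising the glyph shapes themselves.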
