Re: PDF Text

by hesco (Deacon)
on Jun 13, 2008 at 02:24 UTC ( #691834=note )

in reply to PDF Text

I've not used it, but will underscore the recommendation for swish-e, based on what I've heard about it.

But to answer your specific question, I use pdftotext to extract the ASCII text from a compliant PDF file. It's a command-line tool (run from a shell such as bash) which is distributed with the xpdf reader application in many Linux distributions. It won't work on scanned images (for which that PDF::OCR sounds particularly interesting; I'll have to check that out, ++ and thanks!). But for folks who export editable documents to PDF, it works like a charm (though it is challenged a bit by multi-column content).
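For what it's worth, here is a minimal sketch of how that might look in a bash script. The function name extract_pdf_text and its return codes are my own invention for illustration, not anything pdftotext provides; the only real pieces are the pdftotext command itself and its -layout option. It assumes a scanned, image-only PDF produces an output file with no alphanumeric characters, which is how you would spot the files that need OCR instead.

```shell
#!/usr/bin/env bash
# Sketch only: extract_pdf_text is a hypothetical helper, not part of xpdf.

extract_pdf_text() {
    local pdf="$1"
    local txt="${pdf%.pdf}.txt"   # report.pdf -> report.txt

    # -layout asks pdftotext to keep the physical page layout.
    # Try with and without it on multi-column documents.
    pdftotext -layout "$pdf" "$txt" || return 1

    # Assumption: a PDF with no text layer (e.g. a scanned image)
    # yields output containing no letters or digits.
    if ! grep -q '[[:alnum:]]' "$txt"; then
        echo "no text layer found in $pdf (scanned image?)" >&2
        return 2
    fi

    # Print the name of the text file that was written.
    echo "$txt"
}
```

Usage would be something like `extract_pdf_text report.pdf`, which writes report.txt next to the original and prints its name. A return code of 2 flags a file worth handing to an OCR step.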

-- Hugh

if( $lal && $lol ) { $life++; }

Replies are listed 'Best First'.
Re^2: PDF Text
by leocharre (Priest) on Jun 13, 2008 at 13:38 UTC

    Something really interesting happened at my office.

    We scan in a lot of documents. Now, the machines *are* able to embed OCR text in the PDF documents they create. This makes indexing the documents relatively easy.

    BUT, guess what! They don't want to use the scanner's OCR tech, because they say it slows down scanning! And, well, for five pages, who cares? But for 200-page documents?

    They have a point.

    So I have my thing run at night to do the OCR, collect info, etc.
    That's why I needed muscle.
