Re: PDF Text

by hesco (Deacon)
on Jun 13, 2008 at 02:24 UTC

in reply to PDF Text

I've not used it, but will underscore the recommendation for swish-e, based on what I've heard about it.

But to answer your specific question, I use pdftotext to extract the plain text from a compliant PDF file. It's a command-line tool which is distributed with the xpdf reader application in many Linux distributions. It won't work on scanned images (for which that PDF::OCR sounds particularly interesting; I'll have to check that out, ++ and thanks!). But for folks who export editable documents to PDF, it works like a charm (though it is challenged a bit by multi-column content).
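For anyone who wants to try it, here is a minimal sketch of how I'd wrap it in a shell script. The function names (txt_name_for, pdf_to_text) are made up for this example and are not part of pdftotext; the -layout option is a real pdftotext flag, but whether it helps or hurts on multi-column pages depends on the document, so treat it as something to experiment with.

```shell
#!/bin/sh
# Sketch: extract text from a PDF with pdftotext (ships with xpdf or
# poppler-utils). Function names are hypothetical, chosen for this example.

# Derive the output name: report.pdf -> report.txt
txt_name_for() {
    printf '%s\n' "${1%.pdf}.txt"
}

# Convert one PDF; returns non-zero if pdftotext is missing or fails.
pdf_to_text() {
    command -v pdftotext >/dev/null 2>&1 || {
        echo "pdftotext not found; install xpdf or poppler-utils" >&2
        return 127
    }
    # -layout keeps the physical page layout; drop it to get
    # pdftotext's default reading order instead.
    pdftotext -layout "$1" "$(txt_name_for "$1")"
}
```

Calling pdf_to_text report.pdf should leave the text in report.txt. To look at the output without writing a file, pdftotext report.pdf - sends the text to stdout.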

-- Hugh

if( $lal && $lol ) { $life++; }

Re^2: PDF Text
on Jun 13, 2008 at 13:38 UTC

    Something really interesting happened at my office.

    We scan in a lot of documents. Now, the machines *are* able to embed OCR text in the PDF documents they create. This makes indexing the documents relatively easy.

    BUT, guess what! They don't want to use the scanner's OCR tech, because they say it slows down scanning! And, well, for five pages, who cares? But for 200-page documents???

    They have a point.

    So I have my thing run at night to collect info, etc.
    That's why I needed muscle.

