PerlMonks  

Re: Extracting content text from PDFs

by leocharre (Priest)
on Apr 01, 2008 at 16:34 UTC


in reply to Extracting content text from PDFs

Funny, I was just updating PDF::OCR. Let me update the packages first: I have PDF::GetImage and Image::OCR::Tesseract to update, then PDF::OCR itself.

It's pretty well tested; I use it a lot at work. If other people used it, I could get technical feedback to make it better.

I also have an indexer that records all text content, and an interface to search it. So you can have a million docs scanned in and search their text content; it then tells you the file location, the page, and the line number. That part is a little more complex, because indexing has to be done in parallel on multiple CPUs; otherwise it would take 30 days for 60k docs.
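To make the idea concrete, here is a minimal sketch of that kind of index. It is not the actual indexer: it is written in Python for brevity, it assumes the OCR step has already produced one text string per page, and every name in it (`index_document`, `build_index`, `search`, the sample paths) is made up for illustration.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def index_document(item):
    """Return (word, path, page_no, line_no) postings for one document.

    `item` is a (path, pages) pair, where `pages` is a list of page texts.
    That input shape is an assumption; real OCR output may look different.
    """
    path, pages = item
    postings = []
    for page_no, page_text in enumerate(pages, start=1):
        for line_no, line in enumerate(page_text.splitlines(), start=1):
            # set() so a word repeated on one line is recorded only once
            for word in set(re.findall(r"\w+", line.lower())):
                postings.append((word, path, page_no, line_no))
    return postings


def build_index(docs, workers=4):
    """Index all documents in parallel and merge the results in one place.

    Threads are enough when the heavy work happens in an external program;
    for CPU-bound work inside Python, a ProcessPoolExecutor could be used.
    """
    index = defaultdict(list)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for postings in pool.map(index_document, docs.items()):
            for word, path, page_no, line_no in postings:
                index[word].append((path, page_no, line_no))
    return index


def search(index, word):
    """Return sorted (path, page_no, line_no) hits for a single word."""
    return sorted(index.get(word.lower(), []))


# Hypothetical sample data standing in for OCR output.
docs = {
    "scans/invoice.pdf": ["Invoice 42\nTotal due: 100", "Thank you"],
    "scans/letter.pdf": ["Dear customer\nYour invoice is attached"],
}
index = build_index(docs)
hits = search(index, "invoice")
for path, page_no, line_no in hits:
    print(f"{path}: page {page_no}, line {line_no}")
```

Each document is processed independently and only the merge is serial, which is what lets the work spread across workers. For scale: 30 days for 60k docs is about 43 seconds per document, so with N workers sharing the load evenly the ideal wall time would be roughly 30/N days.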

update

Make sure to see the README; it has other notes and tesseract install help. I suggest you check out the packages individually instead of using cpan.

The whole thing works like a marvel. Take a look at the INSTALL help file; if you need help, just email me per the instructions.
