PerlMonks  

Re: parse content of PDF file

by marto (Bishop)
on Aug 03, 2007 at 13:55 UTC (#630513)


in reply to parse content of PDF file

Had they been converted to PDF via Acrobat (or suchlike) rather than scanned images, I would have suggested looking at CAM::PDF. However, I think you are going to have to OCR each page of each document, since IIRC there won't be any (meaningful) text to parse within the PDF. You may want to start by looking at PDF::OCR (which IIRC uses Tesseract), or some other OCR module from CPAN.
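If you want to confirm that before committing to OCR, here is a rough, untested sketch of how you might check each page for extractable text with CAM::PDF. The helper names (has_meaningful_text, pages_needing_ocr) and the five-word threshold are my own assumptions, not anything CAM::PDF provides, so tune them against your documents. The OCR step itself is left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed threshold: a page with fewer word-like tokens than this is
# treated as having no meaningful text. Purely a guess, adjust to taste.
use constant MIN_WORDS => 5;

# Hypothetical helper: does the string extracted from a page look like
# real text? Counts runs of three or more ASCII letters as "words".
sub has_meaningful_text {
    my ($text) = @_;
    return 0 unless defined $text;
    my $words = () = $text =~ /[A-Za-z]{3,}/g;
    return $words >= MIN_WORDS ? 1 : 0;
}

# Hypothetical helper: return the page numbers of $file that appear to
# contain no extractable text and so would need OCR.
sub pages_needing_ocr {
    my ($file) = @_;
    require CAM::PDF;    # loaded lazily; only needed when a PDF is read
    my $pdf = CAM::PDF->new($file)
        or die "Cannot open $file as a PDF\n";
    my @todo;
    for my $page (1 .. $pdf->numPages()) {
        my $text = $pdf->getPageText($page);
        push @todo, $page unless has_meaningful_text($text);
    }
    return @todo;
}

# Run only when invoked directly with a file name argument.
if (!caller && @ARGV) {
    my @todo = pages_needing_ocr($ARGV[0]);
    print @todo
        ? "Pages needing OCR: @todo\n"
        : "All pages appear to contain extractable text\n";
}

1;
```

Run it as perl check_pdf.pl scanned.pdf (both names are placeholders). For a PDF made purely from scanned images I would expect every page to be reported, in which case OCR is the only route.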

Check out the code.google page for tesseract-ocr

Update: Added link to tesseract-ocr

Hope this helps

Martin

Replies are listed 'Best First'.
Re^2: parse content of PDF file
by archfool (Monk) on Aug 03, 2007 at 14:07 UTC
    Cool! There is some software out there for OCR! I'm going to check it out myself! :)
