PerlMonks  

PDF Modules Seeking Recommendations

by mitd (Curate)
on Nov 23, 2006 at 20:54 UTC (#585782, perlquestion)
mitd has asked for the wisdom of the Perl Monks concerning the following question:

I am about to start a project that involves parsing PDF documents and extracting information from them, and it has been a while since I visited the CPAN PDF module pool.

Any recommendations from the monks?

mitd-Made in the Dark
I've always been astonished by the absurd turns
rivers have to make to flow under every bridge.

Re: PDF Modules Seeking Recommendations
by marto (Chancellor) on Nov 24, 2006 at 10:13 UTC
    mitd,

    In the past I have used CAM::PDF to deal with PDF files. Which module fits best depends on exactly what information you wish to parse or extract, so feel free to provide detailed examples of what you are trying to achieve. Check out the documentation and examples that are provided with CAM::PDF.

    Hope this helps,

    Martin
Re: PDF Modules Seeking Recommendations
by tbone1 (Monsignor) on Nov 24, 2006 at 12:46 UTC
    It's not part of CPAN, but on the recommendations of several monks, I had a lot of success using pdftotext, part of the Xpdf open source project. It allows you to extract the text from a PDF into plain ASCII.

    Your mileage may vary, of course, depending on what you want/need to do, but if you are doing text extractions, I heartily recommend pdftotext.
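
A minimal sketch of driving pdftotext from a script, for anyone who wants to try the approach above. It assumes the pdftotext binary is installed and on the PATH; the helper names `pdftotext_command` and `extract_text` are made up for illustration. It relies on the `-layout` and `-enc` options and on `-` as the output file name to send the text to stdout.

```python
import subprocess


def pdftotext_command(pdf_path, layout=True):
    """Build the pdftotext argument list (hypothetical helper).

    '-' as the output name asks pdftotext to write to stdout.
    """
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        # Try to preserve the physical layout of the page.
        cmd.append("-layout")
    cmd.extend([pdf_path, "-"])
    return cmd


def extract_text(pdf_path, layout=True):
    """Run pdftotext and return the extracted text as a string.

    Assumes pdftotext is on PATH; raises CalledProcessError if it fails.
    """
    result = subprocess.run(
        pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_text("report.pdf")` (the file name is a placeholder) would return the whole document as one string, with pages separated by form feed characters.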

    --
    tbone1, YAPS (Yet Another Perl Schlub)
    And remember, if he succeeds, so what.
    - Chick McGee

Re: PDF Modules Seeking Recommendations
by cosimo (Hermit) on Nov 24, 2006 at 17:42 UTC

    Also check out PDF::Reuse. Its source code is quite obscure and binary-stream oriented, but it does what it says. It lets you extract and insert text, images, barcodes, single pages, ...

    It has a procedural interface (many functions and no main object) rather than an object-oriented one.

    In the end I found it didn't suit my needs, and I decided to contribute to PDF::ReportWriter, which does other things.

Re: PDF Modules Seeking Recommendations
by toma (Vicar) on Nov 25, 2006 at 19:24 UTC
    I have used another non-module approach: pdftohtml (http://pdftohtml.sourceforge.net). It translates PDF to XML or HTML. The XML isn't valid, but it is not difficult to fix. This code is also based on Xpdf.

    I like this approach because it gives me a bunch of text box strings with their bounding box coordinates, which I then sort by location. This is important for me because the documents that I parse tend to have an irregular 'document order.'
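
The sorting step described above can be sketched as follows. This is only an illustration, not toma's actual code: it assumes the XML has already been repaired, that it follows the `pdftohtml -xml` shape of `<page>` elements containing `<text top=... left=...>` elements, and the function name `sorted_boxes`, the `line_tolerance` parameter, and the sample data are all made up.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape produced by `pdftohtml -xml`.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="140" left="72" width="200" height="14" font="0">Second line</text>
<text top="100" left="300" width="90" height="14" font="0">right column</text>
<text top="101" left="72" width="120" height="14" font="0"><b>Left</b> column</text>
</page>
</pdf2xml>"""


def sorted_boxes(xml_text, line_tolerance=3):
    """Return text boxes in reading order: by page, then row, then left edge.

    Boxes whose 'top' lies within line_tolerance units of the first box
    in a row are treated as being on the same line.
    """
    root = ET.fromstring(xml_text)
    boxes = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for node in page.iter("text"):
            boxes.append({
                "page": page_no,
                "top": int(node.get("top")),
                "left": int(node.get("left")),
                # itertext() also picks up text inside <b>, <i>, <a>.
                "text": "".join(node.itertext()).strip(),
            })

    boxes.sort(key=lambda b: (b["page"], b["top"], b["left"]))

    ordered, row = [], []
    for box in boxes:
        new_row = row and (
            box["page"] != row[0]["page"]
            or box["top"] - row[0]["top"] > line_tolerance
        )
        if new_row:
            # Flush the finished row, ordered left to right.
            ordered.extend(sorted(row, key=lambda b: b["left"]))
            row = []
        row.append(box)
    ordered.extend(sorted(row, key=lambda b: b["left"]))
    return ordered


if __name__ == "__main__":
    for box in sorted_boxes(SAMPLE):
        print(box["page"], box["top"], box["left"], box["text"])
```

The tolerance matters because boxes on the same visual line often differ by a unit or two in their `top` value; a plain sort on `(top, left)` would interleave columns in that case.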

    I have also found PDF tips and tricks on the mostly commercial http://www.pdfzone.com site.

    It should work perfectly the first time! - toma

Node Type: perlquestion [id://585782]
Approved by Joost
Front-paged by Old_Gray_Bear