PDF Modules Seeking Recommendations

by mitd (Curate)
I am about to start a project that involves parsing and extracting information from PDF documents, and it has been a while since I visited the CPAN PDF module pool.

Any recommendations from the monks?

by marto (Archbishop) on Nov 24, 2006 at 10:13 UTC

    In the past I have used CAM::PDF to deal with PDF files. Which module fits best depends on exactly what information you wish to parse or extract, so feel free to provide detailed examples of what you are trying to achieve. Check out the documentation and examples that are provided with CAM::PDF.

by tbone1 (Monsignor) on Nov 24, 2006 at 12:46 UTC
    It's not part of CPAN, but on the recommendation of several monks, I had a lot of success using pdftotext, part of the Xpdf open source project. It extracts the text of a PDF into a plain-text (ASCII) file.

    Your mileage may vary, of course, depending on what you want/need to do, but if you are doing text extractions, I heartily recommend pdftotext.
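
    As a rough illustration of driving pdftotext from a script, here is a minimal sketch. It assumes pdftotext is installed and on the PATH, and it uses only the -layout, -enc, -f and -l options plus "-" as the output file (meaning standard output). The function names build_pdftotext_command and extract_text are made up for this example.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, layout=True, first_page=None, last_page=None):
    """Build the argument list for pdftotext (hypothetical helper)."""
    cmd = ["pdftotext"]
    if layout:
        # -layout tries to preserve the physical layout of the page.
        cmd.append("-layout")
    # Ask for UTF-8 output so the bytes can be decoded predictably.
    cmd += ["-enc", "UTF-8"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    # A text-file argument of "-" sends the extracted text to stdout.
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, **options):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, **options),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.decode("utf-8", errors="replace")
```

    Calling extract_text("report.pdf", first_page=1, last_page=2) would return the text of the first two pages; "report.pdf" is a placeholder file name.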

by cosimo (Hermit) on Nov 24, 2006 at 17:42 UTC

    Also check out PDF::Reuse. Its source code is quite obscure and binary-stream oriented, but it does what it says. It lets you extract and insert text, images, barcodes, single pages, ...

    It has a procedural interface (many functions and no main object) rather than an object-oriented one.

    In the end I found it didn't suit my needs, and I decided to contribute to PDF::ReportWriter, which does other things.

by toma (Vicar) on Nov 25, 2006 at 19:24 UTC
    I have used another non-module approach: . It translates pdf to XML or HTML. The XML isn't valid, but it is not difficult to fix. This code is also based on xpdf.

    I like this approach because it gives me a bunch of text box strings with their bounding box coordinates, which I then sort by location. This is important for me because the documents that I parse tend to have an irregular 'document order.'
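
    The sorting step described above could look roughly like the sketch below. It assumes the XML has already been repaired into well-formed XML and that each text box is a <text> element carrying numeric top and left attributes, which is the shape pdftohtml's XML output uses; the tool behind this post may differ. The sample data, the function names and the 3-unit line tolerance are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample: boxes appear out of reading order, as in an
# irregular "document order".
SAMPLE_XML = """
<page number="1" height="792" width="612">
  <text top="100" left="300" width="80" height="12" font="0">Total</text>
  <text top="50" left="40" width="120" height="14" font="1"><b>Invoice</b></text>
  <text top="101" left="40" width="60" height="12" font="0">Item</text>
</page>
"""


def parse_text_boxes(xml_source):
    """Collect each <text> element's position and flattened text."""
    root = ET.fromstring(xml_source)
    boxes = []
    for node in root.iter("text"):
        boxes.append({
            "top": float(node.get("top")),
            "left": float(node.get("left")),
            # itertext() also picks up text inside nested tags such as <b>.
            "text": "".join(node.itertext()).strip(),
        })
    return boxes


def reading_order(boxes, line_tolerance=3):
    """Sort boxes top-to-bottom, then left-to-right within each line.

    Boxes whose top coordinate is within line_tolerance of the first box
    on the current line are treated as belonging to that line.
    """
    lines = []
    for box in sorted(boxes, key=lambda b: (b["top"], b["left"])):
        if lines and box["top"] - lines[-1][0]["top"] <= line_tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    return [b for line in lines for b in sorted(line, key=lambda b: b["left"])]


if __name__ == "__main__":
    for box in reading_order(parse_text_boxes(SAMPLE_XML)):
        print(box["top"], box["left"], box["text"])
```

    Grouping by a tolerance rather than sorting on the raw top value keeps boxes that sit a pixel or two apart on the same visual line together; the right tolerance depends on the documents being parsed.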

    I have also found pdf tips and tricks on the mostly commercial site.

