PDF Parsing

by weismat (Friar)
on Nov 28, 2007 at 10:55 UTC (#653511=perlquestion)
weismat has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I would like to parse a rather simple but large PDF file. I can copy and paste the content page by page, so the text is real text and not images. I looked at the PDF::API2 documentation and found it very unwieldy. How would you approach parsing the text content of a PDF document? Do you have any hints on what I should look at? I found a lot of modules for creating PDFs, but nothing for parsing them. Thanks!
Update: I want to stress that no images are involved and that I can use the Windows copy and paste function. For the moment I have implemented an AutoIt solution which creates a text file from around 4000 copy-and-paste operations. I would like to have a clean solution for the future.

Replies are listed 'Best First'.
Re: PDF Parsing
by marto (Bishop) on Nov 28, 2007 at 11:06 UTC
    Hi weismat,

    You may want to have a look at CAM::PDF, which has some pretty good documentation and quite a few examples. If the PDF you are dealing with is made up of images, one per page (I know this is how some scanning software makes PDFs), you may want to read Re: parse content of PDF file, where I briefly mention PDF::OCR and tesseract.

    Hope this helps

Re: PDF Parsing
by dragonchild (Archbishop) on Nov 28, 2007 at 14:38 UTC
    marto is absolutely correct that you want to look at CAM::PDF. Take a look at stvn's Test::PDF for an example of how to use it.
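    For quick orientation before digging into those examples, here is a minimal sketch of page-wise text extraction with CAM::PDF. It relies on CAM::PDF's new, numPages and getPageText methods; the helper names tidy_text and pdf_pages and the whitespace cleanup are assumptions made for illustration, and how faithfully the extracted text matches the clipboard output will depend on the document.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collapse runs of spaces/tabs and trim each line.
# Hypothetical helper; pure string work, no PDF module needed.
sub tidy_text {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/[ \t]+/ /g;     # squeeze horizontal whitespace
    $text =~ s/^ | $//mg;      # drop the single leading/trailing space per line
    return $text;
}

# Return the tidied text of every page, one list entry per page.
sub pdf_pages {
    my ($file) = @_;
    require CAM::PDF;          # loaded lazily, only when a PDF is actually read
    no warnings 'once';        # $CAM::PDF::errstr is mentioned only here
    my $pdf = CAM::PDF->new($file)
        or die "Cannot open $file: $CAM::PDF::errstr\n";
    return map { tidy_text( $pdf->getPageText($_) ) } 1 .. $pdf->numPages();
}

# Script entry point: perl pdf2txt.pl input.pdf > output.txt
if ( my $file = shift @ARGV ) {
    my $n = 0;
    for my $page ( pdf_pages($file) ) {
        $n++;
        print "--- page $n ---\n$page\n";
    }
}
```

    Run against the real file, this would replace the 4000 manual copy-and-paste steps with one loop over the pages, provided getPageText copes with the fonts and layout of the document.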

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: PDF Parsing
by Starky (Chaplain) on Nov 28, 2007 at 15:31 UTC
    I've used PDF::API2, and although it is an excellent parser, for large documents, it can be unwieldy because it seems to load / parse the entire document before you are able to do anything with it, chewing up great heaping gobs of memory in the process.

    This can be problematic if your document is particularly large or if you have a large number of documents to parse.

    (I experienced this issue firsthand when I had to parse and modify thousands of PDF files for a time-critical project and it took quite literally the better part of a weekend with two dedicated laptops churning away 24x7.)

    Do any of the monks who've worked with CAM::PDF know whether it behaves the same way?

      Hi Starky! Can you give me an example of how to parse a PDF with PDF::API2? I want to find XObjects and replace them . . . It would be great to hear from you. Regards Alex
      Hi Starky! Me again, this time as a "known monk", with the same question. Can you give me an example of how to parse a PDF with PDF::API2? I want to find XObjects and replace them . . . It would be great to hear from you. Regards Alex PS: Sorry for my confusing usage of this forum.
        Hi, figuring out how to parse existing PDF files gave me headaches, but after reading PDF::API2::File's perldoc I worked it out. If you do something like my $foo = PDF::API2->open('bar.pdf');, the file structure is stored in $foo->{'pdf'}. From there you have the Catalog (see the PDF specification), which you can walk to get the indirect references to objects (pages and annots or the AcroForm). Once you have a hash referring to the item you want to change, you can use the read_obj method like this:

            my $pdfapi = PDF::API2->open('foo.pdf');
            my $pdf    = $pdfapi->{'pdf'};
            my $object = $pdf->read_obj($indirect_reference_hashref);
Re: PDF Parsing
by runrig (Abbot) on Nov 28, 2007 at 21:37 UTC
      I have tried the non-perl solution and unfortunately the output is different from the output of using the Windows clipboard and a lot more difficult to parse for the content which I need. Thanks for the suggestion anyway.
Re: PDF Parsing
by toma (Vicar) on Nov 30, 2007 at 07:39 UTC
    I have tried this a few different ways, and here is my favorite:

    Use pdftohtml with the -xml option:

    pdftohtml -xml file.pdf

    In pdftohtml-0.36, this produces XML that is not well-formed, but a few regular expressions are enough to repair it. Then use your favorite XML parser to process the result. My favorite XML parser is XML::Twig.

    It should work perfectly the first time! - toma
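
    The pipeline above could look roughly like the sketch below. It assumes that pdftohtml -xml file.pdf has written file.xml, that the output consists of page elements (with a number attribute) containing text elements (with top and left attributes), and that the breakage is limited to bare ampersands. That last point is purely an assumption for illustration; the regular expressions you actually need depend on what your file contains. fix_xml and text_items are hypothetical helper names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape ampersands that do not already start an entity reference.
# Assumed kind of breakage; adapt to what your pdftohtml output really needs.
sub fix_xml {
    my ($xml) = @_;
    $xml =~ s/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;
    return $xml;
}

# Return [page, top, left, text] for every <text> element in the XML string.
sub text_items {
    my ($xml) = @_;
    require XML::Twig;    # loaded lazily, only when XML is actually parsed
    my ( $page, @items ) = (0);
    my $twig = XML::Twig->new(
        # remember the current page number as soon as <page ...> opens
        start_tag_handlers => { page => sub { $page = $_[1]->att('number') } },
        twig_handlers      => {
            text => sub {
                my ( $t, $elt ) = @_;
                push @items,
                    [ $page, $elt->att('top'), $elt->att('left'), $elt->text ];
                $t->purge;    # free what has been parsed so far
            },
        },
    );
    $twig->parse($xml);
    return @items;
}

# Script entry point: perl pdfxml.pl file.xml
if ( my $file = shift @ARGV ) {
    open my $fh, '<', $file or die "Cannot read $file: $!\n";
    my $xml = do { local $/; <$fh> };    # slurp the whole file
    close $fh;
    binmode STDOUT, ':encoding(UTF-8)';
    printf "p%s top=%s left=%s %s\n", @$_ for text_items( fix_xml($xml) );
}
```

    The top and left attributes give the position of each text fragment on the page, which helps when rows or columns have to be reassembled. Purging inside the handler keeps memory use modest on large documents.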

Node Type: perlquestion [id://653511]
Approved by moritz
Front-paged by clinton