PerlMonks  

Re: PDF Parsing

by Starky (Chaplain)
on Nov 28, 2007 at 15:31 UTC (#653554)


in reply to PDF Parsing

I've used PDF::API2, and although it is an excellent parser, it can be unwieldy for large documents: it seems to load and parse the entire document before you can do anything with it, chewing up great heaping gobs of memory in the process.

This can be problematic if your document is particularly large or if you have a large number of documents to parse.

(I experienced this issue firsthand when I had to parse and modify thousands of PDF files for a time-critical project and it took quite literally the better part of a weekend with two dedicated laptops churning away 24x7.)

Do any of the monks who've worked with CAM::PDF know whether it behaves the same way?


Re^2: PDF Parsing
by Anonymous Monk on Dec 03, 2007 at 16:49 UTC
    Hi Starky! Can you give me an example of how to parse a PDF with PDF::API2? I want to find XObjects and replace them. It would be great to hear from you. Regards, Alex
Re^2: PDF Parsing
by ademmler (Novice) on Dec 03, 2007 at 16:55 UTC
    Hi Starky! It's me again, this time as a "known monk", with the same question. Can you give me an example of how to parse a PDF with PDF::API2? I want to find XObjects and replace them. It would be great to hear from you. Regards, Alex. PS: Sorry for my confusing use of this forum.
      Hi, figuring out how to parse existing PDF files gave me headaches, but after reading the perldoc for PDF::API2::File I figured it out. If you do something like my $foo = PDF::API2->open('bar.pdf');, the file structure is stored in $foo->{'pdf'}. From there you have the Catalog (see the PDF specification), which you can walk to get the indirect references to objects (pages and annotations, or the AcroForm). Once you have a hash referring to the item you want to mess with, you can use the read_obj method like this:

        my $pdfapi = PDF::API2->open('foo.pdf');
        my $pdf = $pdfapi->{'pdf'};
        my $object = $pdf->read_obj($indirect_reference_hashref);
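      For the XObject question specifically, here is a rough sketch of one way to list the XObjects on each page. It only lists them and does not replace anything. It assumes the older PDF::API2 method names pages and openpage (newer releases call these page_count and open_page), and it leans on the low-level find_prop, realise and val methods from the PDF::API2::Basic::PDF classes. The helper names xobject_names and list_xobjects are made up for this example, and the code has not been run against every kind of PDF, so treat it as a starting point only.

```perl
use strict;
use warnings;
use PDF::API2;

# Hypothetical helper: return the PDF names stored in a dictionary.
# Keys that start with a space are PDF::API2's internal bookkeeping
# entries, not PDF names, so they are filtered out.
sub xobject_names {
    my ($dict) = @_;
    return sort grep { !/^ / } keys %$dict;
}

# Hypothetical helper: print every XObject found in each page's resources.
sub list_xobjects {
    my ($file) = @_;
    my $pdfapi = PDF::API2->open($file);

    # Assumption: older method names; newer releases use page_count/open_page.
    my $count = $pdfapi->pages;
    for my $n (1 .. $count) {
        my $page = $pdfapi->openpage($n);

        # Resources may be inherited from a parent Pages node,
        # so look them up with find_prop instead of $page->{Resources}.
        my $resources = $page->find_prop('Resources') or next;
        $resources->realise;

        my $xobjects = $resources->{'XObject'} or next;
        $xobjects->realise;

        for my $name (xobject_names($xobjects)) {
            my $xo = $xobjects->{$name};
            $xo->realise;    # resolve the indirect reference
            my $subtype = $xo->{'Subtype'} ? $xo->{'Subtype'}->val : 'unknown';
            print "page $n: /$name ($subtype)\n";
        }
    }
}

# Process every file named on the command line (nothing runs without arguments).
list_xobjects($_) for @ARGV;
```

      Run it as perl list_xobjects.pl foo.pdf (the script name is arbitrary). Each $xo is the dictionary you would then modify or swap out; how to do the replacement depends on whether it is an Image or a Form XObject.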
