PerlMonks  

PDF::API2 traversing object tree and parsing text

by RandomMonkey (Initiate)
on Sep 01, 2011 at 22:00 UTC ( [id://923730]=perlquestion )

RandomMonkey has asked for the wisdom of the Perl Monks concerning the following question:

Does anybody have (or could anybody write) a simple PDF::API2 example that traverses all the objects of an opened PDF file? I have spent some time reading through the docs and googling for examples, and so far I have either found nothing or have not hit on the right Google/PerlMonks search string.

More specifically, I would like to open a PDF file, traverse its objects (at least the text objects), and extract the information from each one. Once I have identified these internal objects, I would perhaps also like to read other meta information from them. PDF::API2 appears to have this capability, but I have so far not been able to figure out how to use it.

I have already scoured Google and PerlMonks for information. I have found many really good threads discussing how to create PDF documents, and even some threads that appear to answer my question, but all of the examples they cite are no longer reachable.

I am still a bit hazy about how to even identify the display objects in the $pdf object. Ideally, I would like an example that opens a PDF with PDF::API2, shows how to identify and navigate through the object list/tree(?), and shows how to extract text (and/or other information) from these internal objects.

Thanks in advance for any clues or help. :-D

Re: PDF::API2 traversing object tree and parsing text
by chrestomanci (Priest) on Sep 02, 2011 at 09:09 UTC

    Where did you read that PDF::API2 is able to read PDF documents and extract content from them? I am not saying it is impossible, just that I can't see any suggestion that it is possible in the docs on CPAN or in PDF::API2::HOWTO.

    Another approach you could take, especially if you just want the text from the PDF, would be to convert it to another format and parse that format with Perl. For example, a Google search for pdf2svg returns an open source command line tool for the purpose, and also Wikipedia instructions on how to convert manually using Inkscape. As SVG is an XML-based format, you should be able to find plenty of Perl libraries and tutorials that will help you extract what you need.
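
As a rough sketch of the "convert to SVG, then parse with Perl" idea, the snippet below pulls the text out of an SVG file that has already been produced by some converter. It assumes the XML::LibXML module is installed and that the converter emits real <text> elements; some converters draw text as glyph outlines instead, in which case this finds nothing. The helper name svg_text_runs is made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: return the text of every <text> element in an SVG
# document, in document order, with whitespace collapsed.
sub svg_text_runs {
    my ($svg_string) = @_;

    my $doc = XML::LibXML->load_xml(string => $svg_string);

    # SVG elements live in a namespace, so XPath needs a registered prefix.
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(svg => 'http://www.w3.org/2000/svg');

    my @runs;
    for my $node ($xpc->findnodes('//svg:text')) {
        # textContent also includes the text of nested <tspan> children.
        my $text = $node->textContent;
        $text =~ s/\s+/ /g;
        $text =~ s/^ | $//g;
        push @runs, $text if length $text;
    }
    return @runs;
}

# Usage (assumed file name): perl svgtext.pl page.svg
if (@ARGV) {
    open my $fh, '<:raw', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    local $/;    # slurp the whole file in one read
    print "$_\n" for svg_text_runs(<$fh>);
}
```

Each returned string corresponds to one <text> element, so reading order and paragraph structure still have to be reconstructed from the x/y attributes if they matter.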

      PDF::API2 - Facilitates the creation and modification of PDF files

      $pdf = PDF::API2->open($pdffile)

        Facilitates the creation and modification of PDF files

        I saw that in the PDF::API2 docs as well; however, modification does not imply reading. It looks to me as if modification is limited to adding elements to an existing document, such as extra pages with new content, or overprinting existing pages with extra text or pictures.

        To give an analogy, this is like using a printing press to modify an existing printed document. You can print something else on the back, attach extra pages, or even overprint on the front, obliterating anything already there, but the press does not read the document and edit it intelligently; it just adds to it.

        There are method calls in PDF::API2 to read metadata, such as $pdf->preferences(%options), $pdf->default($parameter) and $pdf->info(%infohash), but I think the OP wants more than just metadata.
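
For what the metadata calls do give you, here is a minimal sketch that opens an existing file and reads the info hash and the page count. It assumes a PDF::API2 release where info() called without arguments returns the current info hash and pages() returns the number of pages; newer releases document other method names for the same things. The helper name pdf_summary is hypothetical, and nothing here touches the page content itself.

```perl
use strict;
use warnings;
use PDF::API2;

# Hypothetical helper: collect document-level metadata from an existing PDF.
sub pdf_summary {
    my ($path) = @_;

    my $pdf = PDF::API2->open($path);

    # info() with no arguments returns the info dictionary as a hash
    # (Title, Author, Producer and so on, for whichever keys are set).
    my %info = $pdf->info();

    # pages() returns the number of pages in the document.
    my $pages = $pdf->pages();

    return { pages => $pages, info => \%info };
}

# Usage (assumed file name): perl pdfinfo.pl some.pdf
if (@ARGV) {
    my $summary = pdf_summary($ARGV[0]);
    print "Pages: $summary->{pages}\n";
    for my $key (sort keys %{ $summary->{info} }) {
        print "$key: $summary->{info}{$key}\n";
    }
}
```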

        As I say, I would be happy to be corrected, but as yet I have seen no evidence that PDF::API2 is able to read and process the contents of a PDF document.
