PerlMonks
Example Of Using CAM::PDF Like HTML::TokeParser
by Limbic~Region (Chancellor) on Oct 08, 2011 at 15:21 UTC ( [id://930360]=perlquestion )
Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:
All,
Typically when I need to extract data from a PDF, I just convert it to text and apply some regex fu on the text. This approach is not effective for my current project due to the page layout. I was hoping to use the traverse() method to create a node walker akin to HTML::TokeParser.
I have done a fair amount of searching and came across two hints of a solution on Stack Overflow from the author of CAM::PDF. I have also emailed the author, though I imagine he is quite busy actually having a life. Obviously, I am not looking for someone to write the parser for me, but does anyone have a more generic example of using traverse()? Below is an example of how I create a parser using HTML::TokeParser.
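A minimal sketch of that workflow, not the original listing: it first dumps every token so the structure can be read by eye, then jumps straight to one element by its id. The sample HTML, the `prices` id, and the helper names `dump_tokens` and `table_cells` are all hypothetical and exist only for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample document standing in for the real page.
my $html = <<'END_HTML';
<html><body>
<h1>Price list</h1>
<table id="prices">
<tr><td>Widget</td><td>9.99</td></tr>
</table>
</body></html>
END_HTML

# Step 1: dump every token, one per line, so the structure can be read by eye.
sub dump_tokens {
    my ($doc) = @_;
    my $p = HTML::TokeParser->new(\$doc) or die "Cannot create parser";
    my @lines;
    while (my $token = $p->get_token) {
        my $type = $token->[0];
        if ($type eq 'S') {        # start tag: [S, tag, attr hash, attr order, text]
            my ($tag, $attr) = @{$token}[1, 2];
            my $attrs = join ' ',
                map { sprintf '%s="%s"', $_, $attr->{$_} } sort keys %$attr;
            push @lines, length $attrs ? "S <$tag $attrs>" : "S <$tag>";
        }
        elsif ($type eq 'E') {     # end tag: [E, tag, text]
            push @lines, sprintf 'E </%s>', $token->[1];
        }
        elsif ($type eq 'T') {     # text: [T, text, is_data]
            (my $text = $token->[1]) =~ s/\s+/ /g;
            push @lines, "T '$text'" if $text =~ /\S/;   # skip whitespace-only text
        }
    }
    return @lines;
}

# Step 2: once the dump shows the data lives in <table id="prices">,
# skip ahead to that tag and collect the text of each cell.
sub table_cells {
    my ($doc, $id) = @_;
    my $p = HTML::TokeParser->new(\$doc) or die "Cannot create parser";
    while (my $tag = $p->get_tag('table')) {
        # For a start tag, get_tag returns [tag, attr hash, attr order, text].
        next unless defined $tag->[1]{id} && $tag->[1]{id} eq $id;
        my @cells;
        while (my $inner = $p->get_tag('td', '/table')) {
            last if $inner->[0] eq '/table';
            push @cells, $p->get_trimmed_text('/td');
        }
        return @cells;
    }
    return;    # no table with that id
}

print "$_\n" for dump_tokens($html);
print join(', ', table_cells($html, 'prices')), "\n";
```

The point of the two steps is that the dump is throwaway: it only exists to reveal which tag and attribute mark the data, after which `get_tag` and `get_trimmed_text` do the real work. Something with the same feel for walking the nodes of a PDF is what I am after.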
I then edit the dumped document, searching for the piece of information I want to extract. Perhaps it is identified by a certain id or name attribute. Then I can start to construct my parser. In other words, once I understand the internal structure of the HTML document, I can find the data I am looking for.

Cheers - L~R