Example Of Using CAM::PDF Like HTML::TokeParser

by Limbic~Region (Chancellor)
on Oct 08, 2011 at 15:21 UTC (#930360, perlquestion)

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

Typically when I need to extract data from a PDF, I just convert it to text and apply some regex fu on the text. This approach is not effective for my current project due to the page layout. I was hoping to use the traverse() method to create a node walker akin to HTML::TokeParser.
  • Consume a node
  • Determine node type
  • Determine current state of parse
  • Dispatch a handler for the node based on type and current state
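
The four steps above can be sketched as a dispatch table keyed first by parse state and then by node type. This is only an illustration under stated assumptions: the node stream, the state names, and the walk() function are all hypothetical, and a real walker would pull its nodes from the PDF rather than from a hard-coded list.

```perl
use strict;
use warnings;

# Hypothetical node stream: each node is [type, payload].
# A real walker would pull these from the PDF instead.
my @nodes = (
    [ op   => 'BT' ],
    [ text => 'Password:' ],
    [ text => 'hunter2' ],
    [ op   => 'ET' ],
);

# Dispatch table keyed by parse state, then by node type.
my %dispatch = (
    start => {
        text => sub {
            my ($node, $ctx) = @_;
            # A known label switches the parser into the "want_value" state.
            $ctx->{state} = 'want_value' if $node->[1] eq 'Password:';
        },
    },
    want_value => {
        text => sub {
            my ($node, $ctx) = @_;
            $ctx->{wanted}{password} = $node->[1];
            $ctx->{state} = 'done';
        },
    },
);

sub walk {
    my (@stream) = @_;
    my %ctx = (state => 'start', wanted => {});
    for my $node (@stream) {                            # consume a node
        my $type    = $node->[0];                       # determine node type
        my $handler = $dispatch{ $ctx{state} }{$type}   # look up by state and type
            or next;                                    # no handler: skip the node
        $handler->($node, \%ctx);                       # dispatch
        last if $ctx{state} eq 'done';
    }
    return $ctx{wanted};
}

my $wanted = walk(@nodes);
print "password = $wanted->{password}\n";
```

Keeping the handlers in a table means a new state or node type is one more hash entry rather than another branch in a growing if/elsif chain.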

I have done a fair amount of searching and came across two hints of a solution at Stack Overflow by the author of CAM::PDF. I have also emailed the author though I imagine he is quite busy actually having a life.

Obviously, I am not looking for someone to write the parser for me, but does anyone have a generic example of using traverse()? Below is an example of how I create a parser using HTML::TokeParser:

    # Step 1: Dump the entire document
    while (my $tok = $p->get_token) {
        print Dumper($tok);
    }

I then edit the dumped document searching for the piece of information I want to extract. Perhaps it is identified by a certain id or name tag. Then, I can start to construct my parser:

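    # Token layout in HTML::TokeParser:
    #   start tag: ["S", $tag, $attr, $attrseq, $text]
    #   text:      ["T", $text, $is_data]
    use constant TYPE => 0;
    use constant TEXT => 1;   # index of the text in a "T" token
    use constant TAG  => 1;   # index of the tag name in an "S" token
    use constant ATTR => 2;   # index of the attribute hash in an "S" token

    while (my $tok = $p->get_token) {
        next if $tok->[TYPE] ne 'S' || $tok->[TAG] ne 'b' || ! $tok->[ATTR]{class};
        next if $tok->[ATTR]{class} ne 'secret';
        my $next = $p->get_token;
        $wanted{password} = trim($next->[TEXT]);
        last;
    }
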
In other words, once I understand the internal structure of the HTML document, I can find the data I am looking for.

Cheers - L~R

Replies are listed 'Best First'.
Re: Example Of Using CAM::PDF Like HTML::TokeParser
by Khen1950fx (Canon) on Oct 08, 2011 at 23:25 UTC
    Would something like this help? I'm still trying to get a handle on it, and this is what I have so far.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Devel::SimpleTrace;
    use CAM::PDF;
    use CAM::PDF::Content;
    use CAM::PDF::PageText;
    use Data::Dumper::Concise;

    my $file = '/root/Desktop/sample1.pdf';
    binmode STDOUT, ":encoding(utf8)";
    my $pdf = CAM::PDF->new($file);
    for my $pagenum (1 .. $pdf->numPages) {
        my $contentTree = $pdf->getPageContentTree($pagenum) or next;
        $contentTree->validate() or die $@;
        print Dumper($contentTree->render('CAM::PDF::Renderer::Dump'));
        $pdf->setPageContent(2,$pagenum);
        last;
    }
      In short, yes. I am still playing but this was a significant step in the right direction. Please let me know what else you come up with.

      Cheers - L~R

        Here's what I have now. I borrowed hdump from the examples directory of HTML::Parser. Then I used CAM::PDF::GS to make a gs log file.
        #!/usr/bin/perl
        use strict;
        use warnings;
        use CAM::PDF;
        use Data::Dumper::Concise;
        use base qw(CAM::PDF::GS::NoText);

        my $file = shift @ARGV;
        my $log  = '/root/Desktop/gs.log';

        # Redirect STDOUT to the log file, written as UTF-8
        open STDOUT, '>:encoding(utf8)', $log or die "Cannot open $log: $!";

        my $pdf         = CAM::PDF->new($file);
        my $contentTree = $pdf->getPageContentTree(5);
        my $gs          = $contentTree->computeGS;
        print Dumper($gs);
        close STDOUT;
        From the cmdline do
        perl your_script.pl /path/to/pdf
        Then I used hdump to examine gs.log:
        #!/usr/bin/perl -w
        use strict;
        use HTML::Parser;
        use Data::Dumper::Concise;
        $| = 1;

        # Handler: print a short summary of each parser event
        sub h {
            my ( $event, $line, $column, $text, $tagname, $attr ) = @_;
            my (@d) = uc( substr( $event, 0, 1 ) ) . " L$line C$column";
            substr( $text, 40 ) = "..." if length $text > 40;
            push @d, $text;
            push @d, $tagname if defined $tagname;
            push @d, $attr if $attr;
            print Dumper(@d);
        }

        my $p = HTML::Parser->new( api_version => 3 );
        $p->handler( default => \&h, "event, line, column, text, tagname, attr" );
        $p->parse_file( @ARGV ? shift : *STDIN );
        From the cmdline:
        perl hdump /path/to/gs.log
        I hope that it's useful for you.
Re: Example Of Using CAM::PDF Like HTML::TokeParser
by pvaldes (Chaplain) on Oct 08, 2011 at 21:37 UTC

    ok then,

    $po->traverse(1, $a_node_name, $function, $somedata);

    The first argument to traverse is 1 (follow the link and traverse the node) or 0 (do not follow it; treat the link as "dead").

    The second argument is the node to apply it to.

    The third is the action to perform as you pass through each node. Several functions provided with the module can be used here

     (e.g. \&_changeRefKeysCB, \&_abbrevInlineImageCB, \&_changeStringCB or \&_getRefListCB)

    and the fourth argument is the data passed to that action (e.g. $im_a_list).

    Hope this helps, bye
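
    The callback-plus-data convention described above can be tried out without a PDF at hand. The following is a minimal sketch, assuming each node is a plain hash with type and value keys; that is a simplification for illustration, not the real CAM::PDF node class. The names walk_tree and collect_strings are hypothetical, and the real traverse() may differ in details such as how it follows references.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a PDF object tree: every node is { type, value }.
my $tree = {
    type  => 'dictionary',
    value => {
        Title => { type => 'string', value => 'Report' },
        Kids  => {
            type  => 'array',
            value => [
                { type => 'string', value => 'Page one' },
                { type => 'number', value => 42 },
            ],
        },
    },
};

# Visit $node and everything below it, calling $func->($node, $data) on each.
sub walk_tree {
    my ($node, $func, $data) = @_;
    my @stack = ($node);
    while (@stack) {
        my $cur = pop @stack;
        $func->($cur, $data);
        if ($cur->{type} eq 'dictionary') {
            # Sort the keys so the visiting order is repeatable.
            push @stack, map { $cur->{value}{$_} } sort keys %{ $cur->{value} };
        }
        elsif ($cur->{type} eq 'array') {
            push @stack, @{ $cur->{value} };
        }
    }
    return;
}

# Callback: append the value of every string node to the caller's list.
sub collect_strings {
    my ($node, $list) = @_;
    push @{$list}, $node->{value} if $node->{type} eq 'string';
    return;
}

my @strings;
walk_tree($tree, \&collect_strings, \@strings);
print join(', ', sort @strings), "\n";
```

    The fourth-argument idea shows up here as \@strings: the walker itself knows nothing about what is being collected, so the same loop serves any callback.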

Re: Example Of Using CAM::PDF Like HTML::TokeParser
by pvaldes (Chaplain) on Oct 08, 2011 at 16:12 UTC

    If the PDF layout is the problem, you may want to consider using pdftotext and experimenting with the -layout option:

    `pdftotext -layout file.pdf file.txt`;
    `pdftotext file.pdf second_file.txt`;

    You can also extract only the desired pages of the PDF instead of the whole file, which makes the search easier.
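
    Restricting the extraction to a page range can be done with pdftotext's -f (first page) and -l (last page) options. A minimal sketch follows; the helper name pdftotext_cmd, the file names, and the page numbers are hypothetical, and the actual call is left commented out so the snippet runs even where pdftotext is not installed.

```perl
use strict;
use warnings;

# Build the argument list for pdftotext, limited to a page range.
# -f / -l select the first and last page; -layout keeps the physical layout.
sub pdftotext_cmd {
    my ($first, $last, $pdf, $txt) = @_;
    return ('pdftotext', '-f', $first, '-l', $last, '-layout', $pdf, $txt);
}

# Hypothetical file names and page range.
my @cmd = pdftotext_cmd(3, 5, 'file.pdf', 'pages_3_to_5.txt');
print "@cmd\n";

# Uncomment to run it; the list form of system() bypasses the shell.
# system(@cmd) == 0 or die "pdftotext failed: $?";
```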

      As I indicated in my original post, extracting the text didn't work. What I didn't indicate is that I tried every tool and variation I could think of, including commercial products. None of the text extractions produces a consistent enough format for me to get at what I need. I understand that what I want to do is neither ideal nor easy and may be futile; I would, however, like to try for myself.

      Cheers - L~R

Re: Example Of Using CAM::PDF Like HTML::TokeParser
by Anonymous Monk on Oct 11, 2011 at 12:36 UTC
    Can you use XPath expressions to zero-in more directly on the particular nodes you're looking for? "Writing programmed logic" to navigate an XML or HTML tree is akin to writing a recursive-descent compiler by hand instead of using YACC.
      Anonymous Monk,
      If you are referring to the non-existent PDF parser that this thread is about, then no. The internal structure of a PDF wouldn't lend itself to XPath diving.

      If you are referring to the way I go about creating a parser using HTML::TokeParser, then the answer is "it depends". Node traversal is usually the last tool in the box I reach for. I am not even opposed to using regular expressions (*gasp*) if each page is consistent enough. It all depends on how consistent one page is with the next.

      Cheers - L~R

Re: Example Of Using CAM::PDF Like HTML::TokeParser
by thargas (Deacon) on Oct 11, 2011 at 18:53 UTC
    You may want to look at CAM::PDF::Renderer::Text. Although I'm sure you're not interested in its output format, it might be interesting as an example of getting the basic text/location info. You could use that and wire in your own functions to figure out what you want.
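
    As one idea of what "your own functions" might look like once text and location info is available: the sketch below groups positioned fragments into lines. It assumes the fragments have already been collected as [x, y, string] triples by whatever means; the sample data, the tolerance value, and the name group_lines are hypothetical, and nothing here calls CAM::PDF itself.

```perl
use strict;
use warnings;

# Hypothetical input: [x, y, string] triples, as a renderer might report them.
my @fragments = (
    [ 200, 700, 'Value' ],
    [  72, 700, 'Name'  ],
    [  72, 686, 'alpha' ],
    [ 200, 685, '1'     ],
);

# Group fragments into lines: y values within $tolerance count as the same line.
sub group_lines {
    my ($tolerance, @frags) = @_;
    my @lines;
    # PDF y coordinates grow upward, so descending y means top to bottom.
    for my $f (sort { $b->[1] <=> $a->[1] } @frags) {
        if (@lines && abs($lines[-1]{y} - $f->[1]) <= $tolerance) {
            push @{ $lines[-1]{frags} }, $f;
        }
        else {
            push @lines, { y => $f->[1], frags => [$f] };
        }
    }
    # Within each line, order left to right and join the strings.
    my @out;
    for my $line (@lines) {
        my @ordered = sort { $a->[0] <=> $b->[0] } @{ $line->{frags} };
        push @out, join ' ', map { $_->[2] } @ordered;
    }
    return @out;
}

print "$_\n" for group_lines(2, @fragments);
```

    The tolerance is there because fragments on the same visual line often differ slightly in baseline; how large it should be depends on the document.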

Node Type: perlquestion [id://930360]
Approved by Corion
Front-paged by davido