PerlMonks  

Re^2: parsing a pdf with CAM::PDF

by diamondsandperls (Beadle)
on Jul 04, 2012 at 18:57 UTC (#979905)


in reply to Re: parsing a pdf with CAM::PDF
in thread parsing a pdf with CAM::PDF

Thanks. I also noticed that I needed to direct the print to the output_fh filehandle. The code still does not work, though. This is what it prints to the output file:

CAM::PDF::Node=HASH(0x28fa004)
CAM::PDF::Node=HASH(0x28af0f4)
CAM::PDF::Node=HASH(0x28f9a9c)

updated code:

#!perl
use strict;
use warnings;
use CAM::PDF;

my $output_file = 'test.txt';
my $filename    = "view.pdf";
my @pdfStrings  = (
    qr/Source IP:.*(\d+.\d+.\d+.\d+)/,
    qr/Request URI: +(.*)/,
    qr/HOST: (.*)/
);

open(my $output_fh, '>', $output_file)
    or die "Failed to open $output_file - $!";

foreach my $pdfString (@pdfStrings) {
    my $doc = CAM::PDF->new($filename) || die "Unable to open $filename - $!";
    my $ascii = CAM::PDF->parseAny($pdfString);
    print {$output_fh} $ascii, "\n";
}


Re^3: parsing a pdf with CAM::PDF
by bulk88 (Priest) on Jul 05, 2012 at 01:03 UTC
    You got an object back. Call a method on it to get at the data, unless the docs say it's overloaded. If it is overloaded, try stringifying it:
    $plain = $overloaded."";
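To illustrate the point about objects, here is a minimal sketch of why the output file contains lines like CAM::PDF::Node=HASH(0x28fa004). The two classes, Demo::Node and Demo::Overloaded, are hypothetical stand-ins invented for this example; they are not part of CAM::PDF, and nothing here says whether CAM::PDF::Node overloads stringification.

```perl
use strict;
use warnings;

{
    # Hypothetical class with no overloading: printing it shows the reference.
    package Demo::Node;
    sub new   { my ($class, %args) = @_; return bless {%args}, $class }
    sub value { my ($self) = @_; return $self->{value} }
}

{
    # Hypothetical class that overloads stringification.
    package Demo::Overloaded;
    use overload '""' => sub { my ($self) = @_; return $self->{value} };
    sub new { my ($class, %args) = @_; return bless {%args}, $class }
}

my $node       = Demo::Node->new(value => 'some text');
my $overloaded = Demo::Overloaded->new(value => 'some text');

my $default_string = "$node";          # class name plus address, not the data
my $via_method     = $node->value;     # call a method to get the data
my $plain          = $overloaded."";   # overloaded: stringifies to the data

print "$default_string\n$via_method\n$plain\n";
```

Without an overload, the only way to get useful data out of the object is through its methods, which is why printing the return value directly produced the HASH(0x...) lines.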
Re^3: parsing a pdf with CAM::PDF
by Athanasius (Monsignor) on Jul 05, 2012 at 04:43 UTC

    Hello diamondsandperls,

    There are a few problems in your updated code:

    (1) The regexes won’t work as you want. For example, in a regex a single dot matches any character (except newline). To get the literal dots in an IP address, you must backslash them: \. And for the URI and HOST, you want the capture to end at the first whitespace, so use \S+

    (2) No need to re-open the PDF file each time through the foreach loop.

    (3) As bulk88 pointed out, having created an object ($doc), you should call an instance method on it: $doc->parseAny($pdfString);

    (4) However, I’m not sure if that’s the method you want. From the module’s documentation, it appears getPageText might be the right choice.
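The regex issues in point (1) can be checked in isolation. The following is a small sketch using made-up sample strings (the IP address, URI, and the string '1a2b3c4' are illustrative assumptions, not data from the PDF in question):

```perl
use strict;
use warnings;

my $not_an_ip = '1a2b3c4';
my $ip_line   = 'Source IP: 192.168.10.25';
my $uri_line  = 'Request URI: /index.html trailing text';

# An unescaped dot matches any character, so this accepts a non-IP string.
my $loose_accepts  = $not_an_ip =~ /^\d+.\d+.\d+.\d+$/ ? 1 : 0;

# Escaped dots only match literal dots, so this rejects it.
my $strict_accepts = $not_an_ip =~ /^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/ ? 1 : 0;

# The greedy .* before the capture swallows the start of the address,
# so the capture holds only a trailing part of it.
my ($greedy)  = $ip_line =~ /Source IP:.*(\d+.\d+.\d+.\d+)/;
my ($precise) = $ip_line =~ /Source IP:\s*(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/;

# (.*) runs to the end of the line; (\S+) stops at the first whitespace.
my ($to_eol)   = $uri_line =~ /Request URI: +(.*)/;
my ($to_space) = $uri_line =~ /Request URI:\s*(\S+)/;

print "loose: $loose_accepts, strict: $strict_accepts\n";
print "greedy: $greedy\nprecise: $precise\n";
print "to_eol: $to_eol\nto_space: $to_space\n";
```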

    Applying these fixes to your code (and assuming the PDF document contains only 1 page):

    #! perl
    use strict;
    use warnings;
    use CAM::PDF;

    my $filename    = 'view1.pdf';
    my $output_file = 'test.txt';
    my @pdfStrings  = (
        qr/Source IP:\s*(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/,
        qr/Request URI:\s*(\S+)/,
        qr/HOST:\s*(\S+)/,
    );

    my $pdf = CAM::PDF->new($filename)
        or die "Cannot open '$filename' as a PDF file: $!";
    my $doc = $pdf->getPageText(1);

    open(my $output_fh, '>', $output_file)
        or die "Failed to open file '$output_file' for writing: $!";

    foreach my $search_string (@pdfStrings) {
        my ($find) = $doc =~ /$search_string/;
        print $output_fh $find, "\n" if $find;
    }

    close($output_fh)
        or die "Failed to close file '$output_file': $!";

    This is supposed to work. However, when I create a test PDF file using Word, I find that $pdf->getPageText(1) returns a string containing the text of the PDF file but with extra newlines inserted. (I cannot see any reason for this.) And these newlines can cause the regexes to fail. :-( But if your input PDF files are created differently, perhaps they won’t give rise to this problem?
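One possible workaround for the stray newlines, offered only as a sketch: collapse every run of whitespace in the extracted text to a single space before matching. The $page_text string below is a made-up stand-in for what getPageText might return, and the approach assumes the extra newlines fall between words. If a newline splits a word or a value in the middle, this will not help.

```perl
use strict;
use warnings;

# Made-up stand-in for extracted page text with extra newlines inserted.
my $page_text = "Source IP:\n192.168.10.25\nRequest\nURI: /index.html\n"
              . "HOST:\nwww.example.com\n";

# Collapse each run of whitespace (including newlines) to a single space.
(my $flat = $page_text) =~ s/\s+/ /g;

my @patterns = (
    [ 'Source IP'   => qr/Source IP:\s*(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/ ],
    [ 'Request URI' => qr/Request URI:\s*(\S+)/ ],
    [ 'HOST'        => qr/HOST:\s*(\S+)/ ],
);

my %found;
for my $pattern (@patterns) {
    my ($label, $regex) = @$pattern;
    my ($value) = $flat =~ $regex;
    $found{$label} = $value if defined $value;
}

# On the raw text, the newline inside "Request\nURI:" defeats the literal space.
my $raw_uri_matches = $page_text =~ /Request URI:\s*(\S+)/ ? 1 : 0;

print "$_: $found{$_}\n" for sort keys %found;
print "raw text matches Request URI: $raw_uri_matches\n";
```

An alternative with the same effect would be to write \s+ in the patterns wherever a literal space appears, which avoids modifying the extracted text.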

    HTH,

    Athanasius <°(((><contra mundum
