parsing a pdf with CAM::PDF

diamondsandperls has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to parse a PDF and print the ASCII text of each match to a file. Currently I get the odd output below on the console, and nothing is written to the output file. I have verified that the text in the PDF can be highlighted, copied, and pasted out.

current odd output:
(?-xism:Source IP:.*(\d+.\d+.\d+.\d+))
(?-xism:Request URI: (.*))
(?-xism:HOST: (.*))

#!perl

use strict;
use warnings;
use CAM::PDF;

my $output_file = 'test.txt';

my $filename = "view.pdf";
my @pdfStrings = ( qr/Source IP:.*(\d+.\d+.\d+.\d+)/, qr/Request URI: (.*)/, qr/HOST: (.*)/ );


open(my $output_fh, '>', $output_file)
    or die "Failed to open $output_file - $!";


foreach my $pdfString (@pdfStrings) {    

my $doc = CAM::PDF->new($filename) || die "Unable to open $filename - $!";
my $ascii = CAM::PDF->parseAny($pdfString);

print $pdfString, "\n";
}


Replies are listed 'Best First'.
Re: parsing a pdf with CAM::PDF by daxim (Curate) on Jul 04, 2012 at 18:50 UTC
You are printing the Regexp objects in the array `@pdfStrings`. You likely meant to print `$ascii`.
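To illustrate the point: a `qr//` value is a compiled regex object, and printing it shows the pattern source wrapped in a flags group (`(?-xism:...)` on older perls, `(?^:...)` from 5.14 on). The text you are after only appears once the regex is applied to a string and the capture is taken. A minimal sketch, in which the sample line is made up for illustration and is not taken from the asker's PDF:

```perl
use strict;
use warnings;

my $re   = qr/HOST: (.*)/;
my $line = 'HOST: example.com';    # hypothetical sample input

# Printing the qr// object prints the pattern itself, not any matched text.
print "$re\n";

# Applying the regex in list context returns the captured group.
my ($host) = $line =~ $re;
print "$host\n";
```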
Re^2: parsing a pdf with CAM::PDF by diamondsandperls (Beadle) on Jul 04, 2012 at 18:57 UTC
Thanks. I also noticed that I needed to direct the print to `$output_fh`. The code still does not work, though. This is what it now prints to the output file:

CAM::PDF::Node=HASH(0x28fa004)
CAM::PDF::Node=HASH(0x28af0f4)
CAM::PDF::Node=HASH(0x28f9a9c)

Updated code:

#!perl

use strict;
use warnings;
use CAM::PDF;

my $output_file = 'test.txt';

my $filename = "view.pdf";
my @pdfStrings = ( qr/Source IP:.*(\d+.\d+.\d+.\d+)/, qr/Request URI: (.*)/, qr/HOST: (.*)/ );

open(my $output_fh, '>', $output_file)
    or die "Failed to open $output_file - $!";

foreach my $pdfString (@pdfStrings) {
    my $doc = CAM::PDF->new($filename) || die "Unable to open $filename - $!";
    my $ascii = CAM::PDF->parseAny($pdfString);

    print {$output_fh} $ascii, "\n";
}
Re^3: parsing a pdf with CAM::PDF by Athanasius (Archbishop) on Jul 05, 2012 at 04:43 UTC
Hello diamondsandperls,

There are a few problems in your updated code:

(1) The regexes won't work as you want. For example, in a regex a single dot matches any character (except newline). To get the literal dots in an IP address, you must backslash them: `\.` And for the URI and HOST, you want the capture to end at the first whitespace, so use `\S+`

(2) There is no need to re-open the PDF file each time through the `foreach` loop.

(3) As bulk88 pointed out, having created an object (`$doc`), you should call an instance method on it: `$doc->parseAny($pdfString);`

(4) However, I'm not sure that's the method you want. From the module's documentation, it appears `getPageText` might be the right choice.

Applying these fixes to your code (and assuming the PDF document contains only 1 page):

#! perl
use strict;
use warnings;
use CAM::PDF;

my $filename    = 'view1.pdf';
my $output_file = 'test.txt';
my @pdfStrings  = (
    qr/Source IP:\s(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/,
    qr/Request URI:\s(\S+)/,
    qr/HOST:\s(\S+)/,
);

my $pdf = CAM::PDF->new($filename)
    or die "Cannot open '$filename' as a PDF file: $!";
my $doc = $pdf->getPageText(1);

open(my $output_fh, '>', $output_file)
    or die "Failed to open file '$output_file' for writing: $!";

foreach my $search_string (@pdfStrings) {
    my ($find) = $doc =~ /$search_string/;
    print $output_fh $find, "\n" if $find;
}

close($output_fh)
    or die "Failed to close file '$output_file': $!";

This should work. However, when I create a test PDF file using Word, I find that `$pdf->getPageText(1)` returns a string containing the text of the PDF file but with extra newlines inserted. (I cannot see any reason for this.) These newlines can cause the regexes to fail. `:-(` But if your input PDF files are created differently, perhaps they won't give rise to this problem?

HTH,

Athanasius <°(((>< contra mundum
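Two of the points above can be tried without a PDF at all. The sketch below first shows why the unescaped dots are too loose, then shows one possible workaround for the extra newlines: collapsing all whitespace runs to a single space before matching. The sample text, and in particular where the stray newline falls, is an assumption made for illustration; real `getPageText` output may break lines elsewhere (for instance in the middle of a value), in which case this normalisation alone would not be enough.

```perl
use strict;
use warnings;

# Hypothetical page text; the newline inside "Source\nIP" mimics the extra
# newlines reported above (where they really fall is an assumption).
my $text = "Source\nIP: 10.1.2.3\nRequest URI: /index.html?x=1\nHOST: www.example.com\n";

# 1. An unescaped dot matches any character, so this "IP" pattern is too loose.
my $loose  = qr/^\d+.\d+.\d+.\d+$/;
my $strict = qr/^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/;
my $loose_hit  = '1234567' =~ $loose  ? 1 : 0;    # matches: the dots consume digits
my $strict_hit = '1234567' =~ $strict ? 1 : 0;    # no match: literal dots required

# 2. Collapse every run of whitespace (newlines included) to one space
#    before searching, so a label split across lines can still match.
(my $flat = $text) =~ s/\s+/ /g;

my @patterns = (
    qr/Source IP:\s(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/,
    qr/Request URI:\s(\S+)/,
    qr/HOST:\s(\S+)/,
);

my @found;
foreach my $re (@patterns) {
    my ($match) = $flat =~ $re;
    push @found, $match if defined $match;
}
print "$_\n" for @found;
```

Testing `defined $match` rather than plain truthiness avoids silently dropping a capture that happens to be `0`.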
Re^3: parsing a pdf with CAM::PDF by bulk88 (Priest) on Jul 05, 2012 at 01:03 UTC
You got an object. Call a method on it, unless the docs say it's overloaded; if it's overloaded, try `$plain = $overloaded."";`
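A small sketch of the difference, using two made-up classes (`My::Node` and `My::Overloaded` are hypothetical stand-ins, not part of CAM::PDF): a plain object stringifies to `Class=HASH(0x...)`, which is exactly what ended up in the output file, while an object whose class overloads `""` yields its own string when concatenated.

```perl
use strict;
use warnings;

{
    # Hypothetical stand-in for a plain object such as a parse node.
    package My::Node;
    sub new   { my ($class, %args) = @_; return bless {%args}, $class }
    sub value { my ($self) = @_; return $self->{value} }
}

{
    # Hypothetical class that overloads stringification.
    package My::Overloaded;
    use overload '""' => sub { my ($self) = @_; return $self->{value} };
    sub new { my ($class, %args) = @_; return bless {%args}, $class }
}

my $node       = My::Node->new(value => 'hello');
my $as_string  = "$node";          # something like My::Node=HASH(0x...)
my $via_method = $node->value;     # the data, obtained through a method call

my $overloaded = My::Overloaded->new(value => 'hello');
my $plain      = $overloaded . "";    # uses the overloaded "" operator

print "$as_string\n$via_method\n$plain\n";
```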
Re: parsing a pdf with CAM::PDF by Anonymous Monk on Jul 05, 2012 at 07:39 UTC
Look at the CAM::PDF documentation for parseAny: it clearly takes PDF as input, not a bunch of feeble attempts at regular expressions. You want to use getpdftext.pl, which extracts and prints the text from one or more PDF pages. I'm beginning to think you're some kind of troll.
