http://www.perlmonks.org?node_id=979902

diamondsandperls has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to parse a PDF and print the ASCII text to a file for each match. Currently I get this odd output on the console, and nothing prints to the output file. I verified that the text in the PDF can be highlighted and copy-pasted out.

current odd output:
(?-xism:Source IP:.*(\d+.\d+.\d+.\d+))
(?-xism:Request URI: (.*))
(?-xism:HOST: (.*))

#!perl
use strict;
use warnings;
use CAM::PDF;

my $output_file = 'test.txt';
my $filename = "view.pdf";

my @pdfStrings = (
    qr/Source IP:.*(\d+.\d+.\d+.\d+)/,
    qr/Request URI: +(.*)/,
    qr/HOST: (.*)/
);

open(my $output_fh, '>', $output_file)
    or die "Failed to open $output_file - $!";

foreach my $pdfString (@pdfStrings) {
    my $doc = CAM::PDF->new($filename) || die "Unable to open $filename - $!";
    my $ascii = CAM::PDF->parseAny($pdfString);
    print $pdfString, "\n";
}

Replies are listed 'Best First'.
Re: parsing a pdf with CAM::PDF
by daxim (Curate) on Jul 04, 2012 at 18:50 UTC
    You are printing the Regexp objects in the array @pdfStrings. You likely meant to print $ascii.
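
    A minimal sketch of the difference, using a made-up input line rather than real PDF text: printing a qr// object prints the pattern itself (shown as (?-xism:...) on older perls and (?^:...) from Perl 5.14 on), while the matched text only appears once the regex is applied to a string and the capture is printed.

    ```perl
    use strict;
    use warnings;

    my $re = qr/HOST: (.*)/;

    # Printing the compiled regex prints its stringified pattern,
    # not anything it has matched.
    print $re, "\n";

    # To get the matched text, apply the regex to a string and
    # print the capture group. The sample line is an assumption.
    my $line = 'HOST: www.example.com';
    if ( my ($host) = $line =~ $re ) {
        print $host, "\n";
    }
    ```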
      Thanks. I also noticed that I needed to direct the print to $output_fh. The code still does not work, though. This is what gets printed to the output file:

      CAM::PDF::Node=HASH(0x28fa004)
      CAM::PDF::Node=HASH(0x28af0f4)
      CAM::PDF::Node=HASH(0x28f9a9c)

      updated code:

      #!perl
      use strict;
      use warnings;
      use CAM::PDF;

      my $output_file = 'test.txt';
      my $filename = "view.pdf";

      my @pdfStrings = (
          qr/Source IP:.*(\d+.\d+.\d+.\d+)/,
          qr/Request URI: +(.*)/,
          qr/HOST: (.*)/
      );

      open(my $output_fh, '>', $output_file)
          or die "Failed to open $output_file - $!";

      foreach my $pdfString (@pdfStrings) {
          my $doc = CAM::PDF->new($filename) || die "Unable to open $filename - $!";
          my $ascii = CAM::PDF->parseAny($pdfString);
          print {$output_fh} $ascii, "\n";
      }

        Hello diamondsandperls,

        There are a few problems in your updated code:

        (1) The regexes won’t work as you want. For example, in a regex a single dot matches any character (except newline). To get the literal dots in an IP address, you must backslash them: \. And for the URI and HOST, you want the capture to end at the first whitespace, so use \S+

        (2) No need to re-open the PDF file each time through the foreach loop.

        (3) As bulk88 pointed out, having created an object ($doc), you should call an instance method on it: $doc->parseAny($pdfString);

        (4) However, I’m not sure if that’s the method you want. From the module’s documentation, it appears getPageText might be the right choice.
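
        Point (1) can be checked in isolation, without a PDF. The sketch below runs the original pattern and an escaped one against a made-up line; the sample address and the variable names are assumptions for illustration only.

        ```perl
        use strict;
        use warnings;

        # Original pattern: unescaped dots match any character, and the
        # greedy .* swallows as much of the address as it can.
        my $loose = qr/Source IP:.*(\d+.\d+.\d+.\d+)/;

        # Escaped dots match only a literal '.', and \s* replaces the greedy .*
        my $strict = qr/Source IP:\s*(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/;

        my $line = 'Source IP: 192.168.10.25';    # made-up sample input

        my ($loose_ip)  = $line =~ $loose;
        my ($strict_ip) = $line =~ $strict;

        print "loose:  $loose_ip\n";     # loose:  168.10.25
        print "strict: $strict_ip\n";    # strict: 192.168.10.25

        # The loose pattern also accepts text that is not an address at all.
        print "loose matches junk\n" if 'Source IP: 1a2b3c4' =~ $loose;
        ```

        The loose capture loses the leading "192." because the greedy .* gives back only as many characters as the group needs, and the unescaped dots let a digit stand in for a separator.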

        Applying these fixes to your code (and assuming the PDF document contains only 1 page):

        #! perl
        use strict;
        use warnings;
        use CAM::PDF;

        my $filename = 'view1.pdf';
        my $output_file = 'test.txt';

        my @pdfStrings = (
            qr/Source IP:\s*(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/,
            qr/Request URI:\s*(\S+)/,
            qr/HOST:\s*(\S+)/,
        );

        my $pdf = CAM::PDF->new($filename)
            or die "Cannot open '$filename' as a PDF file: $!";
        my $doc = $pdf->getPageText(1);

        open(my $output_fh, '>', $output_file)
            or die "Failed to open file '$output_file' for writing: $!";

        foreach my $search_string (@pdfStrings) {
            my ($find) = $doc =~ /$search_string/;
            print $output_fh $find, "\n" if $find;
        }

        close($output_fh)
            or die "Failed to close file '$output_file': $!";

        This is supposed to work. However, when I create a test PDF file using Word, I find that $pdf->getPageText(1) returns a string containing the text of the PDF file but with extra newlines inserted. (I cannot see any reason for this.) And these newlines can cause the regexes to fail. :-( But if your input PDF files are created differently, perhaps they won’t give rise to this problem?
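
        One possible workaround for the stray newlines, offered only as a sketch: collapse every run of whitespace to a single space before matching. The helper name normalize_whitespace and the sample text are hypothetical, and this assumes the extra newlines fall between words; a newline inserted in the middle of a URI or host name would still split that value.

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper: replace each run of whitespace
        # (including newlines) with a single space.
        sub normalize_whitespace {
            my ($text) = @_;
            $text =~ s/\s+/ /g;
            return $text;
        }

        # Made-up stand-in for what getPageText might return,
        # with newlines breaking up the labels.
        my $page_text = "Source\nIP: 10.0.0.5\nRequest\nURI: /index.html\nHOST:\nwww.example.com\n";

        my @patterns = (
            qr/Source IP:\s*(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/,
            qr/Request URI:\s*(\S+)/,
            qr/HOST:\s*(\S+)/,
        );

        my $flat = normalize_whitespace($page_text);

        my @found;
        foreach my $pattern (@patterns) {
            my ($value) = $flat =~ $pattern;
            push @found, $value if defined $value;
        }

        print "$_\n" for @found;
        ```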

        HTH,

        Athanasius <°(((><contra mundum

        You got an object back, so call a method on it. If the docs say its stringification is overloaded, try
        $plain = $overloaded."";
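
        To illustrate with classes invented for this example (Plain and Temperature are not part of CAM::PDF): an ordinary blessed reference stringifies to something like Plain=HASH(0x...), which is the kind of output seen above, while a class that overloads "" returns whatever its handler produces.

        ```perl
        use strict;
        use warnings;

        package Temperature;    # hypothetical class with overloaded stringification
        use overload '""' => sub { $_[0]{degrees} . ' C' };

        sub new {
            my ($class, $degrees) = @_;
            return bless { degrees => $degrees }, $class;
        }

        package main;

        # No overloading: the object prints as ClassName=HASH(0x...)
        my $plain_obj = bless {}, 'Plain';
        print "$plain_obj\n";

        # Overloaded: concatenating with "" calls the '""' handler.
        my $overloaded = Temperature->new(21);
        my $plain      = $overloaded . "";
        print "$plain\n";       # 21 C
        ```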
Re: parsing a pdf with CAM::PDF
by Anonymous Monk on Jul 05, 2012 at 07:39 UTC

    Look at the CAM::PDF documentation for parseAny: it clearly takes PDF as input, not a bunch of feeble attempts at regular expressions.

    You want to use getpdftext.pl - Extracts and print the text from one or more PDF pages

    I'm beginning to think you're some kind of troll