Re: XPDF pdftotext page loop

by Anonymous Monk
on Oct 11, 2012 at 04:13 UTC (#998355)


in reply to XPDF pdftotext page loop

??

chomp( my $number_of_pages = qx{ pdftotext ... multipage.pdf } );
chomp(
my( $number_of_pages ) =
  qx{ pdfinfo ... multipage.pdf }
  =~ /Pages:\s+(\d+)/
);
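
If the goal is to handle one page at a time, one possible approach is to take the page count from pdfinfo and then ask pdftotext for a single page per call with its -f (first page) and -l (last page) options. The sketch below is only an illustration: parse_page_count and page_command are hypothetical helper names, and the usage in the comments assumes both tools are on PATH.

```perl
use strict;
use warnings;

# Pull the page count out of pdfinfo's text output (a line like "Pages:   12").
sub parse_page_count {
    my ($pdfinfo_output) = @_;
    my ($count) = $pdfinfo_output =~ /^Pages:\s+(\d+)/m;
    return $count;    # undef if no "Pages:" line was found
}

# Build the argument list that extracts exactly one page to stdout.
# -f and -l select the first and last page; a trailing "-" means stdout.
sub page_command {
    my ( $file, $page ) = @_;
    return ( 'pdftotext', '-layout', '-f', $page, '-l', $page, $file, '-' );
}

# Usage (assumes pdfinfo and pdftotext are installed):
# my $pages = parse_page_count( scalar qx{pdfinfo multipage.pdf} )
#     or die "no page count found";
# for my $n ( 1 .. $pages ) {
#     open my $fh, '-|', page_command( 'multipage.pdf', $n ) or die $!;
#     my $text = do { local $/; <$fh> };
#     close $fh;
#     # ... process $text for page $n ...
# }
```

This starts one pdftotext process per page, which is simple but can be slow on large documents.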


Re^2: XPDF pdftotext page loop
by Anonymous Monk on Oct 11, 2012 at 04:13 UTC
Re^2: XPDF pdftotext page loop
by DST609 (Novice) on Oct 11, 2012 at 04:37 UTC
    Sorry, this is my first post; perhaps I should clarify. There are no page numbers in the text on the page. I have already used pdfinfo to retrieve the number of pages, and I know how to count lines and run the basic regex I need to pull out my data. What I really need to know is how to tell where page 1 ends and page 2 starts using xpdf and pdftotext.

      Since pdftotext defaults to inserting form feed characters between pages, you can examine each line for a form feed character as an indication of pagination:

      use strict;
      use warnings;

      my $i       = 0;    # running line counter across all pages
      my $pageNum = 1;

      # Read pdftotext's output from a pipe; the trailing "-" sends text to stdout.
      open my $fh, '-|', 'pdftotext -layout multipage.pdf -' or die $!;

      print "---------- Begin Page $pageNum ----------\n";

      while ( my $line = <$fh> ) {
          # A form feed (\f, 0x0C) marks a page break.
          if ( $line =~ /\f/ ) {
              print "\n---------- End Page $pageNum ----------\n";
              $pageNum++;
              print "---------- Begin Page $pageNum ----------\n";
          }

          $i++;
          print "\n<div class=\"line\"><div>$i</div>$line</div>";
      }

      close $fh;
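
      If you would rather index pages directly than track page breaks line by line, the same form feed idea can be applied to the whole output at once. This is a minimal sketch, assuming the complete pdftotext output is already in a string and that a form feed follows every page, the last one included; split_pages is a hypothetical helper name.

```perl
use strict;
use warnings;

# Split pdftotext output into one string per page, using the form feed
# (\f, 0x0C) written at each page break.
sub split_pages {
    my ($text) = @_;

    # A negative limit keeps trailing empty fields, so blank pages survive.
    my @pages = split /\f/, $text, -1;

    # Drop the single empty field left after a final form feed.
    pop @pages if @pages && $pages[-1] eq '';

    return @pages;
}

# Usage (assumes pdftotext is installed):
# my @pages = split_pages( scalar qx{pdftotext -layout multipage.pdf -} );
# print "Page 2:\n$pages[1]" if @pages > 1;
```

      Page N then lives at index N - 1, so the line counting and the regex can run against one page at a time.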

      Another option which may serve you is using CAM::PDF:

      use strict;
      use warnings;
      use CAM::PDF;

      my $pdf = CAM::PDF->new('multipage.pdf');

      for my $pageNumber ( 1 .. $pdf->numPages() ) {
          my $pageText = $pdf->getPageText($pageNumber);

          # Lines 10 to 15 of the page; grep drops the undef entries
          # that the slice produces when a page has fewer than 15 lines.
          my @certainLines = grep { defined } ( split /\n/, $pageText )[ 9 .. 14 ];

          print "---------- Lines 10 - 15 on Page $pageNumber ----------\n";
          print +( join "\n", @certainLines ) . "\n";
          print "---------- End Page $pageNumber ----------\n";
      }

      The above shows how to grab a range of text lines from the converted pdf page. You may find, however, that pdftotext does a better rendering job.
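
      The same kind of line-range selection works on page text from either tool. Below is a small sketch of a reusable helper; lines_in_range is a hypothetical name, and it takes 1-based, inclusive line numbers and clips the range to the lines that actually exist.

```perl
use strict;
use warnings;

# Return lines $first..$last (1-based, inclusive) of $text, clipped to the
# lines that actually exist so short pages do not yield undef entries.
sub lines_in_range {
    my ( $text, $first, $last ) = @_;
    my @lines = split /\n/, $text;

    $last = @lines if $last > @lines;
    return () if $first < 1 || $first > $last;

    return @lines[ $first - 1 .. $last - 1 ];
}

# Usage: print lines 10 to 15 of a page's text.
# print "$_\n" for lines_in_range( $pageText, 10, 15 );
```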

      Hope this helps!

        That's great. You can't beat pdftotext for rendering the contents of the file being read, but I appreciate the CAM::PDF example as a quick alternative that will work in some circumstances.
