PerlMonks  

Re: XPDF pdftotext page loop

by Anonymous Monk
on Oct 11, 2012 at 04:13 UTC ( #998355=note )


in reply to XPDF pdftotext page loop

??

chomp( my $number_of_pages = qx{ pdftotext ... multipage.pdf } );
chomp(
my( $number_of_pages ) =
  qx{ pdfinfo ... multipage.pdf }
  =~ /Pages:\s+(\d+)/
);
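Once the page count is known, one way to loop over pages is to run pdftotext once per page with its -f (first page) and -l (last page) options. A minimal sketch, assuming pdftotext is on the PATH and that list-form pipe opens work on your platform; the helper names page_command and page_text are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: build the argument list that extracts a single page.
# -f / -l select the first and last page; "-" sends the text to STDOUT.
sub page_command {
    my ( $pdf, $page ) = @_;
    return ( 'pdftotext', '-layout', '-f', $page, '-l', $page, $pdf, '-' );
}

# Hypothetical helper: run pdftotext for one page and return its text.
sub page_text {
    my ( $pdf, $page ) = @_;
    open my $fh, '-|', page_command( $pdf, $page )
        or die "Cannot run pdftotext: $!";
    local $/;    # slurp the whole output at once
    my $text = <$fh>;
    close $fh;
    return defined $text ? $text : '';
}

# Usage (needs pdftotext and a real file, so it is left commented out):
# for my $page ( 1 .. $number_of_pages ) {
#     my $text = page_text( 'multipage.pdf', $page );
#     print "---------- Page $page ----------\n$text";
# }
```

This starts one process per page, which is simple but slow on large files; splitting a single run on form feeds avoids that.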


Re^2: XPDF pdftotext page loop
by Anonymous Monk on Oct 11, 2012 at 04:13 UTC
Re^2: XPDF pdftotext page loop
by DST609 (Novice) on Oct 11, 2012 at 04:37 UTC
    Sorry, this is my first post; perhaps I should clarify. There is no page-number text on the page. I have already used pdfinfo to retrieve the number of pages, and I know how to count lines and run the basic regex I need to pull out my data. What I really need to know is how to tell where page 1 ends and page 2 starts using xpdf and pdftotext.

      Since pdftotext defaults to inserting form feed characters between pages, you can examine each line for a form feed character as an indication of pagination:

      use strict;
      use warnings;

      my $i       = 0;
      my $pageNum = 1;

      # Read pdftotext's output through a pipe ("-" sends the text to STDOUT).
      open my $fh, "pdftotext -layout multipage.pdf - |" or die $!;

      print "---------- Begin Page $pageNum ----------\n";

      while ( my $line = <$fh> ) {

          # A form feed (\xC) on a line marks a page boundary.
          if ( $line =~ /\xC/ ) {
              print "\n---------- End Page $pageNum ----------\n";
              $pageNum++;
              print "---------- Begin Page $pageNum ----------\n";
          }

          $i++;
          print "\n<div class=\"line\"><div>$i</div>$line</div>";
      }

      close $fh;
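If you would rather work with whole pages than scan line by line, the same form feed convention lets you split the complete output into one string per page. A rough sketch, assuming every page (including the last) is terminated by a form feed; split_pages is a made-up helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: split pdftotext output into one string per page.
# Assumes pages are terminated by form feeds (\f).
sub split_pages {
    my ($text) = @_;

    # The -1 limit keeps empty fields, so blank pages are not lost.
    my @pages = split /\f/, $text, -1;

    # Drop the empty piece that follows the final form feed, if any.
    pop @pages if @pages && $pages[-1] eq '';

    return @pages;
}

# Usage (needs pdftotext and a real file, so it is left commented out):
# my @pages = split_pages( scalar qx{pdftotext -layout multipage.pdf -} );
# print "Page 2 starts with: ", ( split /\n/, $pages[1] )[0], "\n";
```

Page N is then $pages[N - 1], and the usual line counting and regex work can be done on each page separately.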

      Another option which may serve you is using CAM::PDF:

      use strict;
      use warnings;
      use CAM::PDF;

      my $pdf = CAM::PDF->new('multipage.pdf');

      for my $pageNumber ( 1 .. $pdf->numPages() ) {
          my $pageText = $pdf->getPageText($pageNumber);

          # Lines 10 through 15 are list indices 9 through 14.
          my @certainLines = ( split /\n/, $pageText )[ 9 .. 14 ];

          print "---------- Lines 10 - 15 on Page $pageNumber ----------\n";
          print +( join "\n", @certainLines ) . "\n";
          print "---------- End Page $pageNumber ----------\n";
      }

      The above shows how to grab a range of text lines from the converted pdf page. You may find, however, that pdftotext does a better rendering job.
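One thing to watch when slicing a fixed range: a page with fewer lines than the range yields undef entries, which trigger warnings under "use warnings". A small sketch of a range helper that clamps to the lines actually present; line_range is a made-up name and works on any page text, whichever tool produced it:

```perl
use strict;
use warnings;

# Hypothetical helper: return lines $from .. $to (1-based, inclusive)
# of a page's text, clamped to the lines that actually exist.
sub line_range {
    my ( $text, $from, $to ) = @_;
    my @lines = split /\n/, $text;

    return () if $from > @lines;      # range starts past the last line
    $to = @lines if $to > @lines;     # shorten the range to fit the page

    return @lines[ $from - 1 .. $to - 1 ];
}

# Usage:
# my @certainLines = line_range( $pageText, 10, 15 );
```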

      Hope this helps!

        That's great. You can't beat pdftotext for rendering the contents of the file being read, but I appreciate the CAM::PDF example as a quick alternative that will work in some circumstances.
