XPDF pdftotext page loop

by DST609 (Novice)
on Oct 11, 2012 at 03:58 UTC (#998352, perlquestion)
DST609 has asked for the wisdom of the Perl Monks concerning the following question:

I am using pdftotext to extract text from multipage PDFs, and it works great. I need to look for certain lines, counting from the top of each page. I can find out how many pages are in the PDF, and I see that pdftotext has first page / last page parameters, but I was looking for a way to do this more efficiently than, say, opening multiple filehandles on the same document (running a page loop with the pdftotext filehandle inside the loop).

Below is the code I am using to extract the text of the entire document:

my $i=0;
open (FILE, "pdftotext -layout multipage.pdf - |");
while(<FILE>) {
    $i++;
    my($line) = $_;
    print "\n<div class=\"line\"><div>$i</div>$line</div>";
}
close FILE;

Replies are listed 'Best First'.
Re: XPDF pdftotext page loop
by Anonymous Monk on Oct 11, 2012 at 04:13 UTC
    chomp( my $number_of_pages = qx{ pdftotext ... multipage.pdf } );
    my( $number_of_pages ) =
      qx{ pdfinfo ... multipage.pdf }
      =~ /Pages:\s+(\d+)/;

      Sorry, this is my first post; perhaps I should clarify. There are no page numbers in the text on the page. I have already used pdfinfo to retrieve the number of pages, and I know how to count lines and run the basic regex I need to pull out my data. What I really need to know is how to tell where page 1 ends and page 2 starts using xpdf and pdftotext.

        Since pdftotext defaults to inserting form feed characters between pages, you can examine each line for a form feed character as an indication of pagination:

        use strict;
        use warnings;

        my $i       = 0;
        my $pageNum = 1;

        open my $fh, "pdftotext -layout multipage.pdf - |" or die $!;

        print "---------- Begin Page $pageNum ----------\n";

        while ( my $line = <$fh> ) {
            if ( $line =~ /\xC/ ) {
                print "\n---------- End Page $pageNum ----------\n";
                $pageNum++;
                print "---------- Begin Page $pageNum ----------\n";
            }

            $i++;
            print "\n<div class=\"line\"><div>$i</div>$line</div>";
        }

        close $fh;
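Since the form feed marks each page boundary, another way to get line numbers that restart at the top of every page is to read the whole output once and split it on form feeds. What follows is only a minimal sketch: split_pages is a hypothetical helper name, and the sample string stands in for the real pdftotext output so the snippet runs without a PDF. It assumes page breaks have not been suppressed with -nopgbrk.

```perl
use strict;
use warnings;

# Hypothetical helper: split pdftotext-style text into pages, and each
# page into lines. Assumes every page ends with a form feed (\f, \x0C).
sub split_pages {
    my ($text) = @_;

    # Limit -1 keeps trailing empty fields so they can be checked explicitly.
    my @pages = split /\f/, $text, -1;

    # A form feed after the last page leaves one empty chunk; drop it.
    pop @pages if @pages && $pages[-1] eq '';

    # Each page becomes a reference to an array of its lines.
    return map { [ split /\n/, $_ ] } @pages;
}

# Sample text standing in for:
#   my $text = qx{pdftotext -layout multipage.pdf -};
my $text = "Title A\nline 2\nline 3\n\fTitle B\nline 2\n\f";

my @pages = split_pages($text);

for my $p ( 0 .. $#pages ) {
    my $lines = $pages[$p];
    for my $n ( 0 .. $#$lines ) {
        # Both page and line numbers are reported 1-based.
        printf "page %d line %d: %s\n", $p + 1, $n + 1, $lines->[$n];
    }
}
```

With the pages held as arrays, line 10 of page 3 is simply $pages[2][9], and pdftotext only has to run once per document.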

        Another option which may serve you is using CAM::PDF:

        use strict;
        use warnings;
        use CAM::PDF;

        my $pdf = CAM::PDF->new('multipage.pdf');

        for my $pageNumber ( 1 .. $pdf->numPages() ) {
            my $pageText = $pdf->getPageText($pageNumber);
            my @certainLines = ( split /\n/, $pageText )[ 9 .. 14 ];

            print "---------- Lines 10 - 15 on Page $pageNumber ----------\n";
            print +( join "\n", @certainLines ) . "\n";
            print "---------- End Page $pageNumber ----------\n";
        }

        The above shows how to grab a range of text lines from the converted pdf page. You may find, however, that pdftotext does a better rendering job.
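One thing to watch with a fixed slice such as [ 9 .. 14 ]: on a page with fewer than 15 lines the slice can yield undef entries, and join will then complain about uninitialized values under warnings. A rough sketch of a range lookup that clamps to the lines actually present is below; lines_in_range is a made-up helper name, and the generated sample text stands in for one page of extracted text.

```perl
use strict;
use warnings;

# Hypothetical helper: return lines $first .. $last (1-based, inclusive)
# of $text, clamped to the lines that actually exist.
sub lines_in_range {
    my ( $text, $first, $last ) = @_;
    my @lines = split /\n/, $text;

    $first = 1      if $first < 1;
    $last  = @lines if $last > @lines;    # @lines in scalar context is the count

    return () if $first > $last;          # nothing in range
    return @lines[ $first - 1 .. $last - 1 ];
}

# Sample page with only 12 lines, standing in for one page of text.
my $page_text = join "\n", map { "row $_" } 1 .. 12;

# Asking for lines 10 to 15 returns just the three lines that exist.
print "$_\n" for lines_in_range( $page_text, 10, 15 );
```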

        Hope this helps!