
XPDF pdftotext page loop

by DST609 (Novice)
on Oct 11, 2012 at 03:58 UTC ( #998352=perlquestion )
DST609 has asked for the wisdom of the Perl Monks concerning the following question:

I am using pdftotext to extract text from multipage PDFs and it works great. I need to look for certain lines counting from the top of each page. I can find out how many pages are in the PDF, and I see that pdftotext has first page / last page parameters, but I was looking for a way to do this more efficiently than, say, invoking multiple filehandles on the same document (running a page loop with the pdftotext filehandle within the loop).

Below is the code I am using to extract the data of the entire document:

    my $i = 0;
    open (FILE, "pdftotext -layout multipage.pdf - |");
    while (<FILE>) {
        $i++;
        my ($line) = $_;
        print "\n<div class=\"line\"><div>$i</div>$line</div>";
    }
    close FILE;

Replies are listed 'Best First'.
Re: XPDF pdftotext page loop
by Anonymous Monk on Oct 11, 2012 at 04:13 UTC
    chomp( my $number_of_pages = qx{ pdftotext ... multipage.pdf } );
    my( $number_of_pages ) =
      qx{ pdfinfo ... multipage.pdf }
      =~ /Pages:\s+(\d+)/;
      Sorry, this is my first post; perhaps I should clarify. There is no page-number text on the page. I have already used pdfinfo to retrieve the number of pages, and I know how to count lines and run the basic regex I need to pull out my data. What I really need to know is how to tell where page 1 ends and page 2 starts using xpdf and pdftotext.

        Since pdftotext defaults to inserting form feed characters between pages, you can examine each line for a form feed character as an indication of pagination:

        use strict;
        use warnings;

        my $i       = 0;
        my $pageNum = 1;

        open my $fh, "pdftotext -layout multipage.pdf - |" or die $!;

        print "---------- Begin Page $pageNum ----------\n";

        while ( my $line = <$fh> ) {

            # \xC is the form feed character (the same character as \f)
            if ( $line =~ /\xC/ ) {
                print "\n---------- End Page $pageNum ----------\n";
                $pageNum++;
                print "---------- Begin Page $pageNum ----------\n";
            }

            $i++;
            print "\n<div class=\"line\"><div>$i</div>$line</div>";
        }

        close $fh;
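
        If the line count needs to restart at the top of every page, one option is to slurp the whole conversion and split it on form feeds first. The following is only a sketch: split_pages and pages_from_pdf are hypothetical helper names, and it assumes that pdftotext also writes a form feed after the last page, so a single trailing empty chunk is dropped. The list form of the piped open may not be available on older Windows perls.

```perl
use strict;
use warnings;

# Split pdftotext output into pages (on form feeds), then into lines.
# Returns a list of array refs, one per page.
sub split_pages {
    my ($text) = @_;

    # The -1 limit keeps trailing empty fields, so that only the one
    # produced by a form feed after the last page (an assumption about
    # pdftotext's output) is dropped.
    my @pages = split /\f/, $text, -1;
    pop @pages if @pages && $pages[-1] eq '';

    return map { [ split /\n/, $_ ] } @pages;
}

# Hypothetical driver: run pdftotext once and read everything it prints.
sub pages_from_pdf {
    my ($file) = @_;
    open my $fh, '-|', 'pdftotext', '-layout', $file, '-'
        or die "Cannot run pdftotext: $!";
    local $/;    # slurp mode
    my $text = <$fh>;
    close $fh;
    return split_pages($text);
}

# Usage: perl script.pl multipage.pdf
if (@ARGV) {
    my $page_no = 0;
    for my $page ( pages_from_pdf( $ARGV[0] ) ) {
        $page_no++;
        my $line_no = 0;
        for my $line (@$page) {
            $line_no++;
            print "page $page_no, line $line_no: $line\n";
        }
    }
}
```

        This runs pdftotext only once per document. Blank lines at the bottom of a page are dropped by split, but leading and interior blank lines are kept, so counting from the top of the page is unaffected.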

        Another option which may serve you is using CAM::PDF:

        use strict;
        use warnings;
        use CAM::PDF;

        my $pdf = CAM::PDF->new('multipage.pdf');

        for my $pageNumber ( 1 .. $pdf->numPages() ) {
            my $pageText = $pdf->getPageText($pageNumber);
            my @certainLines = ( split /\n/, $pageText )[ 9 .. 14 ];

            print "---------- Lines 10 - 15 on Page $pageNumber ----------\n";
            print +( join "\n", @certainLines ) . "\n";
            print "---------- End Page $pageNumber ----------\n";
        }

        The above shows how to grab a range of text lines from the converted pdf page. You may find, however, that pdftotext does a better rendering job.
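
        One caveat with a fixed slice such as [ 9 .. 14 ]: on a page with fewer than 15 lines, the missing entries come back as undef and will trigger warnings when printed. A small sketch of a clipped slice follows; lines_from_top is a hypothetical helper name, and the sample page is made-up stand-in data.

```perl
use strict;
use warnings;

# Return lines $first..$last (1-based, counted from the top of the page),
# clipped to the lines that actually exist.
sub lines_from_top {
    my ( $lines, $first, $last ) = @_;
    $last = @$lines if $last > @$lines;
    return () if $first < 1 || $first > $last;
    return @{$lines}[ $first - 1 .. $last - 1 ];
}

# Stand-in for one page of extracted text with only 12 lines.
my @page = map { "line $_" } 1 .. 12;

# Asks for lines 10-15, prints "line 10" through "line 12".
print join( "\n", lines_from_top( \@page, 10, 15 ) ), "\n";
```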

        Hope this helps!

Node Type: perlquestion [id://998352]
Approved by GrandFather