PerlMonks  

How to invoke pdftotext and extract first line of text from PDF file?

by punch_card_don (Curate)
on Mar 28, 2010 at 22:54 UTC
punch_card_don has asked for the wisdom of the Perl Monks concerning the following question:

Mushrooming Monks,

After quite some Googling of "perl extract text from pdf" and a host of related permutations, I've concluded that good old pdftotext (part of Xpdf) is the best way to go when all you want is a plain text string out of a PDF file. (That includes a PerlMonks post by the author of the Perl module CAM saying so.)

BUT - how, exactly, do I use it in my perl script?

That is, if I have foo.pdf in a directory on my server, and my own script extract_text_from_pdf.pl in my cgi-bin, what is the correct syntax for

    $content = some-pdftotext-command('foo.pdf');

Then, what I really want is just the first line of text - that is, the title at the top of the page - but I'm guessing it's not going to conveniently insert carriage returns that I can recognize - or will it?

So even better would be

    @lines = some-other-pdftotext-command('foo.pdf');

but let's not get unreasonable....

Thanks.




Time flies like an arrow. Fruit flies like a banana.

Re: How to invoke pdftotext and extract first line of text from PDF file?
by punch_card_don (Curate) on Mar 28, 2010 at 23:18 UTC
    OK - figured it out - here's what worked for me:

    open(FILE, "pdftotext -layout foo.pdf - |") or die "Cannot run pdftotext: $!";
    $title = <FILE>;
    print "<p>$title\n";
    close FILE;
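
    One caveat with taking the very first line from the pipe: with -layout the output may start with blank or indented lines, so the title is not guaranteed to be line one. A minimal sketch of a helper that skips those, assuming the title is simply the first line containing any visible text (the name first_nonblank_line is made up for illustration):

```perl
use strict;
use warnings;

# Return the first line that contains a non-whitespace character,
# with leading and trailing whitespace trimmed; undef if there is none.
sub first_nonblank_line {
    my ($fh) = @_;
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;      # trim both ends (also drops the newline)
        return $line if length $line;
    }
    return;
}

# Hypothetical usage; 'foo.pdf' is a placeholder path:
# open(my $fh, '-|', 'pdftotext', '-layout', 'foo.pdf', '-')
#     or die "Cannot run pdftotext: $!";
# my $title = first_nonblank_line($fh);
# close $fh;
```

    The helper only needs a readable filehandle, so it can be tried on a plain text file before pdftotext is wired in.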



    Time flies like an arrow. Fruit flies like a banana.
Re: How to invoke pdftotext and extract first line of text from PDF file?
by LanX (Canon) on Mar 28, 2010 at 23:23 UTC
    That's what I did:
    open ( my $fh, "-|","pdftotext -layout $file -") or die "error extracting $file";

    But I really recommend using pdftohtml -xml -stdout instead if you need reliable information about text position, page number and the font used (family, size and color).

    Cheers Rolf
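
    A note on the open call above: because the command is a single string, $file is interpolated into a shell command line, so a file name containing spaces or shell metacharacters can break it. A sketch of the list form of a piped open, which hands the arguments to pdftotext directly, is below. The helper names pdftotext_cmd and open_pdftotext are invented for this example, and it assumes a Unix-like system with pdftotext on the PATH.

```perl
use strict;
use warnings;

# Build the argument list for pdftotext; the trailing '-' sends text to stdout.
sub pdftotext_cmd {
    my ($file) = @_;
    return ('pdftotext', '-layout', $file, '-');
}

# Open a read pipe from pdftotext without going through the shell.
sub open_pdftotext {
    my ($file) = @_;
    open(my $fh, '-|', pdftotext_cmd($file))
        or die "Cannot run pdftotext on $file: $!";
    return $fh;
}

# Hypothetical usage:
# my $fh = open_pdftotext('my report.pdf');
# my $title = <$fh>;
# close $fh or warn "pdftotext exited with status $?";
```

    Checking the result of close matters for piped opens, since that is where a non-zero exit status from the child shows up.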

      You can use backticks also:
      $text = `$Globals::pdftotext_bin -layout $pdffile -`; if ($?) { log_error(...) }
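
      Once the whole output is in $text, the lines asked for in the original question are a split away. pdftotext separates pages with a form feed character unless told otherwise, so the top line of every page can be picked out as well. A rough sketch, assuming each page's title is its first non-blank line (page_titles is a made-up name):

```perl
use strict;
use warnings;

# Split pdftotext output into pages on form feed and return, for each page,
# its first non-blank line with surrounding whitespace trimmed.
sub page_titles {
    my ($text) = @_;
    my @titles;
    for my $page (split /\f/, $text) {
        my ($first) = grep { /\S/ } split /\r?\n/, $page;
        next unless defined $first;   # page had no visible text
        $first =~ s/^\s+|\s+$//g;
        push @titles, $first;
    }
    return @titles;
}

# Hypothetical usage with the backtick result from above:
# my @titles = page_titles($text);
# my @lines  = split /\r?\n/, $text;   # every line, if that is what you want
```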

Node Type: perlquestion [id://831519]
Approved by toolic
Front-paged by ww