How to invoke pdftotext and extract first line of text from PDF file?

punch_card_don has asked for the wisdom of the Perl Monks concerning the following question:

Mushrooming Monks,

After quite some Googling of "perl extract text from pdf" and a host of related permutations, I've concluded that good old pdftotext (part of Xpdf) is the best way to go when all you want is a plain text string out of a PDF file (a PerlMonks post by the author of the Perl module CAM says so, too).

BUT - how, exactly, do I use it in my perl script?

That is, if I have foo.pdf in a directory on my server, and my own script extract_text_from_pdf.pl in my cgi-bin, what is the correct syntax for

$content = some-pdftotext-command('foo.pdf');

Then, what I really want is just the first line of text - that is, the title at the top of the page - but I'm guessing it's not going to conveniently insert carriage returns that I can recognize - or will it?

So even better would be

@lines = some-other-pdftotext-command('foo.pdf');

but let's not get unreasonable....

Thanks.

Time flies like an arrow. Fruit flies like a banana.


Replies are listed 'Best First'.
Re: How to invoke pdftotext and extract first line of text from PDF file? by LanX (Saint) on Mar 28, 2010 at 23:23 UTC
That's what I did: `open ( my $fh, "-|","pdftotext -layout $file -") or die "error extracting $file";` But I really recommend using `pdftohtml -xml -stdout` instead if you need more reliable information about text position, page number, and the font used (family, size, and color). Cheers Rolf
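
A minimal sketch of the same piped open in list form, which hands the arguments to pdftotext directly instead of going through the shell, so a `$file` containing spaces or shell metacharacters is passed through unchanged. The sub name `pdf_lines` is made up for illustration, and the sketch assumes `pdftotext` is on the PATH and that your Perl supports list-form pipe opens on your platform.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of a PDF as a list of lines.
# Assumes a pdftotext binary can be found on PATH.
sub pdf_lines {
    my ($file) = @_;

    # List form of the "-|" open: no shell is involved, so $file
    # needs no quoting. The trailing "-" sends the text to stdout.
    open(my $fh, '-|', 'pdftotext', '-layout', $file, '-')
        or die "cannot run pdftotext: $!";

    my @lines = <$fh>;

    # close() on a pipe returns false if the command exited non-zero;
    # the exit status is then available in $?.
    close($fh)
        or die "pdftotext failed on $file (status $?)";

    chomp @lines;
    return @lines;
}
```

Calling `my @lines = pdf_lines('foo.pdf');` then gives one element per line of output, which also covers the `@lines` form asked about in the original question.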
Re^2: How to invoke pdftotext and extract first line of text from PDF file? by brycen (Monk) on Mar 29, 2010 at 05:21 UTC
You can use backticks also: `` $text = `$Globals::pdftotext_bin -layout $pdffile -`; if ($?) { log_error(...) } ``
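
A rough sketch of the backtick approach with the exit status decoded from `$?`. Backticks in list context return one element per line. Because the command string goes through the shell here, the filename is single-quoted first. The names `sh_quote` and `pdftotext_lines` are hypothetical, the quoting assumes a POSIX shell, and `pdftotext` is assumed to be on the PATH.

```perl
use strict;
use warnings;

# Hypothetical helper: single-quote a string for a POSIX shell.
sub sh_quote {
    my ($s) = @_;
    $s =~ s/'/'\\''/g;    # close quote, escaped quote, reopen quote
    return "'$s'";
}

# Hypothetical helper: run pdftotext via backticks, return its lines.
sub pdftotext_lines {
    my ($pdffile) = @_;
    my $cmd = 'pdftotext -layout ' . sh_quote($pdffile) . ' -';

    my @lines = `$cmd`;   # list context: one element per output line

    if ($? == -1) {
        die "could not start command: $!";
    }
    elsif ($? & 127) {
        die sprintf "pdftotext died with signal %d", $? & 127;
    }
    elsif ($? >> 8) {
        # The high byte of $? holds the exit status of the command.
        die sprintf "pdftotext exited with status %d", $? >> 8;
    }
    return @lines;
}
```

Unlike the piped open, this reads all output into memory before returning, which is fine for a title lookup but worth keeping in mind for very large documents.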
Re: How to invoke pdftotext and extract first line of text from PDF file? by punch_card_don (Curate) on Mar 28, 2010 at 23:18 UTC
OK - figured it out - here's what worked for me: `open (FILE, "pdftotext -layout foo.pdf - |"); $title = <FILE>; print "<p>$title\n"; close FILE;` Time flies like an arrow. Fruit flies like a banana.
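
One caveat with reading only the first line: with `-layout`, the output can begin with blank or indented lines, so the very first line read is not always the title. A small sketch that skips those, assuming the title is simply the first line with visible text. The sub name `first_text_line` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first line that contains visible
# text, with surrounding whitespace removed. \s also matches the
# form feed that pdftotext writes at page breaks. Returns undef if
# the handle yields no such line.
sub first_text_line {
    my ($fh) = @_;
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;      # trim both ends
        return $line if length $line;
    }
    return;
}

# Possible usage, with a lexical filehandle and an error check:
#   open(my $fh, '-|', 'pdftotext', '-layout', 'foo.pdf', '-')
#       or die "cannot run pdftotext: $!";
#   my $title = first_text_line($fh);
#   close($fh);
```

If the title is printed into an HTML page, as in the post above, it should be HTML-escaped first, since text taken from a PDF may contain `<` or `&`.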
