
extract text from pdf

by jeteve (Pilgrim)
on Nov 08, 2006 at 12:30 UTC ( #582868=perlquestion )
jeteve has asked for the wisdom of the Perl Monks concerning the following question:

Hi wise monks.

I wonder what the simplest way is to extract text from a PDF in Perl.

Of course I can use pdftotext on the command line, but that involves managing temporary files.

So I'm looking for a pure-Perl solution (or one linked to a C library).

I had a look at PDF::API2, but it's geared more towards creating PDFs.

CAM::PDF seemed to fill my need, but I can't manage to use it to extract the text.

I also had a look at SWISH, but internally it uses ... pdftotext :)

Any ideas?

-- Nice photos of naked perl sources here !

Replies are listed 'Best First'.
Re: extract text from pdf
by fenLisesi (Priest) on Nov 08, 2006 at 12:54 UTC
    CAM::PDF was recommended in the earlier thread "How to read the contents of PDF".

    Update: I tried a few things with this module. It works well with some PDF files, but seems to fail in various ways for others. I couldn't get it to work with a few simple PDF files I created and exported from OpenOffice. The module comes with a small script that may help you. Cheers.

Re: extract text from pdf
by mk. (Friar) on Nov 08, 2006 at 13:10 UTC
    Have you tried File::Extract::PDF?
    It uses CAM::PDF internally, but maybe you'll have better luck with it.

      I did try both of those, without success.

      I have a PDF that I created with OpenOffice. pdftotext is able to extract the text from it, whereas CAM::PDF (or File::Extract::PDF) gives me garbage characters.

      [jerome@saab pdf]$ -v ~/faxTaxHabitation2005.pdf
      ! " #  $  % # & ' ( "  ) * + + + ...
      And pdftotext:
      [jerome@saab pdf]$ pdftotext ~/faxTaxHabitation2005.pdf txt
      [jerome@saab pdf]$ tail txt
      Merci de bien vouloir me confirmer ces informations par retour de fax afin que je puisse proceder au paiment le plus rapidement possible au numero suivant : *************
      Cordiales salutations.
      ...

      The ideal would be a perl module linked to the xpdf C code .. :)

      -- Nice photos of naked perl sources here !

Re: extract text from pdf
by Anonymous Monk on Nov 08, 2006 at 16:02 UTC
    What do you mean by "involves managing temporary files"?
    open my $fh, "pdftotext whatever.pdf - |" or die "Cannot run pdftotext: $!";
    # ... read text from $fh ...

      If I just want the PDF's text to use for something else (saving it in a database, ...), I find this line quite convenient:

      my $txt = `pdftotext whatever.pdf -` or die 'ERROR running pdftotext';
      say $txt;
      Or, if the file name is in a variable and the PDF file contains umlauts or other non-ASCII characters:
      my $command_line = qq{pdftotext -enc 'UTF-8' '$path' -};
      my $text = `$command_line` or die 'ERROR running pdftotext';
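
One caveat with the quoted command line above: a path that itself contains a single quote will break the shell quoting. A minimal sketch of the same idea using the list form of a piped open, which bypasses the shell entirely, might look like the following. The sub names slurp_command and pdf_to_text are made up for this illustration; the pdftotext arguments are the ones already used in this thread. It assumes pdftotext is on your PATH and, on Windows, a Perl recent enough to support list-form pipe opens.

```perl
use strict;
use warnings;

# Run a command WITHOUT going through the shell and return everything it
# prints on STDOUT, decoded as UTF-8. (Hypothetical helper name.)
sub slurp_command {
    my @cmd = @_;
    # List form of a piped open: each argument is passed to the program
    # as-is, so spaces and quotes in file names need no escaping.
    open(my $fh, '-|', @cmd) or die "Cannot run $cmd[0]: $!";
    binmode($fh, ':encoding(UTF-8)');
    local $/;                      # slurp mode: read all output at once
    my $out = <$fh>;
    # close() on a pipe returns false if the command exited non-zero.
    close($fh) or die "$cmd[0] failed (exit status " . ($? >> 8) . ")";
    return $out;
}

# Hypothetical wrapper around the pdftotext call shown above;
# the trailing '-' tells pdftotext to write the text to STDOUT.
sub pdf_to_text {
    my ($path) = @_;
    return slurp_command('pdftotext', '-enc', 'UTF-8', $path, '-');
}
```

Usage would be something like my $text = pdf_to_text($path); with no temporary file involved. Unlike the backtick version, this dies only when the command cannot be started or exits with an error, not when a PDF happens to contain no text.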
Re: extract text from pdf
by caelifer (Scribe) on Nov 08, 2006 at 15:39 UTC
    Not really a Perl solution, but... Acrobat Reader 7 supports a 'Save As Text' option. Why not try that? Obviously, this won't work for documents made from images, but nothing short of OCR will.
