PerlMonks  

extract text from pdf

by jeteve (Pilgrim)
on Nov 08, 2006 at 12:30 UTC ( #582868=perlquestion )
jeteve has asked for the wisdom of the Perl Monks concerning the following question:

Hi wise monks.

I wonder what the simplest solution is to extract text from a PDF in Perl.

Of course I can use pdftotext on the command line, but that involves managing temporary files.

So I'm looking for a pure-Perl solution (or one linked to a C library).

I had a look at PDF::API2, but it's more dedicated to creation.

CAM::PDF seems to fill my need, but I can't manage to use it to extract the text.

I also had a look at SWISH, but internally it uses ... pdftotext :)

Any ideas?

-- Nice photos of naked perl sources here !

Re: extract text from pdf
by fenLisesi (Priest) on Nov 08, 2006 at 12:54 UTC
    CAM::PDF was recommended in the earlier thread How to read the contents of PDF

    Update: I tried a few things with this module. It works well with some pdf files, but seems to fail in various ways for others. I couldn't get it to work with a few simple pdf files I created and exported from OpenOffice. The module comes with a small script named getpdftext.pl that may help you. Cheers.

Re: extract text from pdf
by mk. (Friar) on Nov 08, 2006 at 13:10 UTC
    Have you tried File::Extract::PDF?
    It uses CAM::PDF internally, but maybe you'll have better luck with it.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    *women.pm
      I did try both of those .. without success.

      I have a PDF that I created with OpenOffice. pdftotext is able to extract text from it, whereas CAM::PDF (or File::Extract::PDF) gives me garbage characters.

      [jerome@saab pdf]$ getpdftext.pl -v ~/faxTaxHabitation2005.pdf
      ! " #  $  % # & ' ( "  ) * + + + ...

      And pdftotext:

      [jerome@saab pdf]$ pdftotext ~/faxTaxHabitation2005.pdf txt
      [jerome@saab pdf]$ tail txt
      Merci de bien vouloir me confirmer ces informations par retour de fax afin que je puisse proceder au paiment le plus rapidement possible au numero suivant : *************
      Cordiales salutations.
      ...

      The ideal would be a perl module linked to the xpdf C code .. :)

      -- Nice photos of naked perl sources here !

Re: extract text from pdf
by caelifer (Scribe) on Nov 08, 2006 at 15:39 UTC
    Not really a Perl solution, but... Acrobat Reader 7 supports a 'Save As Text' option. Why not try that? Obviously, this won't work for documents made from images, but nothing short of OCR will.

    -BR

Re: extract text from pdf
by Anonymous Monk on Nov 08, 2006 at 16:02 UTC
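    What do you mean by "involves managing temporary files"? Passing - as the output file makes pdftotext write to standard output, so you can read it through a pipe:

    open my $fh, "pdftotext whatever.pdf - |" or die "Cannot run pdftotext: $!";
    # ... read text from $fh ...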

      If I just want the PDF's text to use for whatever (save it in a database, ...), I found this line quite convenient:

      my $txt = `pdftotext whatever.pdf -` or die 'ERROR running pdftotext';
      say $txt;

      Or, if the file name is in a variable and the PDF file contains umlauts or other non-ASCII characters:

      my $command_line = qq{pdftotext -enc 'UTF-8' '$path' -};
      my $text = `$command_line` or die 'ERROR running pdftotext';
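
      One caveat with the backtick form: the path is interpolated into a shell command, so a file name containing a single quote (or other shell metacharacters) will break it. Below is a minimal sketch of the same idea using Perl's list-form pipe open, which passes the arguments straight to the program without a shell. The helper names capture_utf8 and pdf_to_text are made up for illustration. The sketch assumes pdftotext is on the PATH and, as in the snippets above, that -enc UTF-8 and a trailing - send UTF-8 text to standard output.

```perl
use strict;
use warnings;

# Run a command without a shell and return its standard output,
# decoded from UTF-8. (Hypothetical helper name.)
sub capture_utf8 {
    my @cmd = @_;

    # List-form pipe open: each argument reaches the program verbatim,
    # so quotes or spaces in a file name need no escaping.
    open my $fh, '-|', @cmd
        or die "Cannot start $cmd[0]: $!";
    binmode $fh, ':encoding(UTF-8)';

    # Slurp the whole output in one read.
    my $text = do { local $/; <$fh> };

    # close() on a pipe returns false if the program exited non-zero.
    close $fh
        or die "$cmd[0] failed (exit status " . ($? >> 8) . ")";

    return $text;
}

# Extract the text of one PDF via pdftotext. (Hypothetical helper name.)
sub pdf_to_text {
    my ($path) = @_;
    return capture_utf8('pdftotext', '-enc', 'UTF-8', $path, '-');
}
```

      Usage would then be something like my $text = pdf_to_text($path); with no temporary file and no quoting of $path. Unlike the backtick version, an empty result is not treated as an error here; only a failure to start the program or a non-zero exit status is.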
