PerlMonks  

Extracting text from PDF. No really

by clinton (Priest)
on Mar 28, 2008 at 11:56 UTC (#676954=perlquestion)
clinton has asked for the wisdom of the Perl Monks concerning the following question:

Before you copy-paste a Super Search response, hear me out.

This question has been asked many times: how do you extract text from a PDF document? The answer is always "use CAM::PDF". And believe me, I've tried.

I have never yet seen it produce anything usable. Instead, it produces a string of meaningless characters. I've tried using the included utility getpdftext.pl on various PDFs, including a very simple one created from Open Office, and the result is always a meaningless jumble.

Take this PDF for example. It is PDF version 1.3, not encrypted and not optimized. getpdftext.pl produces this:

7A P6. A.X+"NP>. +CRAPZ +CRMP + Q _ JU A ^0 A. GG e + [ m + /? ++Z @7> +>.M 6C@.N >7@7P.- + Q _ JU _ Z / _ Z" .-.A KMCK.MP7.N >7@7P.- - +8+Z" _ Z / K # ^ w ++"JZ? & JZ / G+ _m + ^ > +U _ // + #? JQQ m +G + _ # " _ // G+ A + ? w_ &/ Q+ 8 m^Z P A Z ++ + ^ 8 Z /A + ^ 8#/ _ / P G+ >_ ? + ^ 8#/& ~L 8 _ A& J"+ ~A + ? w_ & +/ Q+ 8 m^Z P A Z+ ^Z } : -_ / + } = / G" _ A ^8 " m # JQ 6GG 0 P JU+ +} e G r 1G _ U " Z A m+ #& ^Z G _ > JZ? _ ZJ Z / + # + &/ JZ / G+ & + +m # ^ w ++"JZ? & _ Z" JZ / +Z"JZ? / ^ _ mm+ _ #& G^ 8 Q" "^ & ^^ Z / +G+ _m ^ > + " _ / +r P G+ + Q _ JU _ Z /  && ^QJ w J / ^ #& _ # + X_ # " 6_ " _ ? _ A ^8 N +_ Z"? _ / + 6 ^ 8& + ~ e G6 L 8 _ A& J"+ ~A + ? w_ &/ Q+ 8 m^Z P A Z+ + ~A . e 1 -Y~ / +Q } G e[e 6G == GGG ~ # +8 } .4 r << r @7> e0er 6* : +e

pdftotext -layout from xpdf produces this:

IN THE NEWCASTLE COUNTY COURT Claim No 8NE00169 between MILLER HOMES LIMITED Claimant and EDEN PROPERTIES LIMITED Defendant Proceedings in the above matter will be heard at the ewcastle upon Tyne County Court at The Law ourts, Quayside, Newcastle upon Tyne on:− Date: 4th day of April 2008 Time: 10.30am ny person having an interest in these proceedings nd intending to appear should do so on the above ate. he Claimant’s solicitors are Ward Hadaway of andgate House, 102 Quayside, Newcastle upon yne, NE1 3DX, tel: 0191 204 4000, ref: F.JJ.MIL181.2751

... clearly dropping the first character on several lines. (It does this in -raw mode as well). And yet xpdf and evince both display this PDF correctly.

So where to from here? I need something that works on Linux, preferably open source, preferably Perl. The other PDF modules on CPAN (such as PDF::API2, PDF::Reuse and PDFLib) seem to be intended for generating new PDFs, not extracting the contained text.

So what else can I try? Any suggestions?

Clint

Re: Extracting text from PDF. No really
by Fletch (Chancellor) on Mar 28, 2008 at 12:05 UTC

    xpdf comes with a pdftotext which I've had fairly good luck with. It also is smart enough to extract and preserve (most) formatting (or at least most of what's been in what I've run through it . . . :). Perhaps install that if you don't already have it and open a pipe from it.

    Update: MENTAL NOTE: Wait until morning caffeine has taken effect enough for reading comprehension to function before attempting to solve problems. KTHXBAI.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

      Thanks Fletch, but you'll see that my second example in the root node already uses pdftotext, and it is dropping the first character on many lines. Yet xpdf displays the PDF correctly!
Re: Extracting text from PDF. No really
by mirod (Canon) on Mar 28, 2008 at 13:16 UTC

    I have had some success with pdftohtml in the past.

    It wasn't easy though. The tool has two major modes, html and xml. I can't remember exactly what the problem was with the html mode, but I ended up not using it at all. I used the xml mode, with a LOT of post-processing (in Perl).

    For starters the XML was not valid (i, b, u and a tags were not properly nested), so I had to disentangle them. Then what you get is a bunch of strings with their position on the page. From there I had to order them, merge them to create lines (sub/superscripts needed to be handled of course), and then create paragraphs... fun!

    That was with version 0.36, the one that seems to come with most Linux distributions (it was released in 2002). Sourceforge has some more recent ("experimental") versions. I tried 0.40a, which produced a wildly different output, at least in xml mode, and gave up. The problem with version 0.36 is that it has trouble with some recent PDFs (version 1.6).

    Overall it was quite painful, but in the end I managed to extract some information from the files.

    Oddly enough I am currently using pdftotext for another project, and it seems to be doing quite well, even though of course the output is simpler than what pdftohtml produces. I haven't noticed it dropping letters so far.
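
The ordering and line-merging step described in this reply could be sketched roughly as follows. This is only an illustration, not mirod's actual code: the fragment keys (top, left, text), the function name merge_into_lines and the 2-unit tolerance are all assumptions, and sub/superscripts and paragraph detection are not handled.

```perl
use strict;
use warnings;

# Merge positioned text fragments into lines of text.
# Each fragment is a hash ref with hypothetical keys: top, left, text.
# Fragments whose "top" differs by at most $tolerance from the first
# fragment of the current line are treated as belonging to that line.
sub merge_into_lines {
    my ($fragments, $tolerance) = @_;
    $tolerance = 2 unless defined $tolerance;

    # Reading order: top to bottom, then left to right.
    my @sorted = sort {
        $a->{top} <=> $b->{top} || $a->{left} <=> $b->{left}
    } @$fragments;

    my @lines;    # each entry: { top => number, parts => [fragments] }
    for my $frag (@sorted) {
        my $current = @lines ? $lines[-1] : undef;
        if ($current && abs($frag->{top} - $current->{top}) <= $tolerance) {
            push @{ $current->{parts} }, $frag;
        }
        else {
            push @lines, { top => $frag->{top}, parts => [$frag] };
        }
    }

    # Within a line, fragments may still be out of order horizontally
    # (their "top" values differ slightly), so sort by "left" again.
    my @text;
    for my $line (@lines) {
        my @parts = sort { $a->{left} <=> $b->{left} } @{ $line->{parts} };
        push @text, join(' ', map { $_->{text} } @parts);
    }
    return @text;
}

# Made-up sample data, purely for illustration.
my @sample = (
    { top => 101, left => 10, text => 'Hello' },
    { top => 100, left => 60, text => 'world' },
    { top => 120, left => 10, text => 'second line' },
);
print "$_\n" for merge_into_lines(\@sample);
```

A real converter would also need to look at font size and vertical offsets to attach sub/superscripts to the right line, which is where most of the pain mentioned above tends to come from.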

Re: Extracting text from PDF. No really
by wazoox (Prior) on Mar 28, 2008 at 14:23 UTC
    pdftotext from poppler-0.6.4 / xpdf 3.02 gives a decent result for me:
    IN THE NEWCASTLE COUNTY COURT Claim No 8NE00169 between MILLER HOMES LIMITED Claimant and EDEN PROPERTIES LIMITED Defendant Proceedings in the above matter will be heard at the Newcastle upon Tyne County Court at The Law Courts, Quayside, Newcastle upon Tyne on:− Date: 4th day of April 2008 Time: 10.30am Any person having an interest in these proceedings and intending to appear should do so on the above date. The Claimant’s solicitors are Ward Hadaway of Sandgate House, 102 Quayside, Newcastle upon Tyne, NE1 3DX, tel: 0191 204 4000, ref: EF.JJ.MIL181.2751
      wazoox, you're a **star** - I had version 3.01 of xpdf installed - upgrading to 3.02 fixed that issue.

      many thanks!
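
Since the fix here came down to the installed version, a small check like the one below may help before blaming the PDF. It is a sketch under assumptions: it relies on `pdftotext -v` printing a banner of the form "pdftotext version X.Y" (on standard error, hence the redirect), and the function names are made up for this example. Note that Xpdf and Poppler number their releases differently, so the result only makes sense if you know which of the two is installed.

```perl
use strict;
use warnings;

# Pull the version number out of a pdftotext banner string.
# Returns undef when the banner does not look as assumed.
sub parse_pdftotext_version {
    my ($banner) = @_;
    return undef unless defined $banner;
    return $banner =~ /pdftotext version (\d+(?:\.\d+)*)/ ? $1 : undef;
}

# Ask the installed pdftotext for its banner. The banner is assumed
# to go to standard error, so it is redirected into the captured output.
sub installed_pdftotext_version {
    my $banner = `pdftotext -v 2>&1`;
    return parse_pdftotext_version($banner);
}

# Only query the system when run as a script.
if (!caller) {
    my $version = installed_pdftotext_version();
    print defined $version
        ? "pdftotext version $version\n"
        : "pdftotext not found or banner not recognised\n";
}
```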

        I can only concur - this utility is brilliant and works much better than any of the Perl modules I have come across so far. Thanks so much for bringing it up! I will investigate it in further detail from now on.

        Cheers -

        Pat
Re: Extracting text from PDF. No really
by chrisdolan (Beadle) on Mar 29, 2008 at 03:20 UTC

    I'm the author of CAM::PDF. Even under the best circumstances, getpdftext.pl produces barely readable output. My module doesn't have a renderer, so the text extraction is a total hack that I tossed into the module for fun.

    I'm quite pleased that other tools have produced good results! CAM::PDF (which I barely maintain anymore, I'm sorry to say) is optimized for high-performance, low-level editing of PDF documents.

      Thanks for responding, Chris. You may be interested to know (as I mentioned in the OP) that the question "how do I extract text from a PDF" comes up a lot, and that the standard answer is always CAM::PDF.

      After your response, it seems that there is no Perl module for reading/rendering PDFs, and that about the only reliable open source way to do it is via pdftotext from either Xpdf or Poppler.

      Clint
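
For anyone landing here later, calling pdftotext from Perl through a pipe might look something like the sketch below. It assumes pdftotext (from Xpdf or Poppler) is on the PATH; the function names pdftotext_command and extract_text are made up for this example. The -layout, -raw and -enc options and the "-" output argument (write to standard output) are pdftotext options, but check your installed version's man page.

```perl
use strict;
use warnings;

# Build the argument list for pdftotext. A trailing "-" asks pdftotext
# to write the extracted text to standard output instead of a file.
sub pdftotext_command {
    my ($pdf, %opt) = @_;
    my @cmd = ('pdftotext');
    push @cmd, '-layout' if $opt{layout};    # keep the physical layout
    push @cmd, '-raw'    if $opt{raw};       # content-stream order
    push @cmd, '-enc', 'UTF-8';
    push @cmd, $pdf, '-';
    return @cmd;
}

# Run pdftotext through a list-form pipe open (no shell involved)
# and return everything it prints as one decoded string.
sub extract_text {
    my ($pdf, %opt) = @_;
    my @cmd = pdftotext_command($pdf, %opt);
    open(my $fh, '-|', @cmd) or die "Cannot run $cmd[0]: $!";
    binmode($fh, ':encoding(UTF-8)');
    local $/;                                # slurp mode
    my $text = <$fh>;
    close($fh) or die "$cmd[0] failed (exit status " . ($? >> 8) . ")";
    return $text;
}

# Only run when invoked as a script with a file name argument.
if (!caller && @ARGV) {
    binmode(STDOUT, ':encoding(UTF-8)');
    print extract_text($ARGV[0], layout => 1);
}
```

The list form of open avoids quoting problems with odd file names, and checking the result of close catches the case where pdftotext starts but exits with an error.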
