Re: Extracting text from PDF. No really

by mirod (Canon)
on Mar 28, 2008 at 13:16 UTC

in reply to Extracting text from PDF. No really

I have had some success with pdftohtml in the past.

It wasn't easy though. The tool has 2 major modes, HTML and XML. I can't remember exactly what the problem was with the HTML mode, but I ended up not using it at all. I used the XML mode, with a LOT of post-processing (in Perl).

For starters the XML was not valid (i, b, u and a tags were not properly nested), so I had to disentangle them. Then what you get is a bunch of strings with their position on the page. From there I had to order them, merge them to create lines (sub/superscripts needed to be handled of course), and then create paragraphs... fun!
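To give a rough idea of the "order them, merge them to create lines" step, here is a minimal sketch. It is written in Python for brevity (the original post-processing was in Perl) and it is not the code I used. The `Fragment` type, the `merge_into_lines` name and the `tolerance` value are all made up for illustration, and it assumes the positioned strings have already been pulled out of the XML as (top, left, text) triples.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """One positioned string (hypothetical shape, not the tool's own format)."""
    top: float   # vertical position on the page (smaller = higher up)
    left: float  # horizontal position
    text: str


def merge_into_lines(fragments: List[Fragment], tolerance: float = 3.0) -> List[str]:
    """Group fragments with close vertical positions, then order each group left to right."""
    lines: List[List[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: (f.top, f.left)):
        # Compare against the first fragment of the current line, so a
        # slightly raised or lowered script still lands on the same line.
        if lines and abs(frag.top - lines[-1][0].top) <= tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within a line, reading order is left to right.
    return [" ".join(f.text for f in sorted(line, key=lambda f: f.left))
            for line in lines]


fragments = [
    Fragment(top=115, left=10, text="Next line"),
    Fragment(top=97, left=50, text="2"),        # superscript, slightly higher
    Fragment(top=100, left=10, text="E = mc"),
]
print(merge_into_lines(fragments))
```

The tolerance is a guess that would need tuning per document: too small and sub/superscripts end up on lines of their own, too large and adjacent lines get merged. Joining with a space also loses the fact that "2" was a superscript, so a real version would need to keep the offset around.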

That was with version 0.36, the one that seems to come with most Linux distributions (it was released in 2002). SourceForge has some more recent ("experimental") versions. I tried 0.40a, which produced wildly different output, at least in XML mode, and gave up. The drawback of version 0.36 is that it has trouble with some recent PDFs (version 1.6).

Overall it was quite painful, but in the end I managed to extract some information from the files.

Oddly enough I am currently using pdftotext for another project, and it seems to be doing quite well, even though of course the output is simpler than what pdftohtml produces. I haven't noticed it dropping letters so far.
