I have had some success with pdftohtml in the past.

It wasn't easy though. The tool has 2 major modes, html and xml. I can't remember exactly what the problem was with the html mode, but I ended up not using it at all. I used the xml mode, with a LOT of post-processing (in Perl).

For starters the XML was not valid (the i, b, u and a tags were not properly nested), so I had to disentangle them. Then what you get is a bunch of strings with their positions on the page. From there I had to order them, merge them to create lines (sub/superscripts needed to be handled, of course), and then create paragraphs... fun!
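To give an idea of the ordering and line-merging step, here is a rough sketch. It is written in Python purely for illustration and is not the Perl I actually used. The `Fragment` class, its field names and the `tolerance` value are all assumptions made up for this example; it simply supposes that each string comes with a top and left coordinate and a height. Fragments whose tops are close enough (relative to their height) are treated as belonging to the same line, which is one crude way to keep a superscript with its base text.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    # Hypothetical structure: one positioned string taken from the XML output.
    top: int      # distance from the top of the page
    left: int     # distance from the left edge of the page
    height: int   # height of the text box
    text: str


def merge_into_lines(fragments, tolerance=0.5):
    """Group fragments into lines, then order each line left to right.

    Two fragments share a line when their tops differ by at most
    `tolerance` times the taller of the two heights (an assumed heuristic).
    """
    lines = []
    for frag in sorted(fragments, key=lambda f: (f.top, f.left)):
        for line in lines:
            ref = line[0]  # first fragment placed on this line
            limit = tolerance * max(frag.height, ref.height)
            if abs(frag.top - ref.top) <= limit:
                line.append(frag)
                break
        else:
            # No existing line was close enough: start a new one.
            lines.append([frag])
    return [
        " ".join(f.text for f in sorted(line, key=lambda f: f.left))
        for line in lines
    ]


if __name__ == "__main__":
    page = [
        Fragment(top=120, left=50, height=12, text="Next"),
        Fragment(top=100, left=90, height=12, text="world"),
        Fragment(top=97, left=130, height=8, text="2"),   # a superscript
        Fragment(top=100, left=50, height=12, text="Hello"),
    ]
    for line in merge_into_lines(page):
        print(line)
```

Real pages need more than this (columns, rotated text, building paragraphs out of the lines), so treat it as a starting point only.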

That was with version 0.36, the one that seems to come with most Linux distributions (it was released in 2002). SourceForge has some more recent ("experimental") versions. I tried 0.40a, which produced wildly different output, at least in xml mode, and gave up. The problem with version 0.36 is that it has trouble with some recent PDFs (version 1.6).

Overall it was quite painful, but in the end I managed to extract some information from the files.

Oddly enough, I am currently using pdftotext for another project, and it seems to be doing quite well, even though of course the output is simpler than what pdftohtml produces. I haven't noticed it dropping letters so far.

In reply to Re: Extracting text from PDF. No really by mirod
in thread Extracting text from PDF. No really by clinton
