
Re: Need Help for Convert PDF to HTML

by chrestomanci (Priest)
on Feb 09, 2011 at 07:54 UTC (#887149)

in reply to Need Help for Convert PDF to HTML

We had almost the exact same question back in December, with the same set of answers.

Following up on what CountZero just said, and what I posted back then: this will always be a very hard problem, because PDF files are not designed to have structure. They are more like a printout in electronic form. You can think of PDF as PostScript that is designed to be viewable on screen as well as on paper. A PDF does not contain blocks of text in order with formatting, just lines of text in particular fonts. It is up to the human who reads those lines to decide what is a heading, a column or a footnote.

Any tool that converts PDF to HTML (or Word, plain text, etc.) has to use heuristics to guess structure from this unstructured text on a page. Those tools tend to be expensive, proprietary, and inexact, especially when faced with unusual layouts such as multiple columns or embedded images. OCR tools face similar problems for the same reasons.
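
To make the "heuristics" point concrete, here is a minimal sketch of the kind of guessing such a tool has to do. It is not how any particular converter works. The fragment format (x, y, font size, text), the sample data, and the names `group_into_lines` and `to_html` are all made up for illustration, and the thresholds are arbitrary assumptions.

```python
from html import escape
from statistics import median

# Hypothetical fragments, roughly as a PDF text extractor might report them:
# (x, y, font_size, text), with y measured from the top of the page.
# Note that they are not stored in reading order.
FRAGMENTS = [
    (72, 130, 10, "PDF stores positioned"),
    (72, 100, 18, "Introduction"),
    (200, 130, 10, "runs of text."),
    (72, 145, 10, "Nothing marks a heading."),
]

def group_into_lines(fragments, y_tolerance=2):
    """Cluster fragments whose y is within y_tolerance of the line's first fragment."""
    lines = []
    for frag in sorted(fragments, key=lambda f: (f[1], f[0])):
        if lines and abs(lines[-1][0][1] - frag[1]) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return lines

def to_html(fragments, heading_ratio=1.3):
    """Guess that a line is a heading if its font is much larger than the median."""
    body_size = median(f[2] for f in fragments)
    out = []
    for line in group_into_lines(fragments):
        line.sort(key=lambda f: f[0])  # left to right within the line
        text = escape(" ".join(f[3] for f in line))
        size = max(f[2] for f in line)
        tag = "h2" if size >= body_size * heading_ratio else "p"
        out.append(f"<{tag}>{text}</{tag}>")
    return "\n".join(out)

print(to_html(FRAGMENTS))
```

Even this toy version shows why the results are inexact: on a two-column page, grouping by vertical position alone would splice the left and right columns together line by line, and a footnote is indistinguishable from body text unless you add yet another guess.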
