
Re: Need Help for Convert PDF to HTML

by chrestomanci (Priest)
on Feb 09, 2011 at 07:54 UTC (#887149)


in reply to Need Help for Convert PDF to HTML

We had almost the exact same question back in December, with the same set of answers.

Following up on what CountZero just said, and what I posted back then: this will always be a very hard problem, because PDF files are not designed to carry structure. They are more like a printout in electronic form. You can think of PDF as PostScript that is designed to be viewable on screen as well as on paper. A PDF does not contain blocks of text in reading order with formatting attached, just lines of text in particular fonts. It is up to the human who reads those lines to decide what is a heading, a column or a footnote.

Any tool that converts PDF to HTML (or Word, plain text, etc.) has to use heuristics to guess structure from this unstructured text on a page. Those tools tend to be expensive, proprietary, and inexact, especially when faced with unusual layouts such as multiple columns or embedded images. OCR tools face similar problems for the same reasons.
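
To give a feel for what such a heuristic looks like, here is a rough sketch in Python. It is not taken from any real converter. The Fragment class, the guess_structure function and the 1.2 size threshold are all made up for illustration, and it assumes something else has already pulled positioned text runs out of the PDF. It guesses reading order from the coordinates and calls anything noticeably larger than the most common font size a heading.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Fragment:
    """One positioned run of text (hypothetical type, for illustration)."""
    x: float          # left edge, in points
    y: float          # baseline, in points from the bottom of the page
    font_size: float
    text: str


def guess_structure(fragments):
    """Guess reading order and headings from position and font size alone."""
    if not fragments:
        return []
    # Assumption: the most common font size on the page is body text.
    body_size = Counter(f.font_size for f in fragments).most_common(1)[0][0]
    # Assumption: a single column, read top to bottom, then left to right.
    ordered = sorted(fragments, key=lambda f: (-f.y, f.x))
    result = []
    for f in ordered:
        # Arbitrary threshold: 20% larger than body text counts as a heading.
        kind = "heading" if f.font_size > body_size * 1.2 else "body"
        result.append((kind, f.text))
    return result


# The fragments are deliberately out of order, as they may be in a PDF.
page = [
    Fragment(72, 650, 10, "PDF stores positioned text."),
    Fragment(72, 700, 18, "Introduction"),
    Fragment(72, 638, 10, "Structure must be inferred."),
]
for kind, text in guess_structure(page):
    print(kind, text)
```

Even this toy version shows where things go wrong: a two-column page breaks the top-to-bottom sort, and a large pull quote would be mistaken for a heading. Real tools pile many more guesses of this kind on top of each other, which is why their output is inexact.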

