Re: Convert PDF file into HTML file

by oko1 (Deacon)
on Dec 22, 2010 at 15:06 UTC (#878551)

in reply to Convert PDF file into HTML file

As a number of the excellent replies here have pointed out, there's no reliable automatic way to do this, because the information structures of PDF and HTML are incompatible. However, with a little human interaction and intelligence plugged into the system, it can be made to work (although it doesn't scale). 'pdftotext -layout' will extract the text, and 'pdfimages' will get the images. Once you have those, structuring either (or both) into a reasonable HTML approximation is relatively simple, but it does require some thought and a little artistic judgement.
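To make the manual route a little more concrete, here is a minimal Python sketch of the extract-then-assemble idea. It is only a starting point under some assumptions: the function names (extract, rough_html) and the 'img' output prefix are made up for illustration, the Poppler versions of 'pdftotext' and 'pdfimages' are assumed to be on your PATH, and the '-png' option assumes a reasonably recent Poppler build. The HTML it produces is a first draft that still needs the human judgement mentioned above.

```python
import html
import re
import subprocess
from pathlib import Path


def extract(pdf, workdir):
    """Run both tools on one PDF; return (layout text, sorted image paths)."""
    out = Path(workdir)
    out.mkdir(parents=True, exist_ok=True)
    # "-" as the output file makes pdftotext write the text to stdout.
    text = subprocess.run(
        ["pdftotext", "-layout", str(pdf), "-"],
        check=True, capture_output=True, text=True,
    ).stdout
    # Assumption: this pdfimages build supports -png. Files are written
    # as <workdir>/img-000.png, img-001.png, ...
    subprocess.run(["pdfimages", "-png", str(pdf), str(out / "img")], check=True)
    return text, sorted(str(p) for p in out.glob("img-*.png"))


def rough_html(title, text, images):
    """Build a first-draft page: one <p> per blank-line-separated block of
    text, followed by the images in order. Deciding where each image really
    belongs is left to a person."""
    # pdftotext separates pages with form feeds; treat them as block breaks.
    blocks = re.split(r"\n\s*\n", text.replace("\f", "\n\n"))
    # Collapse runs of whitespace, so the column alignment from -layout is lost.
    paras = ["<p>%s</p>" % html.escape(" ".join(b.split()))
             for b in blocks if b.strip()]
    imgs = ['<img src="%s" alt="">' % html.escape(src) for src in images]
    return "\n".join([
        "<!DOCTYPE html>",
        "<html>",
        "<head><title>%s</title></head>" % html.escape(title),
        "<body>",
        *paras,
        *imgs,
        "</body>",
        "</html>",
    ])
```

Typical use would be something like text, images = extract("file.pdf", "work") followed by writing rough_html("My document", text, images) to a file and then editing it by hand. Collapsing each block into a paragraph is a deliberate simplification: it throws away the alignment that '-layout' preserved, so tables and multi-column pages will need manual repair.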

In the (narrow, specialized) case where you know that your PDFs are going to be nothing more than plain text, the process can be automated with "pdftotext -layout -htmlmeta file.pdf". This will produce an HTML file with a simple header (including the PDF's meta information) and the content wrapped in 'pre' tags.
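For that plain-text case, batch conversion could look roughly like the sketch below. The helper names (htmlmeta_command, convert_all) and the directory layout are assumptions for illustration; the only thing taken from above is the 'pdftotext -layout -htmlmeta' invocation, here with an explicit output file name so nothing depends on the tool's default naming.

```python
import subprocess
from pathlib import Path


def htmlmeta_command(pdf, out_dir):
    """Build the pdftotext command line that writes <out_dir>/<name>.html."""
    pdf = Path(pdf)
    target = Path(out_dir) / (pdf.stem + ".html")
    return ["pdftotext", "-layout", "-htmlmeta", str(pdf), str(target)]


def convert_all(pdf_dir, out_dir):
    """Convert every *.pdf in pdf_dir; return the HTML paths written."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        cmd = htmlmeta_command(pdf, out_dir)
        # check=True stops at the first PDF that pdftotext cannot handle.
        subprocess.run(cmd, check=True)
        written.append(cmd[-1])
    return written
```

Calling convert_all("pdfs", "html") would then leave one HTML file per PDF in the 'html' directory. This only makes sense if the PDFs really are text-only; any images in them are silently dropped.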

"Language shapes the way we think, and determines what we can think about."
-- B. L. Whorf
