Re: Convert PDF file into HTML file

by oko1 (Deacon)
on Dec 22, 2010 at 15:06 UTC (#878551)


in reply to Convert PDF file into HTML file

As has been pointed out in a number of the excellent replies here, there's no reliable automatic way to convert PDF to HTML, because the two formats structure information incompatibly: PDF records where things sit on a page, while HTML records what role each piece of content plays. However, with a little human interaction and intelligence plugged into the system, it can be made to work (although it's not scalable). 'pdftotext -layout' will extract the text, and 'pdfimages' will get the images. Once you have those, structuring either (or both) into a reasonable HTML approximation is relatively simple, but it does require some thought and a little artistic judgement.
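The extraction step can be scripted even though the structuring step cannot. A minimal sketch, assuming 'pdftotext' and 'pdfimages' are installed and on the PATH; the function names, the output directory layout, and the file naming are my own choices for illustration:

```python
import subprocess
from pathlib import Path


def extraction_commands(pdf_path, out_dir):
    """Build the two command lines: one for the text, one for the images."""
    pdf = Path(pdf_path)
    out = Path(out_dir)
    text_file = out / (pdf.stem + ".txt")   # where the extracted text goes
    image_root = out / pdf.stem             # pdfimages appends its own numbering and extension
    return [
        ["pdftotext", "-layout", str(pdf), str(text_file)],
        ["pdfimages", str(pdf), str(image_root)],
    ]


def extract(pdf_path, out_dir):
    """Run both tools; raises CalledProcessError if either one fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in extraction_commands(pdf_path, out_dir):
        subprocess.run(cmd, check=True)
```

Calling extract("report.pdf", "out") would leave the text and the images in 'out', ready for the hand-structuring part. Keeping the command construction separate from the execution makes it easy to print the commands and check them before running anything.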

In the (narrow, specialized) case where you know that your PDFs contain nothing more than plain text, the process can be automated with "pdftotext -layout -htmlmeta file.pdf". This will produce an HTML file with a reasonable header and the content surrounded by 'pre' tags.
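For the plain-text case, a wrapper along these lines might do. It is a sketch, assuming 'pdftotext' is on the PATH. The helper wrap_in_pre is hypothetical and only approximates the shape described above (a header plus 'pre'-wrapped content) for text you have already extracted; the header that 'pdftotext -htmlmeta' itself writes will differ.

```python
import html
import subprocess


def htmlmeta_command(pdf_path):
    """Command line for the fully automated plain-text case."""
    return ["pdftotext", "-layout", "-htmlmeta", str(pdf_path)]


def convert(pdf_path):
    """Let pdftotext write the HTML file itself; raises on failure."""
    subprocess.run(htmlmeta_command(pdf_path), check=True)


def wrap_in_pre(text, title="Converted PDF"):
    """Hand-rolled approximation: escape the text and put it inside <pre>.

    Not pdftotext's actual output, just the same general idea.
    """
    return (
        "<html>\n<head>\n<title>" + html.escape(title) + "</title>\n</head>\n"
        "<body>\n<pre>\n" + html.escape(text) + "\n</pre>\n</body>\n</html>\n"
    )
```

The escaping matters: layout-preserved text can easily contain '<' or '&', which would otherwise be read as markup.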


--
"Language shapes the way we think, and determines what we can think about."
-- B. L. Whorf

