PerlMonks  

Re: Convert PDF file into HTML file

by oko1 (Deacon)
on Dec 22, 2010 at 15:06 UTC (#878551)


in reply to Convert PDF file into HTML file

As has been pointed out in a number of the excellent replies here, there's no reliable automatic way to do it, because the information structures of PDF and HTML are incompatible. However, with a little human interaction and intelligence plugged into the system, it can be made to work (although it's not scalable). 'pdftotext -layout' will extract the text, and 'pdfimages' will get the images (both tools ship with the Xpdf and Poppler utilities). Once you have those, structuring either (or both) into a reasonable HTML approximation is relatively simple, but does require some thought and a little artistic judgement.
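
To make the two steps above concrete, here is a rough sketch in Python rather than a finished tool. It assumes 'pdftotext' and 'pdfimages' are on your PATH; the function names (extraction_commands, extract, text_to_html) and the output naming are made up for illustration. The HTML step is deliberately naive: it only turns blank-line-separated blocks into paragraphs, which is exactly the part where human judgement has to take over.

```python
import html
import re
import shutil
import subprocess
from pathlib import Path


def extraction_commands(pdf, out_dir):
    """Build the two command lines; nothing is executed here."""
    pdf, out_dir = Path(pdf), Path(out_dir)
    text_file = out_dir / (pdf.stem + ".txt")
    # pdfimages takes an "image root" and writes <root>-NNN.<ext> files.
    image_root = out_dir / pdf.stem
    return [
        ["pdftotext", "-layout", str(pdf), str(text_file)],
        ["pdfimages", str(pdf), str(image_root)],
    ]


def extract(pdf, out_dir):
    """Run both tools; raises if either one is missing or fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in extraction_commands(pdf, out_dir):
        if shutil.which(cmd[0]) is None:
            raise RuntimeError(f"{cmd[0]} not found on PATH")
        subprocess.run(cmd, check=True)


def text_to_html(text, title="Untitled"):
    """Naive structuring: each blank-line-separated block becomes a <p>."""
    # pdftotext separates pages with form feeds; treat them as block breaks.
    text = text.replace("\f", "\n\n")
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    body = "\n".join(f"<p>{html.escape(b)}</p>" for b in blocks)
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{html.escape(title)}</title></head>\n"
        f"<body>\n{body}\n</body></html>\n"
    )
```

Something like extract("report.pdf", "out") followed by text_to_html(Path("out/report.txt").read_text()) gives you a first draft to edit by hand; placing the extracted images is left to you.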

In the (narrow, specialized) case where you know that your PDFs are going to be nothing more than plain text, the process could be automated with "pdftotext -layout -htmlmeta file.pdf". This will produce an HTML file with a reasonable header and the content surrounded by 'pre' tags.
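
For that plain-text case, a batch run over a directory might look roughly like the following. Again this is only a sketch: it assumes 'pdftotext' is on your PATH, the helper names (htmlmeta_command, convert_all) are invented for the example, and the explicit output filename is my addition to the command line above so the result lands next to the source PDF.

```python
import shutil
import subprocess
from pathlib import Path


def htmlmeta_command(pdf):
    """The command line from the post, plus an explicit .html output path."""
    pdf = Path(pdf)
    return ["pdftotext", "-layout", "-htmlmeta",
            str(pdf), str(pdf.with_suffix(".html"))]


def convert_all(folder):
    """Convert every *.pdf in folder; returns the HTML paths written."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    written = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        cmd = htmlmeta_command(pdf)
        # check=True stops the batch on the first PDF that fails to convert.
        subprocess.run(cmd, check=True)
        written.append(Path(cmd[-1]))
    return written
```

Stopping on the first failure is a choice, not a requirement; for a large pile of PDFs you may prefer to catch subprocess.CalledProcessError, note the file, and carry on.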


--
"Language shapes the way we think, and determines what we can think about."
-- B. L. Whorf

