Re: Convert PDF file into HTML file

by oko1 (Deacon)
on Dec 22, 2010 at 15:06 UTC (#878551)

in reply to Convert PDF file into HTML file

As several of the excellent replies here have pointed out, there's no reliable automatic way to do it, because the information structures of PDF and HTML are incompatible. However, with a little human interaction and intelligence plugged into the system, it can be made to work (although it doesn't scale). 'pdftotext -layout' will extract the text, and 'pdfimages' will extract the images. Once you have those, structuring either (or both) into a reasonable HTML approximation is relatively simple, but it does require some thought and a little artistic judgement.
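To make the "structuring" step a bit more concrete, here is a minimal sketch in Python. It assumes the text has already been extracted with pdftotext and the images have already been written out by pdfimages; the function name build_html and the idea of treating blank lines as paragraph breaks are my own illustrative choices, not anything the tools do for you. The human judgement mentioned above (where the images belong, what counts as a heading) is exactly the part this sketch leaves out.

```python
import html


def build_html(title, text, image_paths):
    """Wrap already-extracted text and image files in a bare-bones HTML page.

    Hypothetical helper: it simply lists the paragraphs, then the images.
    """
    # Assumption: blank lines in the extracted text mark paragraph breaks.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    body = [f"<h1>{html.escape(title)}</h1>"]
    for p in paragraphs:
        # Escape &, < and > so text from the PDF cannot break the markup.
        body.append(f"<p>{html.escape(p)}</p>")
    for path in image_paths:
        body.append(f'<img src="{html.escape(path, quote=True)}" alt="">')

    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n<body>\n" + "\n".join(body) + "\n</body>\n</html>\n"
    )
```

Note that putting the text into 'p' tags throws away the column alignment that '-layout' preserves; if the alignment matters, 'pre' tags are the safer choice.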

In the (narrow, specialized) case where you know that your PDFs are going to be nothing more than plain text, the process could be automated with "pdftotext -layout -htmlmeta file.pdf". This will produce an HTML file with a reasonable header and the content surrounded by 'pre' tags.
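For that plain-text case, the automation could look something like the following Python sketch. It assumes pdftotext is installed and on the PATH; the helper names (pdftotext_html_command, convert_all) are hypothetical, and the output file is named explicitly rather than relying on the tool's default naming.

```python
import subprocess
from pathlib import Path


def pdftotext_html_command(pdf_path, html_path):
    """Build the pdftotext command line quoted above, with an explicit output file."""
    return ["pdftotext", "-layout", "-htmlmeta", str(pdf_path), str(html_path)]


def convert_all(directory):
    """Convert every *.pdf in 'directory' and return the list of HTML paths.

    Assumption: all the PDFs are plain text, as in the narrow case described above.
    """
    outputs = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        out = pdf.with_suffix(".html")
        # check=True raises CalledProcessError if pdftotext reports a failure.
        subprocess.run(pdftotext_html_command(pdf, out), check=True)
        outputs.append(out)
    return outputs
```

Calling convert_all("docs") would then leave a docs/name.html next to each docs/name.pdf.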

"Language shapes the way we think, and determines what we can think about."
-- B. L. Whorf
