Re^2: Convert PDF file into HTML file

by ajguitarmaniac (Sexton)
on Dec 22, 2010 at 12:44 UTC

in reply to Re: Convert PDF file into HTML file
in thread Convert PDF file into HTML file

Hi chrestomanci, I do not have a solution to the topic under discussion, but I have another question for you, since you seem to have sound knowledge of the intricate structure of PDF files. The moment I saw this question, call it reflex, I googled it and found a bunch of websites that claim to convert PDF files to any desired format (including HTML). But these websites claim that they can convert 'online PDFs' to HTML. Is there a difference between a regular PDF file and these 'online PDFs'? Pardon me if my question is extremely silly, but I really want to know, because a number of the sites I came across claim they can do the conversion under discussion. Thanks.

Replies are listed 'Best First'.
Re^3: Convert PDF file into HTML file
on Dec 22, 2010 at 13:14 UTC

    I did not think I was much of an expert on the internals of PDF. I had the insight to think of PDF as similar to PostScript, and from that explained why perfect conversion is not possible.

    An 'online PDF' is no different from a normal PDF. Those websites are simply referring to PDF files that are already downloadable from the web, which makes their conversion tools simpler.

    I had a look at a few online converters, and they mostly appear to be demos for paid apps that convert to other formats. You can't download a free executable to do the conversion on your own computer; you have to use the online tool and see their ads.

    I also suspect that if you tried writing a script to use those online tools for bulk conversion, you would quickly find something preventing you such as a CAPTCHA, or a robots exclusion policy.
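
    As a rough illustration of the robots exclusion point, here is a minimal Python sketch of how a well-behaved bulk script could check a site's policy before sending requests. The robots.txt text, the host converter.example, the paths, and the user agent name bulk-pdf-script are all made up for illustration; no real converter site is being described, and a real script would download the site's actual robots.txt first.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical policy: every robot is barred from the conversion endpoint.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /convert
"""

if __name__ == "__main__":
    for url in (
        "https://converter.example/convert/upload",
        "https://converter.example/about",
    ):
        allowed = may_fetch(EXAMPLE_ROBOTS, "bulk-pdf-script", url)
        print(url, "allowed" if allowed else "disallowed")
```

    Under this made-up policy the conversion path is off limits to scripts while the rest of the site is not. Note that robots.txt is only advisory; a CAPTCHA, by contrast, would stop the script outright.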

    In any case, as I said before, the conversion will never be perfect. For an example of how far from perfect a PDF to HTML conversion can be, just click on "view as html" when Google finds PDF files in a web search.

