PerlMonks  

Re: Convert PDF file into HTML file

by chrestomanci (Priest)
on Dec 22, 2010 at 11:53 UTC ( [id://878488]=note )


in reply to Convert PDF file into HTML file

It will never be easy to convert PDF to HTML, because PDF can contain a lot more than HTML can, while at the same time PDF has a lot less structure.

HTML files usually have a linear structure that can easily be parsed. There are lots of tools for rendering them on screen or as a paper printout. Converting HTML to PDF is easy: you just 'print' the page to a PDF file. There are plenty of tools to do that.

PDF files are not designed to have structure; they are more like a printout in electronic form. You can think of them as PostScript that is designed to be viewable on screen as well as on paper. PDF does not contain blocks of text in order with formatting, just lines of text in particular fonts. It is up to the human who reads those lines to decide what is a heading, a column or a footnote.

Any tool that converts PDF to HTML (or Word, plain text, etc.) has to use heuristics to guess structure from this unstructured text on a page. Those tools tend to be expensive, proprietary, and inexact, especially when faced with unusual layouts such as multiple columns or embedded images. OCR tools face similar problems for the same reasons.

Having said that, if your input PDF files are simple, you could consider converting them to SVG (a form of XML) using pdf2svg (a standalone command-line tool built on Poppler and Cairo, rather than part of Inkscape), and then converting that XML to HTML using standard CPAN modules and your own heuristics.
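
To give an idea of what "your own heuristics" might look like, here is a minimal sketch in Python rather than Perl. It assumes the SVG contains real <text> elements with x, y and font-size attributes; some converters emit glyph outlines instead, in which case there is no text left to extract this way. The function names and the heading_ratio threshold are made up for illustration.

```python
import html
import re
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def leading_number(raw, default=0.0):
    """Return the first number in an attribute such as '12', '12px' or '10 20 30'."""
    match = re.match(r"\s*(-?\d+(?:\.\d+)?)", raw or "")
    return float(match.group(1)) if match else default


def svg_text_to_html(svg_source, heading_ratio=1.5):
    """Guess headings and paragraphs from positioned SVG text runs.

    Heuristic (an assumption, not a rule of the format): any run whose
    font size is at least heading_ratio times the median size is a heading.
    """
    root = ET.fromstring(svg_source)
    runs = []
    for elem in root.iter(SVG_NS + "text"):
        text = "".join(elem.itertext()).strip()
        if not text:
            continue
        x = leading_number(elem.get("x"))
        y = leading_number(elem.get("y"))
        size = leading_number(elem.get("font-size"), default=10.0)
        runs.append((y, x, size, text))
    if not runs:
        return ""
    # SVG y grows downwards, so ascending y reads top to bottom.
    runs.sort(key=lambda run: (run[0], run[1]))
    sizes = sorted(run[2] for run in runs)
    body_size = sizes[len(sizes) // 2]  # median size, assumed to be body text
    out = []
    for _y, _x, size, text in runs:
        tag = "h2" if size >= body_size * heading_ratio else "p"
        out.append("<{0}>{1}</{0}>".format(tag, html.escape(text)))
    return "\n".join(out)
```

Even on simple input this only recovers reading order and a guess at headings. Columns, tables and footnotes would each need their own rules.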

Replies are listed 'Best First'.
Re^2: Convert PDF file into HTML file
by elef (Friar) on Dec 22, 2010 at 12:23 UTC
    Well said.

    This probably won't be of any use, but here goes anyway: pdftotext (part of the xpdf PDF viewer) can programmatically convert PDF to "formatted" text. All it takes is system("pdftotext -layout -enc UTF-8 \"$infile\" \"$outfile\""). It approximates the original layout by inserting spaces in the text.
    As you need HTML, you're probably better off with pdf2svg; this is just a note in case pdf2svg fails or whatever.
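
    For what it's worth, the same pdftotext call can be sketched in Python with the arguments passed as a list, so no shell is involved and file names containing spaces or quotes need no escaping (the list form of Perl's system has the same property). The helper names below are made up; only the pdftotext flags come from the command above.

```python
import subprocess


def pdftotext_command(infile, outfile):
    """Build the argument list for the pdftotext call quoted above."""
    return ["pdftotext", "-layout", "-enc", "UTF-8", infile, outfile]


def pdf_to_text(infile, outfile):
    """Run pdftotext on one file.

    Raises subprocess.CalledProcessError if pdftotext exits non-zero,
    and FileNotFoundError if pdftotext is not installed or not on PATH.
    """
    subprocess.run(pdftotext_command(infile, outfile), check=True)
```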
Re^2: Convert PDF file into HTML file
by ajguitarmaniac (Sexton) on Dec 22, 2010 at 12:44 UTC

    Hi chrestomanci, I do not have a solution to the topic under discussion, but I have another question for you, since you seem to have sound knowledge of the intricate structure of PDF files. The moment I saw this question, call it reflex, I googled it and found a bunch of websites that claim to convert PDF files to any desired format (including HTML). But these websites claim that they can convert 'online PDFs' to HTML. Is there a difference between a regular PDF file and these 'online PDFs'? Pardon me if my question is extremely silly, but I really wanted to know, because a number of the sites I bumped into claim they can do the conversion under discussion. Thanks.

      I did not think I was much of an expert on the internals of PDF. I had the insight to think of PDF as similar to PostScript, and from that explained why perfect conversion is not possible.

      An online PDF is no different from a normal PDF. Those websites are simply referring to PDF files that are already downloadable on the web, which makes their conversion tools simpler.

      I had a look at a few online converters, and they mostly appear to be demos for paid apps that convert to other formats. You can't download a free executable to do the conversion on your own computer; you have to use the online tool and see their ads.

      I also suspect that if you tried writing a script to use those online tools for bulk conversion, you would quickly find something preventing you, such as a CAPTCHA or a robots exclusion policy.

      In any case, as I said before, the conversion will never be perfect. For an example of how far from perfect a PDF to HTML conversion can be, just click on "View as HTML" when Google finds PDF files in a web search.

Re^2: Convert PDF file into HTML file
by bart (Canon) on Feb 08, 2011 at 12:10 UTC
    Oh, yeah, part of the fun of working with text from PDF is that the PDF writer software may have cut the text up into small substrings and placed each one on the page individually. It does this to position the text nicely, for example for kerning (putting letters closer together to fill visual gaps between them) or justification (making spaces wider so the right side lines up with the margin).

    It's up to you to puzzle the pieces back together again.

    The text in a PDF very rarely comes as one chunk.
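
    A rough sketch of that puzzle work, in Python, assuming you have already extracted each fragment as an (x, y, width, text) tuple with y growing down the page (flip the sort for native PDF coordinates, where y grows upwards). The tuple layout, the function name and both thresholds are assumptions for illustration, not anything a particular PDF library hands you.

```python
def join_fragments(fragments, line_tolerance=2.0, space_gap=1.0):
    """Rebuild lines of text from individually placed fragments.

    fragments: iterable of (x, y, width, text) tuples.
    Fragments whose y is within line_tolerance of a line's first fragment
    share that line. Within a line, a horizontal gap wider than space_gap
    between one fragment's end and the next one's start becomes a space.
    """
    lines = []  # each entry: [baseline_y, [(x, width, text), ...]]
    for x, y, width, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if lines and abs(y - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append((x, width, text))
        else:
            lines.append([y, [(x, width, text)]])

    result = []
    for _baseline, parts in lines:
        parts.sort(key=lambda part: part[0])  # left to right
        pieces = []
        prev_end = None
        for x, width, text in parts:
            if prev_end is not None and x - prev_end > space_gap:
                pieces.append(" ")
            pieces.append(text)
            prev_end = x + width
        result.append("".join(pieces))
    return result
```

    Kerned fragments butt up against each other and get glued together, while justified word gaps are wide enough to become a space. Real documents will need the thresholds tuned per font size, and rotated or multi-column text breaks this simple approach.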
