Re: Convert PDF file into HTML file

in reply to Convert PDF file into HTML file

It will never be easy to convert PDF to HTML, because PDF can contain a lot more than HTML can, while at the same time PDF has a lot less structure.

HTML files usually have a linear structure that is easy to parse, and there are lots of tools for rendering them on screen or as a paper printout. Converting HTML to PDF is easy: you just 'print' the page to a PDF file, and there are plenty of tools to do that.

PDF files are not designed to have structure; they are more like a printout in electronic form. You can think of them as PostScript that is designed to be viewable on screen as well as on paper. A PDF typically does not contain blocks of text in order with formatting, just lines of text in particular fonts at particular positions. It is up to the human who reads those lines to decide what is a heading, a column or a footnote.

Any tool that converts PDF to HTML (or Word, plain text, etc.) has to use heuristics to guess structure from this unstructured text on a page. Those tools tend to be expensive, proprietary, and inexact, especially when faced with unusual layouts such as multiple columns or embedded images. OCR tools face similar problems for the same reasons.
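To make "heuristics" a bit more concrete, here is a minimal sketch in Python (the same idea ports to Perl easily). It assumes some extractor has already produced a list of `(x, y, font_size, text)` tuples with `y` growing down the page; that tuple layout, the function name `guess_html`, and the 1.2 size threshold are all made up for illustration and do not come from any real tool.

```python
from collections import Counter
from html import escape


def guess_html(fragments, heading_ratio=1.2):
    """Guess headings vs. paragraphs from positioned text fragments.

    fragments: list of (x, y, font_size, text) tuples. This is a
    hypothetical extractor output, with y growing down the page.
    """
    if not fragments:
        return ""

    # Heuristic 1: the most common font size is assumed to be body text.
    sizes = Counter(size for _, _, size, _ in fragments)
    body_size = sizes.most_common(1)[0][0]

    # Heuristic 2: reading order is top to bottom, then left to right.
    ordered = sorted(fragments, key=lambda f: (f[1], f[0]))

    out = []
    for _x, _y, size, text in ordered:
        # Heuristic 3: noticeably larger text is assumed to be a heading.
        tag = "h2" if size >= body_size * heading_ratio else "p"
        out.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(out)
```

Every one of those three guesses can be wrong. A two-column page, for example, breaks the reading-order guess, because sorting by `y` interleaves the lines of the left and right columns. That is exactly why these tools end up inexact.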

Having said that, if your input PDF files are simple, you could consider converting them to SVG (a form of XML) using pdf2svg (a small standalone tool built on Poppler and Cairo; Inkscape can also import PDF and save SVG), and then converting that XML to HTML using standard CPAN modules and your own heuristics.
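A rough sketch of the SVG-to-HTML step, again in Python with only the standard library (in Perl, XML::LibXML or XML::Twig would play the role of the parser). It assumes the SVG contains real `<text>` elements with single numeric `x` and `y` attributes. That is an assumption about your converter's output: some converters emit text as glyph outlines instead, and real files may position text with transforms or per-character coordinate lists, none of which this handles. The function name and the `line_tolerance` parameter are invented for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

SVG_NS = "{http://www.w3.org/2000/svg}"


def svg_text_to_html(svg_source, line_tolerance=2.0):
    """Turn positioned SVG <text> elements into one <p> per visual line."""
    root = ET.fromstring(svg_source)

    items = []
    for el in root.iter(SVG_NS + "text"):
        # itertext() also picks up text inside nested <tspan> elements.
        text = "".join(el.itertext()).strip()
        if not text:
            continue
        # Assumption: x and y are single plain numbers.
        x = float(el.get("x", "0"))
        y = float(el.get("y", "0"))
        items.append((y, x, text))

    # Group fragments whose baselines are within line_tolerance of the
    # first fragment of the current line.
    items.sort(key=lambda it: it[0])
    lines = []
    for y, x, text in items:
        if lines and abs(y - lines[-1]["y"]) <= line_tolerance:
            lines[-1]["parts"].append((x, text))
        else:
            lines.append({"y": y, "parts": [(x, text)]})

    # Within a line, order fragments left to right.
    html_lines = []
    for line in lines:
        words = [t for _, t in sorted(line["parts"], key=lambda p: p[0])]
        html_lines.append(f"<p>{escape(' '.join(words))}</p>")
    return "\n".join(html_lines)
```

This only recovers lines of text. Deciding which lines belong to the same paragraph, which are headings, and which are footnotes is where your own heuristics come in.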
