Need Help for Convert PDF to HTML

satzbu has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Need Help for Convert PDF to HTML by CountZero (Bishop) on Feb 09, 2011 at 07:16 UTC
I have always considered this to be a (near) mission impossible. It involves at least the following steps (each of which is a daunting task in itself):
1. Parsing the PDF document. A module like CAM::PDF might help you here, but it presupposes a good understanding of the internal structure of the PDF document and good knowledge of the PDF object model.
2. Building an internal (Perl) data structure of the document, so you know what each piece is and where each piece goes on which page.
3. Building the HTML page (and probably some CSS as well) to mimic the PDF layout. This will be more difficult than one thinks, as the HTML document format is actually very bad at placing "things" at exactly the spot you want. The whole idea of HTML (and CSS) is that the layout is "flowing" and will adapt itself (more or less) gracefully to the output method of the client viewing it.
So making an HTML page out of a PDF document is like putting a square peg in a round hole: difficult to achieve in a clean way, and it will never look good. Unless, of course, you cheat and embed the PDF document in your HTML page.
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
Re^2: Need Help for Convert PDF to HTML by LanX (Saint) on Feb 12, 2011 at 00:40 UTC
> Building the HTML page (and probably some CSS as well) to mimic the PDF layout. This will be more difficult than one thinks, as the HTML document format is actually very bad at placing "things" at exactly the spot you want. The whole idea of HTML (and CSS) is that the layout is "flowing" and will adapt itself (more or less) gracefully to the output method of the client viewing it.

Actually, most of this has been solvable since CSS positioning was introduced (maybe 10 years ago?). The real problem is that arbitrary fonts are (in practice) not embeddable in HTML, and reconstructing words, lines and paragraphs with even slightly different font metrics looks awkward. For example, some may remember how Google used to produce HTML previews of PDFs, with those random gaps in the text lines.

As I already said, it highly depends on the use case (and on differing definitions of what HTML is).

Cheers Rolf
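To make the CSS positioning idea concrete, here is a minimal sketch in Python (not Perl, and not taken from any tool mentioned in this thread). The fragment list is hypothetical: it stands in for whatever a PDF parser would report, with coordinates already converted to pixels from the top-left corner of the page. It does nothing about the font metrics problem described above.

```python
from html import escape

# Hypothetical fragments, as a PDF parser might report them:
# (left, top, font_size_px, text), with the origin at the page's top-left.
fragments = [
    (72, 80, 24, "Annual Report"),
    (72, 130, 12, "Revenue grew by 4% & costs fell."),
]

def page_to_html(fragments, width=612, height=792):
    """Render fragments as absolutely positioned spans inside one page box."""
    spans = []
    for left, top, size, text in fragments:
        spans.append(
            f'<span style="position:absolute;left:{left}px;top:{top}px;'
            f'font-size:{size}px;white-space:nowrap">{escape(text)}</span>'
        )
    body = "\n".join(spans)
    # The relatively positioned wrapper gives the spans a page-sized origin.
    return (
        f'<div style="position:relative;width:{width}px;height:{height}px">\n'
        f"{body}\n</div>"
    )

html_page = page_to_html(fragments)
print(html_page)
```

Each fragment lands where the source said it should, but if the browser's font is wider or narrower than the PDF's, neighbouring fragments will overlap or leave gaps.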
Re^3: Need Help for Convert PDF to HTML by CountZero (Bishop) on Feb 12, 2011 at 11:56 UTC
> Actually most of this is solvable since CSS positioning was introduced (maybe 10 years ago?)

Not even fixed or absolute positioning can guarantee that the element will end up on the client's screen at exactly the place you thought you put it. Most of the time you end up with ugly scroll bars and overlapping elements or empty spots. And in any case, I consider CSS positioning that takes elements out of the normal flow an aberration if used to try to fix the layout, but YMMV.
Re^4: Need Help for Convert PDF to HTML by LanX (Saint) on Feb 12, 2011 at 12:40 UTC
Re: Need Help for Convert PDF to HTML by chrestomanci (Priest) on Feb 09, 2011 at 07:54 UTC
We had almost the exact same question back in December, with the same set of answers. Following up on what CountZero just said, and what I posted back then: this will always be a very hard problem, because PDF files are not designed to have structure; they are more like a printout in electronic form. You can think of them as PostScript that is designed to be viewable on screen as well as on paper. PDF does not contain blocks of text in order with formatting, just lines of text in particular fonts. It is up to the human who reads those lines to decide what is a heading, a column or a footnote. Any tool that converts PDF to HTML (or Word, plain text, etc.) has to use heuristics to guess structure from this unstructured text on a page. Those tools tend to be expensive, proprietary, and inexact, especially when faced with unusual layouts such as multiple columns or embedded images. OCR tools face similar problems for the same reasons.
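As a toy illustration of the kind of heuristic meant here, the following Python sketch guesses a role for each line from its font size alone. The input list and the 1.3 and 0.8 thresholds are assumptions made up for the example; a real converter would also have to look at position, spacing and font weight, and would still guess wrong on unusual layouts.

```python
from collections import Counter

# Hypothetical extracted lines: (font_size, text). Real extractors report more.
lines = [
    (18.0, "1. Introduction"),
    (10.0, "PDF stores glyphs at coordinates."),
    (10.0, "It does not say which lines form a paragraph."),
    (7.0, "1 See the PDF reference."),
]

def classify(lines):
    """Guess a role for each line from its font size relative to the body size."""
    # Assume the most common font size on the page is the body text.
    body_size = Counter(size for size, _ in lines).most_common(1)[0][0]
    labelled = []
    for size, text in lines:
        if size >= body_size * 1.3:
            role = "heading"
        elif size <= body_size * 0.8:
            role = "footnote"
        else:
            role = "body"
        labelled.append((role, text))
    return labelled

labelled = classify(lines)
for role, text in labelled:
    print(f"{role:9} {text}")
```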
Re: Need Help for Convert PDF to HTML by LanX (Saint) on Feb 09, 2011 at 10:52 UTC
The answer highly depends on the nature of your PDFs and the result you want! There is no simple answer to this general question, because a pure print format and a flowing format are different by nature. Even simple cases would need heuristics, but general solutions would need sophisticated artificial intelligence. This post lists some possibilities (especially pdftohtml -xml) and other corresponding discussions: Parsing PDFs by text position?

Cheers Rolf
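For the pdftohtml -xml route, a rough Python sketch of post-processing might look like this. The sample XML is an assumption about the general shape of that tool's output (pages containing text elements with top and left attributes); check it against a file produced by your own version before relying on it. The grouping tolerance is likewise an arbitrary choice.

```python
import xml.etree.ElementTree as ET

# Assumed shape of `pdftohtml -xml` output; verify against a real file.
sample = """<pdf2xml>
  <page number="1" height="1188" width="918">
    <text top="100" left="300" width="80" height="15" font="0">right column</text>
    <text top="100" left="50" width="90" height="15" font="0">left column</text>
    <text top="130" left="50" width="120" height="15" font="0">second line</text>
  </page>
</pdf2xml>"""

def lines_by_position(xml_text, tolerance=3):
    """Group <text> fragments into lines by 'top', ordered by 'left' within a line."""
    result = []
    for page in ET.fromstring(xml_text).iter("page"):
        frags = sorted(
            (int(t.get("top")), int(t.get("left")), "".join(t.itertext()))
            for t in page.iter("text")
        )
        current, current_top = [], None
        for top, left, text in frags:
            # A jump in 'top' larger than the tolerance starts a new line.
            if current_top is not None and top - current_top > tolerance:
                result.append(" ".join(s for _, s in sorted(current)))
                current = []
            if not current:
                current_top = top
            current.append((left, text))
        if current:
            result.append(" ".join(s for _, s in sorted(current)))
    return result

text_lines = lines_by_position(sample)
print(text_lines)
```

Note that this naive grouping merges the two columns into one line, which is exactly the sort of case where a better heuristic is needed.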
Re: Need Help for Convert PDF to HTML by bart (Canon) on Feb 09, 2011 at 08:06 UTC
It's already a near-impossible task even if you know Perl and PDF very well. Talking about cheating... I suppose you're not converting other people's PDF files to HTML, because that would be, you know, evil... So if you're trying to convert your own PDF files to HTML, why not go one step earlier and produce HTML from the source file, maybe even from within the program used to create the layout from which the PDF is generated? Program control scripting (OLE, AppleScript) might be an option.
Re^2: Need Help for Convert PDF to HTML by jethro (Monsignor) on Feb 09, 2011 at 09:43 UTC
Converting other people's PDFs is evil?? Have you been drinking too much DMCA lately? ;-)
Re: Need Help for Convert PDF to HTML by sundialsvc4 (Abbot) on Feb 10, 2011 at 01:42 UTC
Your problem is that PDF is a page description language which, when executed by the printer, generates the desired graphics as its output. HTML, too, is a page description language. Neither of these has predictable semantic meaning. Thus, there is no (AFAIK) generalized solution to your quest. On the other hand, maybe we can think outside the box here. If you can get to a source-document that is in an XML format, e.g. DocBook™, it is a simple matter to convert that either into PDF or HTML or both. So, is it possible for you to get your hands on such an (XML...) root document? That is to say, the source that the PDF you’re looking at came from? Many producers of technical documentation use content-management systems that are XML-based, and maybe, if you ask them very nicely, they’ll let you have a copy of the document in that format. Then, your objective would become quite trivial to achieve. Basically... if you sallied down the primrose path of trying to extract data from a PDF, you probably will never arrive anywhere useful no matter how hard you bang your head. But if this is indeed a serious business requirement, there is a reasonable chance that there is another way to get at that information...
Re: Need Help for Convert PDF to HTML by fenLisesi (Priest) on Feb 10, 2011 at 12:56 UTC
Importing into Google Docs and converting to HTML is one way of doing this. As monks have discussed, the HTML thus generated will not look exactly like the PDF. If you want to automate it, Google Docs has an API, and there seem to be CPAN modules for using that API, but I haven't studied whether the API supports the PDF => HTML conversion in which you are interested. Any experience with this, monks?
Re: Need Help for Convert PDF to HTML by steve (Deacon) on Feb 11, 2011 at 16:11 UTC
Another difficulty I do not see listed among the replies here is the issue of embedded fonts. PDF documents allow fonts to be embedded, and HTML does not. If the source PDF embeds non-standard (non-web) fonts, then extracting those fonts becomes a significant challenge. Some tools are available to do just that. CAM::PDF can Extract Font Info from PDF, but the fonts themselves are another matter: when brian_d_foy asked about extracting them, the answer was that Chris Dolan intends never to add that feature. If you happen to have the font, that may be easier. It really depends on your source PDF document. CSS can be used to specify such fonts (see FontSpring "Bulletproof" Method, Smiley Variation among many). There are also licensing issues in play for many fonts. Depending on your circumstances (and perhaps the font requirements), this may be of concern/interest to you.
Re^2: Need Help for Convert PDF to HTML by inman2787 (Initiate) on Mar 26, 2011 at 04:29 UTC
1. Convert the PDF file to a text file using Acrobat Reader or any similar program. Just save it as a text file; there is no need for the pro or extended versions of Reader.
2. Open TextEdit.app, open the text file you've created, and copy/paste the whole thing into a new document window.
- Open Preferences in TextEdit.
- Go to the "Open/Save" tab.
- Change Document Type to HTML Strict or XHTML Strict, depending on your needs. In Styling, select No CSS.
- Go back and save the new document as an HTML file.

There is a step by step instruction on how to convert PDF to HTML. Hope that helps!
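If you want to script the second half of those steps instead of clicking through TextEdit, a minimal Python sketch could look like the following. It assumes you already have the exported plain text, that paragraphs are separated by blank lines, and that bare paragraphs are all you need; the function name and the sample text are made up for the example. All layout, images and tables from the PDF are lost, as with the manual route.

```python
from html import escape

def text_to_html(text, title="Converted document"):
    """Wrap blank-line-separated blocks of plain text in <p> elements."""
    paragraphs = []
    for block in text.split("\n\n"):
        block = " ".join(block.split())  # collapse hard line breaks
        if block:
            paragraphs.append(f"<p>{escape(block)}</p>")
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        f'<meta charset="utf-8">\n<title>{escape(title)}</title>\n'
        "</head>\n<body>\n" + "\n".join(paragraphs) + "\n</body>\n</html>\n"
    )

# Made-up sample standing in for the text exported from the PDF viewer.
sample = (
    "First line of a paragraph\nwrapped by the exporter.\n\n"
    "Tables & <figures> are lost."
)
html_doc = text_to_html(sample)
print(html_doc)
```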
Re^2: Need Help for Convert PDF to HTML by Anonymous Monk on Dec 31, 2011 at 15:25 UTC
HTML 5 has embedded fonts via JavaScript

