PerlMonks  

Re: Need Help for Convert PDF to HTML

by CountZero (Bishop)
on Feb 09, 2011 at 07:16 UTC (#887147)


in reply to Need Help for Convert PDF to HTML

I have always considered this to be a (near) mission impossible.

It involves at least the following steps (each of which is a daunting task in itself):

  • Parsing the PDF document. A module like CAM::PDF might help you here, but it presupposes a good understanding of the internal structure of the PDF document and good knowledge of the PDF object model.
  • Building an internal (Perl) data structure of the document, so you know what each piece is and where each piece goes on which page.
  • Building the HTML page (and probably some CSS as well) to mimic the PDF layout. This will be more difficult than one thinks, as HTML is actually very bad at placing "things" at exactly the spot you want. The whole idea of HTML (and CSS) is that the layout is "flowing" and will adapt itself (more or less) gracefully to the output method of the client viewing it.
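To make the second and third steps concrete, here is a minimal sketch in Python (the same shape maps directly onto Perl hashes and arrays). Everything in it is an assumption for illustration: the `TextPiece` structure, the `page_to_html` helper, the A4 page size, and the idea that the parsing step has already delivered text pieces with coordinates measured from the top-left corner of the page. It says nothing about how CAM::PDF or any other parser reports positions.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class TextPiece:
    """One piece of text and where it goes (hypothetical structure)."""
    page: int     # 1-based page number
    x: float      # distance from the left page edge, in points
    y: float      # distance from the top page edge, in points (assumed)
    size: float   # font size, in points
    text: str


def page_to_html(pieces, page, width=595, height=842):
    """Render the pieces of one page as absolutely positioned <span>s.

    width and height default to A4 in points; that is an assumption.
    """
    spans = []
    for p in pieces:
        if p.page != page:
            continue
        style = (f"position:absolute;left:{p.x:g}pt;top:{p.y:g}pt;"
                 f"font-size:{p.size:g}pt;white-space:pre")
        spans.append(f'<span style="{style}">{escape(p.text)}</span>')
    # The page box is the positioning context for the spans inside it.
    box = f"position:relative;width:{width:g}pt;height:{height:g}pt"
    return f'<div class="page" style="{box}">\n' + "\n".join(spans) + "\n</div>"


pieces = [
    TextPiece(page=1, x=72, y=72, size=18, text="Annual Report"),
    TextPiece(page=1, x=72, y=110, size=11, text="Profit & loss <draft>"),
    TextPiece(page=2, x=72, y=72, size=11, text="Appendix"),
]
html_page_1 = page_to_html(pieces, page=1)
print(html_page_1)
```

Note that this only covers the easy part: it pins each piece to a coordinate and does nothing about fonts, images, or reflowing the text for a different screen.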

So making an HTML page out of a PDF document is like putting a square peg in a round hole: difficult to achieve in a clean way, and it will never look good.

Unless of course you cheat and embed the PDF-document in your HTML page.
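The "cheat" could look something like the sketch below, which builds an HTML `<object>` element pointing at the PDF, with a plain download link as fallback content. The helper name `embed_pdf`, the file name and the default dimensions are made up for the example; whether the PDF is actually shown inline depends on the client.

```python
from html import escape


def embed_pdf(url, width=800, height=1000):
    """Return an <object> element that embeds the PDF found at url.

    The nested paragraph is fallback content for clients that cannot
    display the PDF inline.
    """
    u = escape(url, quote=True)
    return (f'<object data="{u}" type="application/pdf" '
            f'width="{int(width)}" height="{int(height)}">\n'
            f'  <p>This PDF cannot be displayed inline. '
            f'<a href="{u}">Download it instead</a>.</p>\n'
            f'</object>')


snippet = embed_pdf("report.pdf")
print(snippet)
```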

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James


Re^2: Need Help for Convert PDF to HTML
by LanX (Canon) on Feb 12, 2011 at 00:40 UTC
    > Building the HTML page (and probably some CSS as well) to mimic the PDF layout. This will be more difficult than one thinks, as HTML is actually very bad at placing "things" at exactly the spot you want. The whole idea of HTML (and CSS) is that the layout is "flowing" and will adapt itself (more or less) gracefully to the output method of the client viewing it.

    Actually, most of this has been solvable since CSS positioning was introduced (maybe 10 years ago?). The real problem is that arbitrary fonts are (in practice) not embeddable in HTML, and words, lines and paragraphs reconstructed with even slightly different font metrics look awkward.

    For example, some may remember how Google used to produce HTML previews of PDFs, with those random gaps in the text lines.
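A toy calculation shows where such gaps come from. It assumes, very crudely, that every glyph in a font has the same advance width, and that each word is pinned at the x position it had in the original document. The function name `word_gaps` and all the widths are invented for the illustration; real fonts have per-glyph metrics and kerning.

```python
def word_gaps(words, starts, char_width):
    """Gap (in points) between the end of each word and the start of the next,
    when word i is drawn at starts[i] in a font whose glyphs are all
    char_width points wide. A negative gap means the words overlap."""
    gaps = []
    for i in range(len(words) - 1):
        end = starts[i] + len(words[i]) * char_width
        gaps.append(round(starts[i + 1] - end, 2))
    return gaps


words = ["square", "peg", "round", "hole"]

# Word start positions as laid out with the original font:
# 5.0pt per glyph and a 2.5pt space (both numbers are assumptions).
starts, x = [], 72.0
for w in words:
    starts.append(x)
    x += len(w) * 5.0 + 2.5

print(word_gaps(words, starts, 5.0))  # original font
print(word_gaps(words, starts, 4.5))  # slightly narrower substitute
print(word_gaps(words, starts, 5.5))  # slightly wider substitute
```

With the original width the spacing is even; with the narrower substitute the gaps grow and differ per word, and with the wider one a word starts to run into its neighbour. The positions are all "correct", yet the line looks wrong.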

    As I already said, it depends heavily on the use case (and on differing definitions of what HTML is).

    Cheers Rolf

      > Actually most of this is solvable since CSS positioning was introduced (maybe 10 years ago?)

      Not even fixed or absolute positioning can guarantee that the element will end up on the client's screen at exactly the place you thought you put it. Most of the time you end up with ugly scroll bars, overlapping elements or empty spots.

      And in any case, I consider CSS positioning that takes elements out of the normal flow an aberration when it is used to try to fix the layout, but YMMV.

      CountZero


        I doubt this. I had excellent results converting DVI to HTML, and that was already 8 years ago, on NN4 and IE5.

        It even worked when heuristics were used to reproduce flowing text, with relative positioning of embedded formulas.

        But all of this only worked as long as the same fonts were used.

        As I said, the positioning of elements works; the exact size of those elements is the problem.

        Cheers Rolf

