
Convert PDF to HTML (or JPEG)

by Anonymous Monk
on Sep 12, 2009 at 07:47 UTC (#794904=perlquestion)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, is there any Perl module to convert a PDF file to HTML? I found some external tools which do this, but all of them are Windows-only and I need to do it on a Linux box. If there is no PDF-to-HTML solution, the not-so-good fallback would be PDF to JPEG. Thank you all!

Replies are listed 'Best First'.
Re: Convert PDF to HTML (or JPEG)
by almut (Canon) on Sep 12, 2009 at 12:31 UTC

    For PDF to JPG (or any other raster image format like PNG or TIFF), you could use GhostScript to do the conversion:

    $ gs -q -dBATCH -dNOPAUSE -sDEVICE=jpeg -dJPEGQ=88 -r150 -sOutputFile=img%d.jpg input.pdf

    This would create as many images (img1.jpg to imgN.jpg) as there are pages in the PDF file.  -r is the resolution in dpi (150dpi would create an image size of 1240x1754 for A4 paper size), and -dJPEGQ is the quality factor (up to 100).
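The image size quoted above follows from the paper size and the resolution. A minimal sketch of that arithmetic, assuming A4 is 210mm x 297mm and rounding to the nearest pixel (the helper name page_pixels is made up for illustration, and Ghostscript's own rounding may differ by a pixel):

```shell
# Pixel dimensions of a page rendered at a given resolution.
# Usage: page_pixels DPI [WIDTH_MM HEIGHT_MM]   (defaults to A4)
# Hypothetical helper, not part of Ghostscript or ImageMagick.
page_pixels() {
    awk -v dpi="$1" -v w="${2:-210}" -v h="${3:-297}" \
        'BEGIN {
            # 1 inch = 25.4 mm; add 0.5 so %d rounds to nearest
            printf "%dx%d\n", w / 25.4 * dpi + 0.5, h / 25.4 * dpi + 0.5
        }'
}

page_pixels 150    # prints 1240x1754
page_pixels 600    # prints 4961x7016
```

Other paper sizes can be passed in millimetres, e.g. page_pixels 150 215.9 279.4 for US Letter.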

    Unfortunately, this doesn't do any anti-aliasing, so the fonts typically look rather ragged...  You can work around that problem by doing the anti-aliasing yourself; which means, you'd have to oversample while rendering from PDF to raster (e.g. by a factor of 4, i.e. 600dpi) and then downsample with an appropriate filter.

    ImageMagick's convert can be used for the latter. The complete sequence of steps would be:

    $ gs -q -dBATCH -dNOPAUSE -sDEVICE=jpeg -dJPEGQ=88 -r600 -sOutputFile=img%d.jpg input.pdf
    $ for img in img*.jpg ; do convert $img -filter Lanczos -resize 25% -quality 90 out_$img ; done

    The resulting anti-aliased images out_img*.jpg would then have 150dpi resolution.
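The two steps can be wrapped so that the render resolution and the resize percentage always stay consistent with each other. This is only a sketch under assumptions: the function names, the default target of 150dpi, and the default oversampling factor of 4 are illustrative choices, while the gs and convert options are the ones used in the commands above.

```shell
# Hypothetical helpers: derive both parameters from a target DPI and factor.
render_dpi()     { echo $(( $1 * $2 )); }                            # e.g. 150 * 4 = 600
resize_percent() { awk -v f="$1" 'BEGIN { printf "%g%%\n", 100 / f }'; }  # e.g. 4 -> 25%

# Usage: pdf_to_antialiased_jpegs input.pdf [TARGET_DPI [FACTOR]]
pdf_to_antialiased_jpegs() {
    pdf=$1
    target_dpi=${2:-150}
    factor=${3:-4}

    # Step 1: oversample while rendering PDF to raster.
    gs -q -dBATCH -dNOPAUSE -sDEVICE=jpeg -dJPEGQ=88 \
       -r"$(render_dpi "$target_dpi" "$factor")" \
       -sOutputFile=img%d.jpg "$pdf" || return 1

    # Step 2: downsample with a Lanczos filter to get the anti-aliasing.
    for img in img*.jpg; do
        convert "$img" -filter Lanczos \
                -resize "$(resize_percent "$factor")" \
                -quality 90 "out_$img" || return 1
    done
}
```

Calling pdf_to_antialiased_jpegs input.pdf should then be equivalent to the two commands above. A larger factor costs more rendering time and disk space for the intermediate images.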

    In case you have the non-/usr/bin-namespace-polluting sister GraphicsMagick installed (instead of ImageMagick), the command would be gm convert ...
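A script that has to run on boxes with either package could pick the command at run time. A small sketch, assuming only that the binaries are named convert and gm as described above (pick_convert is a made-up name):

```shell
# Print the conversion command available on this machine:
# ImageMagick's "convert" if present, otherwise GraphicsMagick's "gm convert".
# Returns non-zero if neither is installed.
pick_convert() {
    if command -v convert >/dev/null 2>&1; then
        echo "convert"
    elif command -v gm >/dev/null 2>&1; then
        echo "gm convert"
    else
        echo "neither ImageMagick nor GraphicsMagick found" >&2
        return 1
    fi
}
```

Usage would be along the lines of: cmd=$(pick_convert) && $cmd in.jpg -resize 25% out.jpg (the variable is left unquoted on purpose so that "gm convert" splits into two words).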

    (Those who hold a degree in Signal Processing - or have come in contact with filter design in some other context - might want to take a look at the list of filters to choose from — in case of doubt, stick with Lanczos or Kaiser for somewhat sharper, or Gaussian or Cubic for somewhat softer results.)

    Also, there's documentation - well hidden from daylight - under /usr/share/doc/ghostscript/Devices.htm, which explains what options are available with the individual Ghostscript output devices (you usually need to have another package installed (e.g. ghostscript-doc on Debian/Ubuntu) to have that file).

      Almut, IIRC convert has a switch for antialiasing, I never had problems converting PDF to bitmaps (well ... years ago)

      So no need for oversampling.

      Cheers Rolf

        Yes, convert has an -antialias switch, but not GhostScript — at least not the jpeg driver (there's an x11alpha screen driver, but I think that's the only one which does anti-aliasing by itself).  And ImageMagick (i.e. convert) cannot render PDF/PS itself; it uses GhostScript for that under the hood, anyway...

        Personally, I prefer to use both tools separately, because then I have fine control over the parameters used during conversion, and so far, I've always achieved better results (in less time) than when trying to convince convert alone to do what I want.

        For example, the naive approach (which I figure should be comparable to the conversions I posted above) when using convert directly would be something like this:

        $ convert input.pdf -density 150 -geometry 1240x1754 -antialias -quality 90 img%d.jpg

        But the results are much worse than when doing the steps separately... (example: test1.jpg, test2.jpg — where test1.jpg has been produced by using gs and convert separately, and test2.jpg when calling gs indirectly via convert (the command right above)).

        As I read the docs, -density is supposed to set the resolution ("set resolution of an image for rendering to devices"), however, for some reason this doesn't seem to be passed on to Ghostscript (as can be revealed using strace)...  In case you have the patience to figure out the correct incantation of options for convert that achieves the quality of test1.jpg, please let me know (input PDF here) — IMHO, there's too much Magick going on :)

Re: Convert PDF to HTML (or JPEG) (How?)
by LanX (Chancellor) on Sep 12, 2009 at 10:25 UTC
    What kind of conversion do you expect?

    PDF is a print format with fixed geometry and line breaks. Each character is positioned individually, and the larger context is (by default) lost.

    (Normal) HTML defines text (lines and paragraphs) that is laid out and broken flexibly, depending on the user's display.

    Cheers Rolf

    UPDATE: you might want to look at solutions using xpdf-tools like pdf2html, which produces HTML files (+ massive CSS) with fixed-position text... is that what you want?

      Oh, sorry, I didn't see the -c switch, which does exactly what I need. Thanks!
Re: Convert PDF to HTML (or JPEG)
by ww (Archbishop) on Sep 12, 2009 at 09:15 UTC
    I don't know if this will help, but have you evaluated SWISH::Filters::Pdf2HTML?

    from CPAN:

    - Perl extension for filtering PDF documents with Swish-e
    This is a plug-in module that uses the xpdf package to convert PDF documents to html for indexing by Swish-e. Any info tags found in the PDF document are created as meta tags.
    This filter plug-in requires the xpdf package
      I tried xpdf some time ago when looking into the same problem, and it seems that xpdf ignores pictures entirely when converting :-(

        I'm not quite sure what you were expecting, README:

        Xpdf is an open source viewer for Portable Document Format (PDF) files. (These are also sometimes also called 'Acrobat' files, from the name of Adobe's PDF software.) The Xpdf project also includes a PDF text extractor, PDF-to-PostScript converter, and various other utilities.

        man pdftotext:

        Pdftotext converts Portable Document Format (PDF) files to plain text. Pdftotext reads the PDF file, PDF-file, and writes a text file, text-file. If text-file is not specified, pdftotext converts file.pdf to file.txt. If text-file is '-', the text is sent to stdout.

        man pdfimages:

        Pdfimages saves images from a Portable Document Format (PDF) file as Portable Pixmap (PPM), Portable Bitmap (PBM), or JPEG files. Pdfimages reads the PDF file PDF-file, scans one or more pages, and writes one PPM, PBM, or JPEG file for each image, image-root-nnn.xxx, where nnn is the image number and xxx is the image type (.ppm, .pbm, .jpg).

        These utilities are not designed to output HTML with embedded images.
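To get at both the text and the pictures, the two utilities quoted above can be run side by side. A rough sketch, assuming pdftotext and pdfimages are installed; the function name, the output directory layout, and the file name text.txt are invented for illustration:

```shell
# Usage: extract_pdf_parts input.pdf [OUTDIR]
# Writes OUTDIR/text.txt plus one image file per embedded image.
extract_pdf_parts() {
    pdf=$1
    outdir=${2:-extracted}
    mkdir -p "$outdir" || return 1

    # Plain text of the whole document.
    pdftotext "$pdf" "$outdir/text.txt" || return 1

    # Embedded images, named image-nnn.xxx under OUTDIR;
    # -j saves JPEG-encoded images as .jpg instead of converting them to PPM.
    pdfimages -j "$pdf" "$outdir/image"
}
```

This yields the raw ingredients only. Stitching them back into an HTML page with the images in the right places is still left to whoever calls it.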

