Convert PDF to HTML (or JPEG)

by ww
on Sep 12, 2009

in reply to Convert PDF to HTML (or JPEG)

I don't know if this will help, but have you evaluated SWISH::Filters::Pdf2HTML?

from CPAN:

- Perl extension for filtering PDF documents with Swish-e
This is a plug-in module that uses the xpdf package to convert PDF documents to html for indexing by Swish-e. Any info tags found in the PDF document are created as meta tags.
This filter plug-in requires the xpdf package


Re^2: Convert PDF to HTML (or JPEG)
by Sewi on Sep 12, 2009
    I tried xpdf some time ago when looking into the same problem, and it seems that xpdf ignores pictures entirely when converting :-(

      I'm not quite sure what you were expecting. From the README:

      Xpdf is an open source viewer for Portable Document Format (PDF) files. (These are also sometimes called 'Acrobat' files, from the name of Adobe's PDF software.) The Xpdf project also includes a PDF text extractor, PDF-to-PostScript converter, and various other utilities.

      man pdftotext:

      Pdftotext converts Portable Document Format (PDF) files to plain text. Pdftotext reads the PDF file, PDF-file, and writes a text file, text-file. If text-file is not specified, pdftotext converts file.pdf to file.txt. If text-file is '-', the text is sent to stdout.

      man pdfimages:

      Pdfimages saves images from a Portable Document Format (PDF) file as Portable Pixmap (PPM), Portable Bitmap (PBM), or JPEG files. Pdfimages reads the PDF file PDF-file, scans one or more pages, and writes one PPM, PBM, or JPEG file for each image, image-root-nnn.xxx, where nnn is the image number and xxx is the image type (.ppm, .pbm, .jpg).

      These utilities are not designed to output HTML with embedded images.
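      That said, the two utilities can be glued together by hand. What follows is only a rough sketch, not anything xpdf ships: it assumes pdftotext and pdfimages are on the PATH, and the function names (build_commands, render_html, pdf_to_rough_html) are made up for illustration. It runs pdftotext for the text, pdfimages -j for the pictures, and wraps both in a very plain HTML page. The images simply end up listed after the text, because neither tool reports where on the page a picture belongs.

```python
import html
import subprocess
from pathlib import Path


def build_commands(pdf_path, out_dir):
    """Return the pdftotext and pdfimages command lines (nothing is run here)."""
    out = Path(out_dir)
    stem = Path(pdf_path).stem
    # -layout keeps the physical layout as far as possible; -enc sets the output encoding.
    text_cmd = ["pdftotext", "-layout", "-enc", "UTF-8",
                str(pdf_path), str(out / f"{stem}.txt")]
    # -j writes DCT-encoded images as JPEG; other images still come out as PPM/PBM.
    image_cmd = ["pdfimages", "-j", str(pdf_path), str(out / stem)]
    return text_cmd, image_cmd


def render_html(title, text, image_names):
    """Wrap the extracted text and the image file names in a minimal HTML page."""
    imgs = "\n".join(
        f'<img src="{html.escape(name, quote=True)}" alt="">' for name in image_names
    )
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        f"<body>\n<pre>{html.escape(text)}</pre>\n{imgs}\n</body></html>\n"
    )


def pdf_to_rough_html(pdf_path, out_dir):
    """Run both tools and write <stem>.html next to the extracted files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem
    text_cmd, image_cmd = build_commands(pdf_path, out)
    subprocess.run(text_cmd, check=True)
    subprocess.run(image_cmd, check=True)
    text = (out / f"{stem}.txt").read_text(encoding="utf-8", errors="replace")
    # Only the JPEG files are linked: browsers do not display PPM/PBM.
    # Assumes the file stem contains no glob metacharacters.
    images = sorted(p.name for p in out.glob(f"{stem}-*.jpg"))
    page = out / f"{stem}.html"
    page.write_text(render_html(stem, text, images), encoding="utf-8")
    return page
```

      Calling pdf_to_rough_html("file.pdf", "out") should leave out/file.html alongside the extracted text and images. Any PPM or PBM output would still need converting to a web format with some other tool before it could be shown in the page.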


