PerlMonks  

Re: Converting M$ Word --> PDF

by neuroball (Pilgrim)
on Jan 23, 2004 at 05:52 UTC (#323457)


in reply to Converting M$ Word --> PDF

After a bit of research I found a solution that just might do it for you:

  • Download OpenOffice for whatever OS you would like to use and install it.
  • Go to the ooolib web site on SourceForge and download/install it. This will make OpenOffice's API accessible to Perl.
  • In OpenOffice you can "Print to File" and set the file type to "PDF". You just have to find out how to access this function from the ooolib level.

Btw. OpenOffice does automatically open Word files.

/oliver/
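
For what it's worth, later OpenOffice-derived suites can do this conversion from the command line without going through a Perl API. The following is only a minimal sketch, assuming a LibreOffice install whose soffice binary is on the PATH (an assumption: the OpenOffice 1.x discussed in this thread predates these options, and this is not the ooolib route). SOFFICE and doc_to_pdf are made-up names for illustration.

```shell
#!/bin/sh
# Assumption: LibreOffice's soffice is installed; override SOFFICE if it
# lives elsewhere. SOFFICE and doc_to_pdf are illustrative names only.
SOFFICE="${SOFFICE:-soffice}"

# doc_to_pdf FILE [OUTDIR]: convert one Word document to PDF without a GUI.
# The PDF is written into OUTDIR (default: the current directory).
doc_to_pdf() {
    doc="$1"
    outdir="${2:-.}"
    "$SOFFICE" --headless --convert-to pdf --outdir "$outdir" "$doc"
}
```

Something like `doc_to_pdf catalogue.doc ./pdf` would then be the whole job, though how well tables, graphics and templates survive would still need checking by eye.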


Re: Re: Converting M$ Word --> PDF
by Roger (Parson) on Jan 23, 2004 at 06:03 UTC
    This is a good suggestion. Too bad neuroball beat me to it. :-)

    Just to add to the above comment: if your office pack does not allow printing to PDF, you can print to PostScript and then convert the PostScript to PDF using Ghostscript.

      Hi Roger,

      Just to add to the above comment: if your office pack does not allow printing to PDF, you can print to PostScript and then convert the PostScript to PDF using Ghostscript.

      No, my Office 2000 doesn't seem to have this, unless there is a service pack or other update I haven't installed?

      Peter

        print to PostScript and then convert the PostScript to PDF using Ghostscript.
        No, my Office 2000 doesn't seem to have this

        Go to your Printers control panel, create a New Printer, use the driver for the Apple LaserWriter (included with most if not all versions of Windows) and set it to print to file, rather than to a real printer. When you print to that "printer", it will ask you for a filename -- end the filename in ".ps" (stands for PostScript). These files can then be opened in GSView (or Acrobat Reader, I think) or converted to Acrobat's (un)Portable Document Format using ps2pdf.

        However, this method will require manual opening and printing-to-file of each and every document, which isn't what you asked for in your original post and is probably not what you really want to do, unless you're only doing a few documents. (If you *are* doing only a handful of documents, then it's an easy way out, which is why I explained it.)


        $;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/
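
Once the .ps files exist, at least the ps2pdf half of this can be batched. A minimal sketch, assuming Ghostscript's ps2pdf wrapper is on the PATH (under cygwin or on a *nix box); the PS2PDF variable and the convert_ps_dir function are hypothetical names, not anything from the posts above.

```shell
#!/bin/sh
# Assumption: Ghostscript's ps2pdf is installed. PS2PDF is only an
# illustrative override hook so a different converter can be swapped in.
PS2PDF="${PS2PDF:-ps2pdf}"

# convert_ps_dir DIR: convert every DIR/*.ps to a .pdf with the same base name.
convert_ps_dir() {
    dir="$1"
    for ps in "$dir"/*.ps; do
        # If nothing matched, the pattern stays literal; skip it.
        [ -e "$ps" ] || continue
        pdf="${ps%.ps}.pdf"
        if "$PS2PDF" "$ps" "$pdf"; then
            echo "converted: $pdf"
        else
            echo "failed: $ps" >&2
        fi
    done
}
```

Calling `convert_ps_dir .` would process the current directory. The printing-to-file step in Word stays manual, so this only removes half of the tedium.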
M$ Word -->HTML-->PS--> PDF
by g00n (Hermit) on Jan 23, 2004 at 11:00 UTC
    You can get a similar result by exporting MS Word docs to HTML, then using html2ps to convert the file to PostScript, and finally converting the PostScript to PDF using ps2pdf. Something like:
    • export the file to HTML using MS Word/OO, i.e. file.doc -> file.html
    • use cygwin on Windows (or copy the file to a *nix system)
    • perl /usr/bin/html2ps file.html > file.ps
    • ps2pdf file.ps

    If you have cygwin on a MS system this works OK (especially if you don't have access to a *nix). The above suggestion works a treat if you have OO/*nix combo.

    It works for text, but I have not tried text/graphics or plain graphics. Has anyone had experience with graphics using this approach?
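
The two command-line steps above chain together easily. A minimal sketch, assuming html2ps and ps2pdf are both on the PATH; the HTML2PS and PS2PDF variables and the html_to_pdf function are hypothetical names added for illustration.

```shell
#!/bin/sh
# Assumption: html2ps and Ghostscript's ps2pdf are installed.
# HTML2PS / PS2PDF are illustrative override hooks, not part of either tool.
HTML2PS="${HTML2PS:-html2ps}"
PS2PDF="${PS2PDF:-ps2pdf}"

# html_to_pdf FILE.html: write FILE.ps and FILE.pdf next to the input,
# then print the name of the PDF. Stops at the first step that fails.
html_to_pdf() {
    html="$1"
    base="${html%.html}"
    "$HTML2PS" "$html" > "$base.ps" || return 1
    "$PS2PDF" "$base.ps" "$base.pdf" || return 1
    echo "$base.pdf"
}
```

Usage would be `html_to_pdf file.html`. The intermediate file.ps is kept on purpose, so it can be inspected in GSView if the PDF looks wrong.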

      Hi g00n,

      Thanks for your tips on how to go

      M$ Word -->HTML-->PS--> PDF

      If you have cygwin on a MS system this works OK (especially if you don't have access to a *nix). The above suggestion works a treat if you have OO/*nix combo.

      I do have cygwin installed on the Win box, and I also have shell access to the Linux box at the website. The fewer steps and 'box changes' the better. The 3 steps in my reply to "neuroball" are the ideal situation, but the current Word doc (the catalogue) has tables, graphics and was 'built' with Word templates, so I have no idea how well it would all convert.

      Peter

        the problem

          but the current Word doc (the catalogue) has tables, graphics and was 'built' with Word templates, so I have no idea how well it would all convert.

        The site that got me interested in PDF was Stas Bekman's site, www.stason.org. He gave a talk to the Melbourne PM last year. During his talk on mod_perl 2 he showed the notes from his site in HTML, with PDF downloads of the site.

        So I tried to re-create this html->ps->pdf pipeline so that I too could have a printable version of a project I'm working on called Ratpile (it makes a directory of *stuff* searchable by storing information about it in a relational database; data mining, some may call it), using perl+DBI+TT2. The template I created is a *bare bones* HTML page sans images. This is the technique Stas is using with his docset.

        The point I guess I'm trying to make is that I've used text only and not images. I've done a bit of research and this is what I've come up with...

        • graphics are supported in PostScript (3?)
        • others better than I (ybiC) have hacked together html->PS->PDF code that appears to handle images via html2ps, but not HTML tables (see "Create PostScript and PDF versions of all HTML files in given directory")
        • one approach could be to use Matt Sergeant's PDFLib (load_image method), an OO wrapper around pdflib from www.pdflib.com, but I seem to remember pdflib has licence restrictions (has to be open source, private use or research)
        • or use Alfred Reibenschuh's Text::PDF::API; via an old page I also found PDF-API2-0, which has some image-handling (jpg, png) capabilities
        • logreport has an interesting set of observations about html->PDF generation, namely problems with HTML formatting and tables

        building html->PDF with images and troublesome html tables

        Now, given what we have found above, I would suggest the following (unless anyone has a better idea):

        • extract the Word document to HTML
        • extract the table data (from the Word document via OLE, or from the HTML via HTML-TableExtract; I like the latter better)
        • remove the HTML tables from the HTML documents
        • reinsert the data into a simple table, using <pre> tags for layout and HTML tags for bolding and emphasis. Or find some other method, by experimentation, for representing tables as text in HTML
        • use PDF-API2 as the PDF renderer. This can all be done in code.

        The real problem may be rendering the tables generated from Word. A complicated layout in Word (re-rendered to HTML) will have to be translated to PostScript syntax and then rendered to PDF. The problem comes down to converting the HTML tables to PDF.

        It is not rocket science to create a bit of code to extract the data from the table and re-create the table using PDF-API2 (and its child modules).

        update: Perl Graphics Programming has 3 chapters devoted to PDF and perl, 1 specifically on PDF-API2.

        but is there a shortcut?

        Of course you could forget all the above and take your chances with Michael Frankl's HTML-HTMLDOC and convert your HTML files directly to PDF :)

        credits

        damn I love cpan.

Re: Re: Converting M$ Word --> PDF
by peterr (Scribe) on Jan 23, 2004 at 23:43 UTC
    Hi neuroball,

    Download OpenOffice for whatever OS you would like to use and install it

    Okay, I only have Win95, so the 1.0.x version is the only one I can install.

    Go to the ooolib web site on sourceforge and download/install it. This will make OpenOffice's API accessible to perl.

    Okay, will do. :)

    In OpenOffice you can "Print to File" and set the file type to "PDF". You just have to find out how to access this function from the ooolib level.

    Okay, I don't know how this all fits together with Perl (because I noticed 'ooolib' is a Perl library). No doubt OpenOffice must spawn a Perl process, but I don't know.

    Btw. OpenOffice does automatically open Word files

    I do have Word 2000, as part of the Office Developer 2000 suite, and I can see it will not help me, but your solution will. The reason I need to do this is that every time a client wants me to update his catalogue on the website, I can change it in Word, but there is also a PDF catalogue on the website, which of course also needs updating. He (the client) has the tools (Adobe) to convert the new catalogue from Word --> PDF, but I don't. I usually have to ask him 10 to 15 times to convert it, even though it's a 5 min job. So this gets to be rather a pain in the ... after a while, and I would ideally like to do this:

    1. Have the 'catalog' on the website in HTML format.
    2. Use Perl to convert to PDF
    3. Use Perl to convert to Word

    I know I have seen a Perl module to do step 2, don't know if there is a Perl module to go HTML --> PDF though.

    Thanks, :)

    Peter

      Peter, you can also use the above concept to convert from HTML to PDF. I have no idea where the limits are, though.

      Just use perl to open the HTML file in OpenOffice and then make OpenOffice print it to PDF. I just tried it with a google page, and as long as the images are local, no problems should arise.

      If you want another way you might try the following:

      • Download HTMLdoc, which is GPL'ed, and install it.
      • Download HTML::HTMLdoc from CPAN and install it.
        1. Do some perl magic to get what you want...
        2. ...Unknown step...
        3. Profit!

      /oliver/
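
For the HTMLdoc route, the "perl magic" may not even be needed for a first test, since the program can be run directly. A minimal sketch, assuming the htmldoc command-line program is installed and on the PATH; it calls htmldoc itself rather than going through the HTML::HTMLdoc module, and the HTMLDOC variable and html_to_pdf_htmldoc function are hypothetical names.

```shell
#!/bin/sh
# Assumption: the htmldoc binary is installed. HTMLDOC is only an
# illustrative override hook for pointing at a different location.
HTMLDOC="${HTMLDOC:-htmldoc}"

# html_to_pdf_htmldoc IN.html [OUT.pdf]: convert a plain web page to PDF.
# --webpage treats the input as an unstructured page (no book chapters);
# -f names the output file, and the .pdf suffix selects PDF output.
html_to_pdf_htmldoc() {
    src="$1"
    dst="${2:-${src%.html}.pdf}"
    "$HTMLDOC" --webpage -f "$dst" "$src"
}
```

Running `html_to_pdf_htmldoc catalogue.html` would be enough to see how the tables and images come out before writing any Perl around it.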

        Hi neuroball,

        you can also use the above concept to convert from HTML to PDF. I have no idea where the limits are, though.

        Yes, I think I saw a module on CPAN to do it. Now, if there was also a Perl module to do HTML --> Word, that would be great.

        Just use Perl to open the HTML file in OpenOffice and then make OpenOffice print it to PDF. I just tried it with a google page, and as long as the images are local, no problems should arise.

        Okay, I will try that, seeing as I have just downloaded OpenOffice, all 52 MB. I hope it handles tables, Word templates, images, etc. okay.

        If you want another way you might try the following:

        • Download HTMLdoc, which is GPL'ed, and install it.
        • Download HTML::HTMLdoc from CPAN and install it.
          1. Do some perl magic to get what you want...
          2. ...Unknown step...
          3. Profit!

        Okay, I will try that also. Wow, I sure have enough things to try now, thanks everyone for your help. :)

        Peter
