PerlMonks  

Converting M$ Word --> PDF

by peterr (Scribe)
on Jan 23, 2004 at 05:19 UTC (#323450=perlquestion)
peterr has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I need to convert a M$ Word document to (Adobe) PDF format please. :)

I did see a node on doing this here, but it appears to me, that I would need to run this on a Win box, is that correct ? Ideally, I would like to simply run a Perl script on the Linux box (Perl 5.8.1), enter the input filename (Word doc), press a button, and the script converts to PDF format.

However, if I can only do this on a Win box, using OLE (and I do have Word installed, plus ActiveState Perl 5.8.?), then no doubt that will have to do. Can someone please explain the modules needed to do this (I have searched CPAN), and the limitations there may be?

Thanks, :)

Peter

Edited by BazB fix link (HTML -> PM link).

Re: Converting M$ Word --> PDF
by neuroball (Pilgrim) on Jan 23, 2004 at 05:52 UTC

    After a bit of research I found a solution that just might do it for you:

    • Download OpenOffice for whatever OS you would like to use and install it.
    • Go to the ooolib web site on sourceforge and download/install it. This will make OpenOffice's API accessible to perl.
    • In OpenOffice you can "Print to File" and set the filetype to "PDF". You just have to find out how to access these functions from the ooolib level.

    Btw. OpenOffice does automatically open Word files.

    /oliver/

      This is a good suggestion. Too bad neuroball beat me to it. :-)

      Just to add to the above comment - if your office pack does not allow printing to PDF, you can print to Postscript and then convert the Postscript to PDF using ghostscript.

        Hi Roger,

        Just to add to the above comment - if your office pack does not allow printing to PDF, you can print to Postscript and then convert the Postscript to PDF using ghostscript.

        No, my Office 2000 doesn't seem to have this, unless there is a service pack or other update I haven't installed ??

        Peter

      You can get a similar result by exporting MS Word docs to HTML, then using html2ps to convert the file to PostScript, and finally converting the PostScript to PDF using ps2pdf. Something like:
      • export file to html using MSWord/OO to say file.doc->file.html
      • using cygwin on windows (or copy file to *nix sys)
      • perl /usr/bin/html2ps file.html > file.ps
      • ps2pdf file.ps

      If you have cygwin on a MS system this works OK (especially if you don't have access to a *nix). The above suggestion works a treat if you have OO/*nix combo.

      It works for text. But I have not tried text/graphics or plain graphics. Anyone had experience with graphics using this approach?
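The html2ps and ps2pdf steps above could be driven from one Perl script. This is only a minimal sketch: it assumes both programs are installed and on the PATH, the sub names (output_names, html_to_pdf) are made up for illustration, and it has not been checked against documents with graphics.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Derive file.ps and file.pdf from file.html (or file.htm).
sub output_names {
    my ($html) = @_;
    (my $base = $html) =~ s/\.html?\z//i;
    return ("$base.ps", "$base.pdf");
}

sub html_to_pdf {
    my ($html) = @_;
    my ($ps, $pdf) = output_names($html);

    # Step 1: html2ps prints PostScript on stdout; copy it into file.ps.
    open my $from_html2ps, '-|', 'html2ps', $html
        or die "cannot start html2ps: $!";
    open my $ps_fh, '>', $ps or die "cannot write $ps: $!";
    while (my $line = <$from_html2ps>) {
        print {$ps_fh} $line;
    }
    close $ps_fh        or die "cannot close $ps: $!";
    close $from_html2ps or die "html2ps failed (wait status $?)";

    # Step 2: ps2pdf (part of Ghostscript) converts file.ps to file.pdf.
    system('ps2pdf', $ps, $pdf) == 0
        or die "ps2pdf failed (wait status $?)";
    return $pdf;
}

# Convert every HTML file named on the command line.
for my $file (@ARGV) {
    print html_to_pdf($file), "\n";
}
```

Saved as, say, html2pdf.pl (a hypothetical name), it would be run as "perl html2pdf.pl file.html" and leave file.ps and file.pdf next to the input.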

        Hi g00n,

        Thanks for your tips on how to go

        M$ Word -->HTML-->PS--> PDF

        If you have cygwin on a MS system this works OK (especially if you don't have access to a *nix). The above suggestion works a treat if you have OO/*nix combo.

        I do have cygwin installed on the Win box, but I also have (shell) access to the Linux box at the website. The fewer steps and 'box changes' the better. The 3 steps in my reply to "neuroball" are the ideal situation, but the current Word doc (the catalogue) has tables, graphics and was 'built' with Word templates, so I have no idea how well it would all convert.

        Peter

      Hi neuroball,

      Download OpenOffice for whatever OS you would like to use and install it

      Okay, I only have Win95, so the 1.0.x version is the only one I can install.

      Go to the ooolib web site on sourceforge and download/install it. This will make OpenOffice's API accessible to perl.

      Okay, will do. :)

      In OpenOffice you can "Print to File" and set the filetype to "PDF". You just have to find out how to access these functions from the ooolib level.

      Okay, I don't know how this all fits together with Perl (because I noticed 'ooolib' is a Perl library). No doubt OpenOffice must spawn a Perl process, I don't know ??

      Btw. OpenOffice does automatically open Word files

      I do have Word 2000, as part of the Office Developer 2000 suite, but I can see it will not help me, whereas your solution will. The reason I need to do this is that every time a client wants me to update his catalogue on the website, I can change it in Word, but there is also a PDF catalogue on the website, which of course also needs updating. He (the client) has the tools (Adobe) to convert the new catalogue from Word --> PDF, but I don't. I usually have to ask him 10 to 15 times to convert it, even though it's a 5 min job. So, this gets rather a pain in the ..., after a while, and I would ideally like to do this:

      1. Have the 'catalog' on the website in HTML format.
      2. Use Perl to convert to PDF
      3. Use Perl to convert to Word

      I know I have seen a Perl module to do step 2, but I don't know if there is a Perl module to go HTML --> Word (step 3) though.

      Thanks, :)

      Peter

        Peter, you can use the above concept to convert from HTML to PDF as well. I have no idea where the limits are though.

        Just use perl to open the HTML file in OpenOffice and then make OpenOffice print it to PDF. I just tried it with a google page, and as long as the images are local, no problems should arise.

        If you want another way you might try the following:

        • Download HTMLdoc, which is GPL'ed, and install it.
        • Download HTML::HTMLdoc from CPAN and install it.
          1. Do some perl magic to get what you want...
          2. ...Unknown step...
          3. Profit!

        /oliver/

Re: Converting M$ Word --> PDF
by CountZero (Bishop) on Jan 23, 2004 at 06:48 UTC
    ADOBE allows you to have 5 documents converted to PDF for free here.

    And of course one could think of buying and installing Acrobat Professional, which allows you to make your own PDF-files.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      CountZero,

      Thanks for the link to the 5 free conversions. That might be a good short term solution. :)

      And of course one could think of buying and installing Acrobat Professional, which allows you to make your own PDF-files.

      Some severe health problems == no cash to splash on such tools. :(

      Peter

Re: Converting M$ Word --> PDF
by cfreak (Chaplain) on Jan 23, 2004 at 17:01 UTC

    I've done this for a customer of mine, and no you don't need a Windows machine, or even X running to do it.

    I make calls to external programs using IPC::Open3. For a while I was using wvWare, which has a utility, wvPDF, for converting from Word to PDF. It worked okay but it has to have LaTeX and a bunch of fonts installed. At some point LaTeX broke and it stopped working. I never have figured out the problem so YMMV. wvWare seemed a bit slow anyway.

    Next I tried Antiword, which is much faster and goes straight to text or to PostScript. I use it for the conversion to PostScript and then use ps2pdf to create the final PDF file. The documents come out perfectly.
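For readers wondering what that looks like in practice, here is a rough sketch of the Antiword-then-ps2pdf route using IPC::Open3. It is not the code described in this post: the sub names run_capture and doc_to_pdf are hypothetical, the a4 paper size is an assumption, and it relies on antiword's -p option (a paper size, which selects PostScript output on stdout) and on both programs being on the PATH.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IPC::Open3;
use IO::Select;
use Symbol 'gensym';

# Run a command without a shell; return (stdout, stderr, exit code).
sub run_capture {
    my @cmd = @_;
    my ($in, $out);
    my $err = gensym;    # open3 needs a real handle to keep stderr separate
    my $pid = open3($in, $out, $err, @cmd);
    close $in;           # the child gets no input from us

    my ($out_fd, $err_fd) = (fileno($out), fileno($err));
    my %buf = ($out_fd => '', $err_fd => '');

    # Read both pipes as data arrives so neither can fill up and block.
    my $sel = IO::Select->new($out, $err);
    while ($sel->count) {
        for my $fh ($sel->can_read) {
            my $n = sysread($fh, my $chunk, 8192);
            if ($n) { $buf{ fileno($fh) } .= $chunk }
            else    { $sel->remove($fh) }    # EOF (or read error)
        }
    }
    waitpid($pid, 0);
    return ($buf{$out_fd}, $buf{$err_fd}, $? >> 8);
}

sub doc_to_pdf {
    my ($doc, $pdf) = @_;

    # antiword -p <papersize> writes PostScript to stdout.
    my ($ps, $err, $exit) = run_capture('antiword', '-p', 'a4', $doc);
    die "antiword failed: $err" if $exit != 0;

    my $ps_file = "$pdf.ps";    # temporary PostScript file
    open my $fh, '>', $ps_file or die "cannot write $ps_file: $!";
    print {$fh} $ps;
    close $fh or die "cannot close $ps_file: $!";

    (undef, $err, $exit) = run_capture('ps2pdf', $ps_file, $pdf);
    die "ps2pdf failed: $err" if $exit != 0;
    unlink $ps_file;
    return $pdf;
}

# Usage: perl doc2pdf.pl catalogue.doc catalogue.pdf
print doc_to_pdf(@ARGV), "\n" if @ARGV == 2;
```

The extra control here is that stderr and the exit code come back separately, so a failed conversion can report what antiword or ps2pdf actually complained about.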

    Antiword is pretty small so in my spare time I've kind of been looking into ways it could be accessed directly from a Perl module. Not having much spare time or being very good at C has somewhat hindered that progress though :)

    Hope that helps,
    Chris

    Lobster Aliens Are attacking the world!
      Hi Chris,

      I make calls to external programs using IPC::Open3.

      I had a look at that on CPAN, but "open a process for reading, writing, and error handling" didn't mean much to me, sorry. I don't understand how I would use that.

      Next I tried Antiword, which is much faster and goes straight to text or to PostScript. I use it for the conversion to PostScript and then use ps2pdf to create the final PDF file. The documents come out perfectly.

      Antiword is pretty small so in my spare time I've kind of been looking into ways it could be accessed directly from a Perl module. Not having much spare time or being very good at C has somewhat hindered that progress though :)

      I have downloaded the *nix version, but it looks like I'd need to compile all that in C, ... too much hassle and my brain hurts with that type of stuff. I'm downloading the 'Win' version, because it's a binary, and at a d/load speed of 0.2K/sec, it should be finished by tomorrow. You wouldn't be able to send me the *nix version of Antiword, would you (please)?

      Peter

        I have downloaded the *nix version, but it looks like I'd need to compile all that in C

        Only if you download the source (which is usually what you get if you go directly to the home page of a project, but that's not the usual way most folks install software). You can probably get an Antiword package for your Linux distribution. If you use an RPM-based distro, for example, check on rpmfind.net. Gentoo also has an ebuild for it (app-text/antiword), though you're probably not using Gentoo if compiling C code gives you a headache. I can't speak for Debian-based distros with any degree of certitude, as I've not recently used any of those except Knoppix, but I suspect apt-get install antiword might make you Bob's nephew there. (That's a guess. If it doesn't work, ask someone who uses Debian. Last time I used Debian apt didn't exist yet; there was only dselect.)


        $;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/

        No offense but the whole idea of Perlmonks is to learn, rather than getting people to do it for you. I don't mind answering questions, even about how to compile but don't give up so easily. Besides I'm on dial-up and it creates several files all of which I'd have to track down and send to you.

        Honestly it's not that hard. If you open the readme file, all you have to do (on Linux) is type 'make' and then 'make install' for a local installation in your home directory, or 'make global_install' to install for the whole system as root. The latter is probably what you want. If you aren't on Linux, copy the appropriate 'Makefile.<your_os_name>' to just 'Makefile' and follow the same steps.

        As for the IPC::Open3 you don't have to use it, you can use system(), I just found that it gave me more control. See Advanced Perl Programming for some good examples on how to do it.
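As an illustration of the plain system() route, here is a small sketch under a few assumptions: a POSIX shell (as on Linux) handles the redirect, antiword and ps2pdf are on the PATH, a4 is the wanted paper size, and the sub names (shell_quote, antiword_command, doc_to_pdf) are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Quote one argument for a POSIX shell (assumption: /bin/sh, as on Linux).
sub shell_quote {
    my ($arg) = @_;
    $arg =~ s/'/'\\''/g;    # close the quote, add an escaped ', reopen
    return "'$arg'";
}

# antiword writes PostScript to stdout, so the shell does the redirect.
sub antiword_command {
    my ($doc, $ps) = @_;
    return sprintf 'antiword -p a4 %s > %s',
        shell_quote($doc), shell_quote($ps);
}

sub doc_to_pdf {
    my ($doc, $pdf) = @_;
    my $ps = "$pdf.ps";     # temporary PostScript file

    # system() returns the wait status; 0 means the command succeeded.
    system(antiword_command($doc, $ps)) == 0
        or die "antiword failed (wait status $?)";
    system('ps2pdf', $ps, $pdf) == 0
        or die "ps2pdf failed (wait status $?)";
    unlink $ps;
    return $pdf;
}

# Usage: perl doc2pdf.pl catalogue.doc catalogue.pdf
print doc_to_pdf(@ARGV), "\n" if @ARGV == 2;
```

The trade-off is roughly the one described above: system() is shorter, but the programs' error messages go straight to the terminal instead of into the script, and file names have to be quoted by hand because a shell is involved.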

        Lobster Aliens Are attacking the world!
Re: Converting M$ Word --> PDF
by jonadab (Parson) on Jan 24, 2004 at 02:25 UTC
    I need to convert a M$ Word document

    Wow, it's hard to find something for that on CPAN. The terms "Microsoft", "word", and "document" all occur in the documentation for approximately every single module EVER, making it totally impossible to use them as search criteria. The only thing I managed to find that seems relevant at all is docclient.

    Failing the existence on CPAN of a module just for reading Word documents, I tend to agree with the guy who advised you to get OpenOffice and ooolib; though I haven't used ooolib yet personally, I know that OpenOffice generally does as excellent a job with Word documents as can be hoped for, given the immense complexity and extremely poor documentation for that format.

    Ideally, I would like to simply run a Perl script on the Linux box

    That shouldn't be a problem. Install OpenOffice on the Linux box; you already have Perl there, of course. That leaves ooolib, which according to the sourceforge project page runs on Linux. I've not used ooolib myself, though, since I usually write scripts that work with the XML; I don't have to deal with Word documents much. But now that I know ooolib exists, I'm making myself a note to check it out soon; it could be quite useful :-)


    $;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/
      The only thing I managed to find that seems relevant at all is docclient.

      Checked it out, the only thing that might be hard is "On the server machine, a Docserver application (usually docserver.pl program) has to be running."

      Will see how the OpenOffice and ooolib 'combo' goes.

      That shouldn't be a problem. Install OpenOffice on the Linux box; you already have Perl there, of course. That leaves ooolib, which according to the sourceforge project page runs on Linux. I've not used ooolib myself, though, since I usually write scripts that work with the XML; I don't have to deal with Word documents much. But now that I know ooolib exists, I'm making myself a note to check it out soon; it could be quite useful :-)

      If only there was a Perl module that was HTML::Word available, because I know there is a HTML::PDF there. I recently used Perl to create an Excel file, wow, could not have been easier, so I'm really surprised there is nothing in Perl that can create Word documents. (But then, even Clipper can create Excel files). I guess a lot depends on how much of the format of M$ Word Microsoft will release, because having made the comments on Excel, I do know the complete layout of Excel was available some years back. The bottom line I guess is, if M$ haven't released ALL the info on the structure of M$ Word files, then no-one is going to be able to create them (although isn't _that_ what OpenOffice can do ??)

      Peter

        If only there was a Perl module that was HTML::Word available, because I know there is a HTML::PDF there.

        Better would be WordProcessing::MSWord::Parse.

        I recently used Perl to create an Excel file, wow, could not have been easier, so I'm really surprised there is nothing in Perl that can create Word documents.

        Oh, there is some stuff for _creating_ Word documents, but I skipped over it for two reasons: _creating_ documents isn't what you asked for (you wanted to _read_ them and create something _else_ from them), and the modules I saw were rather more specialized than general (e.g., one of them was for creating reports having something to do with DBI I think, in Word document format). In general, creating documents in a partially-understood format is easier than parsing them, because for parsing you have to know whatever aspect of the format that the document happens to use. For generating documents, you just have to figure out the basics, and then you can use the regular means (e.g., Word) to create one that's like what you want and simply copy large parts of it without fully understanding them, substituting in your custom content each time in place of the dummy content from the initial document.

        I guess a lot depends on how much of the format of M$ Word Microsoft will release

        Unless I am greatly mistaken, most of what we know about the Word document format does not come from information that Microsoft has released.

        then no-one is going to be able to create them (although isn't _that_ what OpenOffice can do ??)

        OpenOffice inherited its Word input and output filters from StarDivision, who created them the same way that Corel did for the WordPerfect suite: by studying documents that were created with Word and figuring out what the different parts mean. The filters have been refined over the years and are getting to be quite good now, but there was some trial and error that went into getting them right; it wasn't as simple as reading a specification and implementing it. I suspect that the source code for the Word input and output filters built into OpenOffice is probably the best extant documentation of the Word document format outside of Microsoft. (Inside of Microsoft there is the source code for Word, of course.)


        $;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/
