Re: Converting M$ Word --> PDF
by neuroball (Pilgrim) on Jan 23, 2004 at 05:52 UTC
|
After a bit of research I found a solution that just might do it for you:
- Download OpenOffice for whatever OS you would like to use and install it.
- Go to the ooolib web site on sourceforge and download/install it. This will make OpenOffice's API accessible to perl.
- In OpenOffice you can "Print to File" and set the filetype to "PDF". You just have to find out how to access these functions from the ooolib level.
Btw, OpenOffice opens Word files automatically.
/oliver/
| [reply] |
|
This is a good suggestion. Too bad neuroball beat me to it. :-)
Just to add to the above comment - if your office pack does not allow printing to PDF, you can print to Postscript and then convert the Postscript to PDF using ghostscript.
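For the PostScript route, here is a minimal sketch, assuming Ghostscript is installed and on your PATH as "gs". The function names are made up for illustration; the options are the usual ones for the pdfwrite device.

```shell
#!/bin/sh
# Sketch: convert a PostScript file to PDF with Ghostscript's pdfwrite device.
# Function names are illustrative, not part of any tool.

# Derive the output name: catalogue.ps -> catalogue.pdf
pdf_name_for() {
    printf '%s.pdf\n' "${1%.ps}"
}

# Run Ghostscript non-interactively (-dNOPAUSE -dBATCH) and quietly (-q).
ps_to_pdf() {
    command -v gs >/dev/null 2>&1 || { echo "gs not found" >&2; return 127; }
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
       -sOutputFile="$(pdf_name_for "$1")" "$1"
}
```

Calling "ps_to_pdf catalogue.ps" should leave catalogue.pdf next to the input. The ps2pdf script that ships with Ghostscript wraps much the same call.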
| [reply] |
|
Hi Roger,
Just to add to the above comment - if your office pack does not allow printing to PDF, you can print to Postscript and then convert the Postscript to PDF using ghostscript.
No, my Office 2000 doesn't seem to have this, unless there is a service pack or other update I haven't installed ??
Peter
| [reply] |
|
You can get a similar result by exporting MS Word docs to HTML, then using html2ps to convert the file to PostScript, and finally converting the PostScript to PDF with ps2pdf.
something like ...
- export file to html using MSWord/OO to say file.doc->file.html
- using cygwin on windows (or copy file to *nix sys)
- perl /usr/bin/html2ps file.html > file.ps
- ps2pdf file.ps
If you have cygwin on a MS system this works OK (especially if you don't have access to a *nix). The above suggestion works a treat if you have an OO/*nix combo.
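The two command-line steps above can be wrapped in a small script. This is only a sketch: it assumes html2ps and ps2pdf are both on your PATH (so html2ps is called directly rather than via "perl /usr/bin/html2ps"), and the function names are invented for the example.

```shell
#!/bin/sh
# Sketch of the HTML -> PostScript -> PDF chain described above.

# Strip a trailing .html or .htm to get the base name.
base_name_for() {
    case "$1" in
        *.html) printf '%s\n' "${1%.html}" ;;
        *.htm)  printf '%s\n' "${1%.htm}" ;;
        *)      printf '%s\n' "$1" ;;
    esac
}

html_to_pdf() {
    base=$(base_name_for "$1")
    # Step 1: HTML -> PostScript. Stop here if html2ps fails.
    html2ps "$1" > "$base.ps" || return 1
    # Step 2: PostScript -> PDF.
    ps2pdf "$base.ps" "$base.pdf"
}
```

"html_to_pdf file.html" writes file.ps and then file.pdf. Checking the exit status after the first step avoids feeding a half-written PostScript file to ps2pdf.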
It works for text. But I have not tried text/graphics or plain graphics. Anyone had experience with graphics using this approach?
| [reply] |
|
Hi g00n,
Thanks for your tips on how to go
M$ Word -->HTML-->PS--> PDF
If you have cygwin on a MS system this works OK (especially if you don't have access to a *nix). The above suggestion works a treat if you have OO/*nix combo.
I do have cygwin installed on the Win box, and I also have shell access to the Linux box at the website. The fewer steps and fewer 'box changes' the better. As in my reply to "neuroball", the 3 steps are the ideal situation, but the current Word doc (the catalogue) has tables and graphics and was 'built' with Word templates, so I have no idea how well it would all convert.
Peter
| [reply] |
|
Hi neuroball,
Download OpenOffice for whatever OS you would like to use and install it
Okay, I only have Win95, so the 1.0.x version is the only one I can install.
Go to the ooolib web site on sourceforge and download/install it. This will make OpenOffice's API accessible to perl.
Okay, will do. :)
In OpenOffice you can "Print to File" and set the filetype to "PDF". You just have to find out how to access these functions from the ooolib level.
Okay, I don't know how this all fits together with Perl (because I noticed 'ooolib' is a Perl library). No doubt OpenOffice must spawn a Perl process, I don't know ??
Btw. OpenOffice does automatically open Word files
I do have Word 2000, as part of the Office Developer 2000 suite, but I can see it will not help me, whereas your solution will. The reason I need to do this is that every time a client wants me to update his catalogue on the website, I can change it in Word, but there is also a PDF catalogue on the website, which of course also needs updating. He (the client) has the tools (Adobe) to convert the new catalogue from Word --> PDF, but I don't. I usually have to ask him 10 to 15 times to convert it, even though it's a 5 min job. So this gets to be rather a pain in the ... after a while, and I would ideally like to do this:
1. Have the 'catalog' on the website in HTML format.
2. Use Perl to convert to PDF
3. Use Perl to convert to Word
I know I have seen a Perl module to do step 2 (HTML --> PDF), but I don't know if there is a Perl module to go HTML --> Word.
Thanks, :)
Peter
| [reply] |
|
Peter, you can also use the above approach to convert from HTML to PDF. I have no idea where the limits are, though.
Just use perl to open the HTML file in OpenOffice and then make OpenOffice print it to PDF. I just tried it with a google page, and as long as the images are local, no problems should arise.
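If driving OpenOffice from Perl proves awkward, a command-line route may be simpler. The sketch below is an assumption-heavy alternative, not the ooolib approach: it relies on the headless "--convert-to" option found in current LibreOffice-style "soffice" binaries, which the OpenOffice 1.0.x discussed in this thread predates. The function names are invented for the example.

```shell
#!/bin/sh
# Sketch: headless conversion with a modern soffice binary (assumption:
# a LibreOffice-style soffice that supports --convert-to is installed).

# Where the PDF ends up: <outdir>/<input name without extension>.pdf
expected_pdf_path() {
    name=$(basename "$1")
    printf '%s/%s.pdf\n' "${2:-.}" "${name%.*}"
}

# Convert one document (HTML, Word, ...) to PDF without opening a window.
office_to_pdf() {
    outdir=${2:-.}
    soffice --headless --convert-to pdf --outdir "$outdir" "$1"
}
```

"office_to_pdf catalog.html out" would write out/catalog.pdf. As noted above, images referenced by the HTML should be local files.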
If you want another way you might try the following:
- Download HTMLdoc, which is GPL'ed, and install it.
- Download HTML::HTMLdoc from CPAN and install it.
- Do some perl magic to get what you want...
- ...Unknown step...
- Profit!
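To make the "magic" step a little less mysterious: HTMLDOC also has a command-line program, so you can try it before touching the Perl module. A minimal sketch, assuming the "htmldoc" binary is on your PATH; the wrapper function name is made up.

```shell
#!/bin/sh
# Sketch: call the htmldoc command-line tool directly.

# Convert an HTML page to PDF; the output name defaults to <input>.pdf.
html_file_to_pdf() {
    in=$1
    out=${2:-${in%.*}.pdf}
    # --webpage treats the input as a plain web page rather than a book;
    # -f names the output file.
    htmldoc --webpage -f "$out" "$in"
}
```

"html_file_to_pdf catalog.html" would produce catalog.pdf. Once that gives acceptable output, the same conversion can be scripted from Perl.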
/oliver/
| [reply] |
|
Re: Converting M$ Word --> PDF
by CountZero (Bishop) on Jan 23, 2004 at 06:48 UTC
|
Adobe allows you to have 5 documents converted to PDF for free here. And of course one could think of buying and installing Acrobat Professional, which allows you to make your own PDF files.
CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
| [reply] |
|
CountZero,
Thanks for the link to the 5 free conversions. That might be a good short term solution. :)
And of course one could think of buying and installing Acrobat Professional, which allows you to make your own PDF-files.
Some severe health problems == no cash to splash on such tools. :(
Peter
| [reply] |
Re: Converting M$ Word --> PDF
by cfreak (Chaplain) on Jan 23, 2004 at 17:01 UTC
|
I've done this for a customer of mine, and no, you don't need a Windows machine, or even X running, to do it.
I make calls to external programs using IPC::Open3. For a while I was using wvWare, which has a utility, wvPDF, for converting from Word to PDF. It worked okay, but it needs LaTeX and a bunch of fonts installed. At some point LaTeX broke and it stopped working. I never did figure out the problem, so YMMV. wvWare seemed a bit slow anyway.
Next I tried Antiword, which is much faster and goes straight to text or to PostScript. I use it for the conversion to PostScript and then use ps2pdf to create the final PDF file. The documents come out perfectly.
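Roughly, the Antiword route looks like the sketch below. It assumes antiword and ps2pdf are on your PATH; the function name and the a4 default are choices made for the example, not anything from the setup described above.

```shell
#!/bin/sh
# Sketch: Word -> PostScript (antiword -p) -> PDF (ps2pdf).

word_to_pdf() {
    doc=$1
    paper=${2:-a4}          # antiword wants a paper size for PostScript output
    base=${doc%.doc}
    # Step 1: Word -> PostScript. Bail out if antiword cannot read the file.
    antiword -p "$paper" "$doc" > "$base.ps" || return 1
    # Step 2: PostScript -> PDF.
    ps2pdf "$base.ps" "$base.pdf" || return 1
    # The intermediate PostScript file is no longer needed.
    rm -f "$base.ps"
}
```

"word_to_pdf catalogue.doc letter" would write catalogue.pdf on US letter paper. From Perl, each of the two commands can be run with system() or IPC::Open3.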
Antiword is pretty small so in my spare time I've kind of been looking into ways it could be accessed directly from a Perl module. Not having much spare time or being very good at C has somewhat hindered that progress though :)
Hope that helps,
Chris
Lobster Aliens Are attacking the world!
| [reply] |
|
Hi Chris,
I make calls to external programs using IPC::Open3.
I had a look at that on CPAN, but "open a process for reading, writing, and error handling" didn't mean much to me. Sorry, I don't understand how I would use that.
Next I tried Anti-Word which is much faster and goes straight to text or to Postscript. I use it for the conversion to Postscript and then use ps2pdf to create the final PDF file. The documents come out perfectly.
Antiword is pretty small so in my spare time I've kind of been looking into ways it could be accessed directly from a Perl module. Not having much spare time or being very good at C has somewhat hindered that progress though :)
I have downloaded the *nix version, but it looks like I'd need to compile all that in C ... too much hassle, and my brain hurts with that type of stuff. I'm downloading the 'Win' version, because it's a binary, and at a d/load speed of 0.2K/sec, it should be finished by tomorrow. You wouldn't be able to send me the *nix version of Antiword, would you (please)?
Peter
| [reply] |
|
I have downloaded the *nix version, but it looks like I'd need to compile all that in C
Only if you download the source (which is usually what
you get if you go directly to the home page of a
project, but that's not the usual way most folks
install software).
You can probably get an Antiword package for your
Linux distribution. If you use an RPM-based distro,
for example, check on rpmfind.net.
Gentoo also has an ebuild for it (app-text/antiword),
(though you're probably not using Gentoo if compiling
C code gives you a headache). I can't speak for
Debian-based distros with any degree of certitude,
as I've not recently used any of those except Knoppix,
but I suspect apt-get install antiword might
make you Bob's nephew there. (That's a guess. If
it doesn't work, ask someone who uses Debian. Last
time I used Debian apt didn't exist yet; there was
only dselect.)
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}}
split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/
| [reply] [d/l] [select] |
|
No offense but the whole idea of Perlmonks is to learn, rather than getting people to do it for you. I don't mind answering questions, even about how to compile but don't give up so easily. Besides I'm on dial-up and it creates several files all of which I'd have to track down and send to you.
Honestly it's not that hard. If you open the readme file, all you have to do (on Linux) is type 'make' and then 'make install' for a local installation in your home directory, or 'make global_install' to install for the whole system as root. The latter is probably what you want. If you aren't on Linux, copy the appropriate 'Makefile.<your_os_name>' to just 'Makefile' and follow the same steps.
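Spelled out as a script, the build steps are sketched below. It assumes you are already inside the unpacked Antiword source directory; the function names are invented, and the Makefile suffix is whatever your platform's file is actually called in that directory.

```shell
#!/bin/sh
# Sketch of the build steps from the readme, run inside the source directory.

# Non-Linux only: copy the platform Makefile into place.
# $1 is the suffix of an existing Makefile.<suffix> in this directory.
select_makefile() {
    if [ -f "Makefile.$1" ]; then
        cp "Makefile.$1" Makefile
    else
        echo "no Makefile.$1 in $(pwd)" >&2
        return 1
    fi
}

# Compile, then install locally in your home directory.
build_antiword() {
    make && make install
    # For a system-wide install, run this as root instead:
    #   make global_install
}
```

On Linux, "build_antiword" alone should be enough. Elsewhere, call select_makefile with your platform's suffix first.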
As for IPC::Open3, you don't have to use it; you can use system(). I just found that it gave me more control. See Advanced Perl Programming for some good examples of how to do it.
Lobster Aliens Are attacking the world!
| [reply] [d/l] [select] |
Re: Converting M$ Word --> PDF
by jonadab (Parson) on Jan 24, 2004 at 02:25 UTC
|
I need to convert a M$ Word document
Wow, it's hard to find something for that on CPAN. The
terms "Microsoft", "word", and "document" all occur in
the documentation for approximately every single
module EVER, making it totally impossible to use them
as search criteria. The only thing I managed to find
that seems relevant at all is docclient.
Failing the existence on CPAN of a module just for
reading Word documents, I
tend to agree with the guy who advised you to get
OpenOffice and ooolib; though I haven't used ooolib
yet personally, I know that OpenOffice generally does
as excellent a job with Word documents as can be
hoped for, given the immense complexity and extremely
poor documentation for that format.
Ideally, I would like to simply run a Perl script on the Linux box
That shouldn't be a problem. Install OpenOffice on
the Linux box; you already have Perl there, of course.
That leaves ooolib, which according to the sourceforge
project page runs on Linux. I've not used ooolib
myself, though, since I usually write scripts that
work with the XML; I don't have to deal with Word
documents much. But now that I know ooolib exists,
I'm making myself a note to check it out soon; it
could be quite useful :-)
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}}
split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/
| [reply] [d/l] |
|
The only thing I managed to find that seems relevant at all is docclient.
Checked it out, the only thing that might be hard is "On the server machine, a Docserver application (usually docserver.pl program) has to be running."
Will see how the OpenOffice and ooolib 'combo' goes.
That shouldn't be a problem. Install OpenOffice on the Linux box; you already have Perl there, of course. That leaves ooolib, which according to the sourceforge project page runs on Linux. I've not used ooolib myself, though, since I usually write scripts that work with the XML; I don't have to deal with Word documents much. But now that I know ooolib exists, I'm making myself a note to check it out soon; it could be quite useful :-)
If only there was a Perl module that was HTML::Word available, because I know there is a HTML::PDF there. I recently used Perl to create an Excel file, wow, could not have been easier, so I'm really surprised there is nothing in Perl that can create Word documents. (But then, even Clipper can create Excel files). I guess a lot depends on how much of the format of M$ Word Microsoft will release, because having made the comments on Excel, I do know the complete layout of Excel was available some years back. The bottom line I guess is, if M$ haven't released ALL the info on the structure of M$ Word files, then no-one is going to be able to create them (although isn't _that_ what OpenOffice can do ??)
Peter
| [reply] |
|
If only there was a Perl module that was HTML::Word available, because I know there is a HTML::PDF there.
Better would be WordProcessing::MSWord::Parse.
I recently used Perl to create an Excel file, wow, could not have been easier, so I'm really surprised there is nothing in Perl that can create Word documents.
Oh, there is some stuff for _creating_ Word documents,
but I skipped over it for two reasons: _creating_
documents isn't what you asked for (you wanted to
_read_ them and create something _else_ from them),
and the modules I saw were rather more specialized
than general (e.g., one of them was for creating
reports having something to do with DBI I think, in
Word document format). In general, creating documents
in a partially-understood format is easier than
parsing them, because for parsing you have to know
whatever aspect of the format that the document
happens to use. For generating documents, you just
have to figure out the basics, and then you can use
the regular means (e.g., Word) to create one that's
like what you want and simply copy large parts of it
without fully understanding them, substituting in
your custom content each time in place of the dummy
content from the initial document.
I guess a lot depends on how much of the format of M$ Word Microsoft will release
Unless I am greatly mistaken, most of what we know
about the Word document format does not come from
information that Microsoft has released.
then no-one is going to be able to create them (although isn't _that_ what OpenOffice can do ??)
OpenOffice inherited its Word input and output filters
from StarDivision, who created them the same way that
Corel did for the WordPerfect suite: by studying
documents that were created with Word and figuring
out what the different parts mean. The filters have
been refined over the years and are getting to be
quite good now, but there was some trial and error
that went into getting them right; it wasn't as
simple as reading a specification and implementing
it. I suspect that the source code for the Word
input and output filters built into OpenOffice is
probably the best extant documentation of the
Word document format outside of Microsoft. (Inside
of Microsoft there is the source code for Word, of
course.)
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}}
split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/
| [reply] [d/l] |