MS Word conversion

by gibsonca (Beadle)
on Oct 07, 2009 at 02:38 UTC (#799637=perlquestion)
gibsonca has asked for the wisdom of the Perl Monks concerning the following question:

Is it possible not only to convert an MS Word document to text, but then to recreate the original MS Word document from that text file? My goal is to automate some changes to a set of MS Word documents.

humble perl user

Replies are listed 'Best First'.
Re: MS Word conversion
by JadeNB (Chaplain) on Oct 07, 2009 at 03:25 UTC

    Here's one way of using Perl to read Microsoft Word documents. I don't have Windows or Word to test it, and the post cleverly mentions that solutions like this go out of date rapidly but doesn't give any date for the post itself, so I have no idea how old it is. Anyway, this might be enough to get you started. (There's a Win32::Word::Writer module, but, of course, that's the opposite of what you want.)

    The simple and glib answer to “Can you re-construct the original Word file?” is “No”. Is it acceptable to convert to some other format instead of plain text? Word itself can save as RTF or HTML, I think; the HTML it generates is pretty awful, but there are programs to tidy up for you. An HTML document will be much more faithful to the original than plain (unformatted) text would be.

Re: MS Word conversion
by dHarry (Abbot) on Oct 07, 2009 at 10:16 UTC
    recreate the original MS Word document from that text file

    I think it is safe to say you can forget about that. Not 100%, anyway. Also, there are quite a few versions around, ranging from .doc (Word 95) to .docx (Word 2007). You don't mention your specific format.

    My goal is to automate some changes to a set of MS Word documents.

    I think you'd better try your luck with some VBA macros (Uff).

Re: MS Word conversion
by mickep76 (Beadle) on Oct 07, 2009 at 06:40 UTC


    There are numerous applications that do this, but they all introduce some problems with compatibility and with how the document looks. Not even OpenOffice does this perfectly.

    I think the best you can do is to use the facilities in Word, no matter how much it might hurt your sensibilities and ours.

    But I won't spell the beast by name

Re: MS Word conversion
by Bloodnok (Vicar) on Oct 07, 2009 at 16:49 UTC
    The conversion from M$ Word format to text, unless it's to RTF (Rich Text Format), will lose all embedded formatting information, so the answer to your 1st question, as others have pointed out, is no.

    As for the 2nd question, I've always used Win32::OLE to automate tasks on M$ documents in their native format - assuming the M$ application supports OLE (some don't).

    A user level that continues to overstate my experience :-))

      The examples on the web for modifying an existing MS Word document (Word 2003 for me) either do little or error out. May I have an example where something is searched for in the document, and a sub-paragraph or section is added? Thanks for your time!

        Word 2007 introduced a "new" default file format, typically indicated by a filename extension of .docx instead of .doc (Word 2003 can open and save it only with Microsoft's compatibility pack installed). The docx format is a ZIP file containing ugly, generated XML and some other stuff. So, in theory, you just unzip the docx file, modify the XML, and zip it back into a new docx file. I think Microsoft has a description of the XML format they use, buried somewhere on their website.
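
        A minimal sketch of that unzip, modify, re-zip round trip, in Python rather than Perl so that it needs nothing beyond the standard library (Perl's Archive::Zip could be used for the same steps). The function name replace_text_in_docx is made up for illustration. It assumes the body lives in word/document.xml, that this part is UTF-8 encoded, and that the search text sits inside a single run; Word often splits visible text across several XML elements, in which case a plain string replace will not find it.

```python
import zipfile
from xml.sax.saxutils import escape

# Assumption: the main body part of a .docx package is stored under this name.
DOCUMENT_PART = "word/document.xml"


def replace_text_in_docx(src_path, dst_path, old, new):
    """Copy src_path to dst_path, replacing `old` with `new` in the body XML.

    Hypothetical helper, not part of any library. Returns the number of
    replacements made. Every other member of the ZIP is copied unchanged.
    """
    # The text is stored XML-escaped, so escape both strings before matching.
    old_xml = escape(old)
    new_xml = escape(new)
    count = 0
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == DOCUMENT_PART:
                # Assumption: the part is UTF-8, which is what Word writes.
                text = data.decode("utf-8")
                count = text.count(old_xml)
                data = text.replace(old_xml, new_xml).encode("utf-8")
            # Reusing the original ZipInfo keeps member names and order.
            dst.writestr(item, data)
    return count
```

        Calling replace_text_in_docx("in.docx", "out.docx", "DRAFT", "FINAL") would write a modified copy and leave the input untouched. Adding a whole paragraph or section means inserting proper XML elements rather than swapping strings, which this sketch does not attempt.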


        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
