http://www.perlmonks.org?node_id=497316

DaWolf has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, brothers and sisters.

I have a very complicated problem and I was hoping that Perl could give me a hand.

I've seen a lot of nodes about MSWord files, but all of them seem to point to Win32::OLE, which is a great module but one I don't believe will help me, since I need (at least if possible) to do this on a Linux server.

Here's the scenario:

It's basically a file server, which has to control the documents published on it using a web interface.

So, when a user submits a document (typically an MSWord file), the application needs to append a customized header to this file. Please note that by "header" I mean an MSWord document header, basically a table with some information, company logo, etc...

Problem #1: How to manipulate MSWord files without "opening" MSWord via Win32::OLE?

Then, when a user clicks on the file link, he or she should be able to view the document in the MSWord-MSIE integrated interface, BUT should not be able to change the file.

The only way the user could change the contents of the document is after downloading it. After making the changes, the user should then re-submit the document to the system, which will then append a new header to it, and so on...

Problem #2: How to prevent an MSWord document opened in the MSWord-MSIE integrated interface from being changed?

A possible solution to both problems is to convert the contents of the file to another format, like HTML, and let the user view the contents directly in MSIE.

The .doc file would only be available for the user via download to make the changes.

This solution has a problem, however: the conversion must be perfect.

Is there a module that can make a ***decent*** doc -> html conversion? By decent I mean preserving tables, inline images, etc.

As you can see I'm pretty much lost here, so I was hoping that someone could give me some advice, alternatives, etc...

TIA,

Replies are listed 'Best First'.
Re: Manipulating MSWord files in a linux box?
by terra incognita (Pilgrim) on Oct 04, 2005 at 17:01 UTC
    Have you thought about saving your Word documents as XML? IIRC you need the Office Professional version to do this.

    This would allow you to display the document over the web without changing it (IE will figure out and display the file properly using XSL). You can also lock the file down by using web server controls. This would result in users being able to modify the client-side doc but not being able to save it back to the server unless they do it your way.

    When a user downloads a document and then saves it back to the server, you can check that the file contains the proper header (both text and images can be supported). If it does not, you can delete the old header and put a new one in its place.

    Update:

    The body of the Word doc is contained within the <w:body> tags. This also contains the header and footer information, which is stored under the <w:sectPr> node.

    Here is a bare body that only contains the word "BODY".

    <w:body>
      <wx:sect>
        <w:p>
          <w:r>
            <w:t>BODY</w:t>
          </w:r>
        </w:p>
        <w:p/>
        <w:p/>
        <w:p/>
      </wx:sect>
    </w:body>
    Here is a document body with the header and footer.
    <w:body>
      <wx:sect>
        <w:p>
          <w:r>
            <w:t>BODY</w:t>
          </w:r>
        </w:p>
        <w:p/>
        <w:p/>
        <w:p/>
        <w:sectPr>
          <w:hdr w:type="odd">
            <w:p>
              <w:pPr>
                <w:pStyle w:val="Header"/>
              </w:pPr>
              <w:r>
                <w:t>Header1</w:t>
              </w:r>
              <w:r>
                <w:tab wx:wTab="3525" wx:tlc="none" wx:cTlc="58"/>
              </w:r>
              <w:r>
                <w:tab wx:wTab="4320" wx:tlc="none" wx:cTlc="71"/>
              </w:r>
            </w:p>
          </w:hdr>
          <w:ftr w:type="odd">
            <w:p>
              <w:pPr>
                <w:pStyle w:val="Footer"/>
              </w:pPr>
              <w:r>
                <w:t>Footer1</w:t>
              </w:r>
            </w:p>
          </w:ftr>
          <w:pgSz w:w="12240" w:h="15840"/>
          <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0"/>
          <w:cols w:space="720"/>
          <w:docGrid w:line-pitch="360"/>
        </w:sectPr>
      </wx:sect>
    </w:body>
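A rough sketch of the "check the header, replace it if it is wrong" step, using the structure shown in the samples. It is written in Python with only the standard library; the same approach should carry over to a Perl XML parser. The two namespace URIs are my assumption for Word 2003 XML and should be checked against a real saved file. The names header_text, set_header and the "Company header" string are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs for Word 2003 XML; verify them against a real file.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
WX = "http://schemas.microsoft.com/office/word/2003/auxHint"
NS = {"w": W, "wx": WX}


def header_text(root):
    """Return the text of the first <w:hdr> under a <w:sectPr>, or None."""
    hdr = root.find(".//w:sectPr/w:hdr", NS)
    if hdr is None:
        return None
    return "".join(t.text or "" for t in hdr.iter(f"{{{W}}}t"))


def set_header(root, text):
    """Remove existing <w:hdr> nodes in each <w:sectPr> and add a new one."""
    for sect_pr in root.iter(f"{{{W}}}sectPr"):
        for old in sect_pr.findall("w:hdr", NS):
            sect_pr.remove(old)
        hdr = ET.Element(f"{{{W}}}hdr", {f"{{{W}}}type": "odd"})
        para = ET.SubElement(hdr, f"{{{W}}}p")
        run = ET.SubElement(para, f"{{{W}}}r")
        ET.SubElement(run, f"{{{W}}}t").text = text
        sect_pr.insert(0, hdr)


# Cut-down stand-in for an uploaded document (hypothetical test data).
SAMPLE = f"""<w:wordDocument xmlns:w="{W}" xmlns:wx="{WX}">
<w:body><wx:sect>
<w:p><w:r><w:t>BODY</w:t></w:r></w:p>
<w:sectPr>
<w:hdr w:type="odd"><w:p><w:r><w:t>Header1</w:t></w:r></w:p></w:hdr>
<w:pgSz w:w="12240" w:h="15840"/>
</w:sectPr>
</wx:sect></w:body>
</w:wordDocument>"""

root = ET.fromstring(SAMPLE)
original = header_text(root)
if original != "Company header":
    set_header(root, "Company header")
```

Note that this only handles documents that already have a <w:sectPr> node. A bare body like the first sample would need one created first, and a real header with a table and logo would need more markup than a single text run.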
    Hope this helps.
      Thanks, it's a very interesting suggestion and I like the idea of the content being in a structured XML file.

      The only bad thing about this is that I can't predict whether the client will have the appropriate Office/MSWord version, and since it's a commercial product it would be bad to force the client to buy software just so he can use mine...

      Update: on a related subject, take a look at Problems with Microsoft's new Office 'XML'. It seems that MS is already causing us trouble...

        I agree - I'd hate for you to force me to use MSWord at all ;-)

        Alternate suggestions. Whether they're nearly as draconian is up to you.

        Manipulate PDF files instead. Require the user to send you a PDF rather than a doc file. Link to an open-source print driver that outputs to PDF for those who have older word processors on Windows, and link to openoffice.org for all platforms as a good way to write to PDF.

        Manipulate openoffice.org .sxw files instead. Forcing the client to buy software is one thing. Forcing them to use software that doesn't cost them anything is quite a bit different. Different enough? You decide.

        Otherwise, you'll probably be stuck with moving your code over to a Windows box where you can use OLE. Even that is dangerous on a web server - I'm not sure what happens when two users come in at the same time ... ;-) At the very least, test it to be sure.
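On the .sxw option above: an .sxw file is a ZIP archive of XML parts, so it can be inspected on a Linux box without any office software installed. Below is a minimal sketch in Python, standard library only. The helper names list_sxw_parts and read_sxw_part are hypothetical, and the archive built at the end is only a stand-in for a real upload. I'd expect a real file to contain at least content.xml and styles.xml, but check against an actual document.

```python
import io
import zipfile


def list_sxw_parts(source):
    """Return the member names of an .sxw archive (path or file object)."""
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


def read_sxw_part(source, part="content.xml"):
    """Return the raw bytes of one member of an .sxw archive."""
    with zipfile.ZipFile(source) as archive:
        return archive.read(part)


# Build a tiny stand-in archive in memory (hypothetical test data,
# not a complete or valid OpenOffice.org document).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("content.xml", "<doc><p>Hello</p></doc>")
    archive.writestr("styles.xml", "<styles/>")

parts = list_sxw_parts(buffer)
content = read_sxw_part(buffer)
```

Writing a modified part back means creating a new archive and copying the other members across, since the zipfile module cannot replace a member in place.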