
How can I read the .docx file in perl?

by prabuvos (Initiate)
on Apr 17, 2013 at 08:52 UTC
prabuvos has asked for the wisdom of the Perl Monks concerning the following question:

Hello guys! How can I read .docx files in Perl on the Linux platform?

Replies are listed 'Best First'.
Re: How can I read the .docx file in perl?
by roboticus (Chancellor) on Apr 17, 2013 at 12:49 UTC


    Unzip it ... it's just a collection of XML files. Then you can dig through them with your favorite XML parser. For many documents, you'll need only the items in the word directory.

    $ unzip hack.docx
    Archive:  hack.docx
      inflating: [Content_Types].xml
      inflating: _rels/.rels
      inflating: word/_rels/document.xml.rels
      inflating: word/document.xml
      inflating: word/_rels/header2.xml.rels
      inflating: word/footer2.xml
      inflating: word/footer1.xml
      inflating: word/footer3.xml
      inflating: word/header2.xml
      inflating: word/header1.xml
      inflating: word/endnotes.xml
      inflating: word/footnotes.xml
      inflating: word/header3.xml
     extracting: word/media/image1.jpeg
      inflating: word/theme/theme1.xml
      inflating: word/_rels/settings.xml.rels
      inflating: word/settings.xml
      inflating: word/styles.xml
      inflating: word/webSettings.xml
      inflating: word/numbering.xml
      inflating: docProps/app.xml
      inflating: docProps/core.xml
      inflating: word/fontTable.xml
      inflating: docProps/custom.xml
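The same unzip step can be done from inside Perl. What follows is only a minimal sketch, assuming the core module IO::Uncompress::Unzip is available and that the main body of the document lives in word/document.xml, as in the listing above. The helper name read_docx_member is made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw(unzip $UnzipError);

# Hypothetical helper: return the raw bytes of one member of a .docx
# (which is an ordinary zip archive), e.g. 'word/document.xml'.
sub read_docx_member {
    my ($docx_path, $member) = @_;
    my $content;
    # Name => ... selects a single member; the output goes to a scalar.
    unzip $docx_path => \$content, Name => $member
        or die "Cannot read $member from $docx_path: $UnzipError\n";
    return $content;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    print read_docx_member($ARGV[0], 'word/document.xml');
}
```

Run as, say, `perl read_member.pl hack.docx` (the script name is arbitrary), this should print the raw XML of the document body, ready to hand to an XML parser.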


    When your only tool is a hammer, all problems look like your thumb.

      In addition, Microsoft provides XML schemas, e.g. here, against which the contents of the file can be validated; they can also be used in some forms of extraction.

      If you use an “industrial strength” package such as XML::LibXML, which is based on the industry-standard libxml2 library, you will get all the goodies that you need.
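As a rough illustration of the XML::LibXML route, here is a sketch that pulls the paragraph text out of the contents of word/document.xml. It assumes the usual WordprocessingML namespace and that text sits in w:t elements inside w:p paragraphs. The function name docx_paragraphs is invented for the example, and things like text boxes, footnotes, tabs and line breaks are not handled.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Namespace used by the main WordprocessingML parts (assumed here).
my $W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

# Hypothetical helper: take the XML of word/document.xml as a string
# and return one string per paragraph (w:p) found in the body.
sub docx_paragraphs {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(w => $W_NS);

    my @paragraphs;
    for my $p ($xpc->findnodes('//w:body//w:p')) {
        # A paragraph's text is split over several runs; join the w:t pieces.
        push @paragraphs,
            join '', map { $_->textContent } $xpc->findnodes('.//w:t', $p);
    }
    return @paragraphs;
}

# Only runs when given the path of an already extracted document.xml.
if (@ARGV) {
    open my $fh, '<:raw', $ARGV[0] or die "Cannot open $ARGV[0]: $!\n";
    my $xml = do { local $/; <$fh> };
    close $fh;
    binmode STDOUT, ':encoding(UTF-8)';
    print "$_\n" for docx_paragraphs($xml);
}
```

Joining the w:t pieces per paragraph matters because Word often splits a single sentence across several runs, so reading the text nodes one by one would give fragments.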

      IIRC, Microsoft was told a few years ago by several governments that a “closed” format was no longer acceptable for government documents ... a very sensible concern, of course.   Of course, ODF is also an XML-based format.   See

Re: How can I read the .docx file in perl?
by Anonymous Monk on Apr 17, 2013 at 08:59 UTC
Re: How can I read the .docx file in perl?
by marto (Bishop) on Apr 17, 2013 at 09:04 UTC

    Similar to your last post, I would suggest you spend some time investigating automating LibreOffice. In theory you could start LibreOffice Writer running in headless mode, listening on a port, and then use UNO to query the document in question.
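For what it's worth, a simpler variant of the headless idea does not go through UNO at all: LibreOffice's command-line converter can turn the .docx into plain text, and Perl just reads the result. This is a different technique from the port-and-UNO approach described above, sketched here on the assumption that a `soffice` binary is on the PATH. The helper names are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Temp qw(tempdir);

# Hypothetical helper: build the command line for a headless conversion.
# Assumes the LibreOffice launcher is called 'soffice' and is on the PATH.
sub convert_command {
    my ($docx, $outdir) = @_;
    return ('soffice', '--headless', '--convert-to', 'txt:Text',
            '--outdir', $outdir, $docx);
}

# Hypothetical helper: convert a .docx to text and return the text.
sub docx_to_text {
    my ($docx) = @_;
    my $outdir = tempdir(CLEANUP => 1);
    system(convert_command($docx, $outdir)) == 0
        or die "soffice conversion failed (status $?)\n";

    # The converter writes <basename>.txt into the output directory.
    my ($base) = fileparse($docx, qr/\.[^.]*/);
    my $txt = "$outdir/$base.txt";
    # The encoding of the output depends on the LibreOffice text filter
    # settings, so the file is read here without any decoding layer.
    open my $fh, '<', $txt or die "Cannot open $txt: $!\n";
    my $text = do { local $/; <$fh> };
    close $fh;
    return $text;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    print docx_to_text($ARGV[0]);
}
```

This trades control for convenience: you get whatever LibreOffice considers the text of the document, with no access to styles or structure.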

        This is a dead end, perluno/OpenOffice::UNO haven't worked for years.

        What do you mean by this?

        I wasn't suggesting it has to be done via an existing module.

        Update: I think it's worth noting that simply because this module hasn't worked for a long time, if at all, this doesn't mean that this is impossible.

        Update: Fixed typo /nothing/noting/.
