
Re: How can I read the .docx file in perl?

by roboticus (Chancellor)
on Apr 17, 2013 at 12:49 UTC

in reply to How can I read the .docx file in perl?


Unzip it ... it's just a collection of XML files. Then you can dig through them with your favorite XML parser. For many documents, you'll need only the items in the word directory.

$ unzip hack.docx
Archive:  hack.docx
  inflating: [Content_Types].xml
  inflating: _rels/.rels
  inflating: word/_rels/document.xml.rels
  inflating: word/document.xml
  inflating: word/_rels/header2.xml.rels
  inflating: word/footer2.xml
  inflating: word/footer1.xml
  inflating: word/footer3.xml
  inflating: word/header2.xml
  inflating: word/header1.xml
  inflating: word/endnotes.xml
  inflating: word/footnotes.xml
  inflating: word/header3.xml
 extracting: word/media/image1.jpeg
  inflating: word/theme/theme1.xml
  inflating: word/_rels/settings.xml.rels
  inflating: word/settings.xml
  inflating: word/styles.xml
  inflating: word/webSettings.xml
  inflating: word/numbering.xml
  inflating: docProps/app.xml
  inflating: docProps/core.xml
  inflating: word/fontTable.xml
  inflating: docProps/custom.xml
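If you'd rather stay inside Perl than shell out to unzip, the same idea can be sketched with the Archive::Zip module from CPAN. This is only a minimal illustration, not a complete reader: the helper name docx_member is made up for this example, and it assumes the document body lives in word/document.xml, as in the listing above.

```perl
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);

# Hypothetical helper: return the raw XML of one member of a .docx archive.
# The member defaults to the main document body, word/document.xml.
sub docx_member {
    my ($path, $member) = @_;
    $member //= 'word/document.xml';

    my $zip = Archive::Zip->new();
    $zip->read($path) == AZ_OK
        or die "Cannot read '$path' as a zip archive\n";

    # memberNamed returns undef when the archive has no such entry.
    my $entry = $zip->memberNamed($member)
        or die "No '$member' in '$path'\n";

    # In scalar context, contents returns the uncompressed data.
    return scalar $entry->contents;
}

# Usage: perl script.pl hack.docx
print docx_member($ARGV[0]) if @ARGV;
```

The result is a string of XML, so it can be handed directly to whichever XML parser you prefer without writing anything to disk.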


When your only tool is a hammer, all problems look like your thumb.

Replies are listed 'Best First'.
Re^2: How can I read the .docx file in perl?
by sundialsvc4 (Abbot) on Apr 17, 2013 at 13:35 UTC

    In addition, Microsoft provides XML schemas, e.g. here, against which the contents of the file can be validated, and which can also be used in some forms of extraction.

    If you use an “industrial strength” package such as XML::LibXML, which is based on the industry-standard libxml2 library, you will get all the goodies that you need.
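As a rough sketch of what that looks like with XML::LibXML: the snippet below pulls the text of each paragraph out of a string of WordprocessingML (the contents of word/document.xml). The function name paragraphs_from_xml is invented for this example, and it assumes the usual "w" namespace URI that Word writes. It ignores tabs, line breaks, footnotes and the like, so treat it as a starting point.

```perl
use strict;
use warnings;
use XML::LibXML;

# Assumed namespace: the WordprocessingML main namespace Word normally uses.
my $W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

# Hypothetical helper: return one string per paragraph (w:p),
# built by joining the text runs (w:t) found inside it.
sub paragraphs_from_xml {
    my ($xml) = @_;

    my $doc = XML::LibXML->load_xml(string => $xml);
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(w => $W_NS);

    my @paragraphs;
    for my $p ($xpc->findnodes('//w:p')) {
        # Search relative to the current paragraph node.
        my @runs = $xpc->findnodes('.//w:t', $p);
        push @paragraphs, join '', map { $_->textContent } @runs;
    }
    return @paragraphs;
}
```

Calling it is just a matter of passing in the XML string, for example print "$_\n" for paragraphs_from_xml($xml); the namespace has to be registered because every element in the document is in the "w" namespace, and an unprefixed XPath such as //p would match nothing.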

    IIRC, several governments told Microsoft a few years ago that a “closed” format was no longer acceptable for government documents ... a very sensible concern. ODF, of course, is also an XML-based format.   See
