Re: How can I read the .docx file in perl?

by roboticus (Chancellor)
on Apr 17, 2013 at 12:49 UTC (#1029137)


in reply to How can I read the .docx file in perl?

prabuvos:

Unzip it ... a .docx file is just a zip archive holding a collection of XML files. Then you can dig through them with your favorite XML parser. For many documents, you'll need only the items in the word directory; the main body text lives in word/document.xml.

$ unzip hack.docx
Archive:  hack.docx
  inflating: [Content_Types].xml
  inflating: _rels/.rels
  inflating: word/_rels/document.xml.rels
  inflating: word/document.xml
  inflating: word/_rels/header2.xml.rels
  inflating: word/footer2.xml
  inflating: word/footer1.xml
  inflating: word/footer3.xml
  inflating: word/header2.xml
  inflating: word/header1.xml
  inflating: word/endnotes.xml
  inflating: word/footnotes.xml
  inflating: word/header3.xml
 extracting: word/media/image1.jpeg
  inflating: word/theme/theme1.xml
  inflating: word/_rels/settings.xml.rels
  inflating: word/settings.xml
  inflating: word/styles.xml
  inflating: word/webSettings.xml
  inflating: word/numbering.xml
  inflating: docProps/app.xml
  inflating: docProps/core.xml
  inflating: word/fontTable.xml
  inflating: docProps/custom.xml
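
If you would rather stay inside Perl than shell out to unzip, the same idea can be sketched roughly as follows. This is only a minimal sketch, not a complete .docx reader: it assumes the body text is in word/document.xml (as in the listing above), it uses the core IO::Uncompress::Unzip module plus XML::LibXML from CPAN, and the helper names read_docx_member and paragraphs_from_xml are made up for illustration. It collects the w:t text runs of each w:p paragraph and ignores headers, footers, footnotes and formatting.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw(unzip $UnzipError);
use XML::LibXML;
use XML::LibXML::XPathContext;

# Namespace used by WordprocessingML elements such as w:p and w:t.
my $W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

# Hypothetical helper: read one member of the .docx (a zip file) into memory.
sub read_docx_member {
    my ($docx_file, $member) = @_;
    my $xml;
    unzip($docx_file => \$xml, Name => $member)
        or die "Cannot read $member from $docx_file: $UnzipError\n";
    return $xml;
}

# Hypothetical helper: return one string per paragraph (w:p),
# built by joining the text runs (w:t) found inside it.
sub paragraphs_from_xml {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(w => $W_NS);

    my @paragraphs;
    for my $p ($xpc->findnodes('//w:body//w:p')) {
        push @paragraphs,
            join '', map { $_->textContent } $xpc->findnodes('.//w:t', $p);
    }
    return @paragraphs;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    my $xml = read_docx_member($ARGV[0], 'word/document.xml');
    binmode STDOUT, ':encoding(UTF-8)';
    print "$_\n" for paragraphs_from_xml($xml);
}
```

Run it as, say, perl docx_text.pl hack.docx (the script name is arbitrary). Paragraphs nested inside text boxes would be reported twice by this simple XPath, so treat it as a starting point.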

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Re^2: How can I read the .docx file in perl?
by sundialsvc4 (Abbot) on Apr 17, 2013 at 13:35 UTC

    In addition, Microsoft publishes XML schemas for the format (the original post linked to an example), against which the contents of the file can be validated. Some extraction approaches also rely on them.

    If you use an “industrial strength” package such as XML::LibXML, which is based on the industry-standard libxml2 library, you will get all the goodies that you need.
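
    The validation idea can be sketched with XML::LibXML::Schema, which ships with XML::LibXML. Treat this as an illustration under assumptions rather than a recipe: the helper name schema_error is invented here, and the schema file you pass in is whatever .xsd you have downloaded (no particular file name is implied by this thread). The real Office schemas import one another, so loading them by location, with all the .xsd files side by side, is likely to be necessary.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: returns undef when $doc is valid against $schema,
# otherwise the validation error message as a string.
sub schema_error {
    my ($doc, $schema) = @_;
    # validate() dies on failure, so trap the exception.
    my $ok = eval { $schema->validate($doc); 1 };
    return $ok ? undef : "$@";
}

# Usage: perl validate.pl extracted/word/document.xml path/to/schema.xsd
# (both paths are placeholders supplied by the caller)
if (@ARGV == 2) {
    my ($xml_file, $xsd_file) = @ARGV;
    my $doc    = XML::LibXML->load_xml(location => $xml_file);
    my $schema = XML::LibXML::Schema->new(location => $xsd_file);
    my $err    = schema_error($doc, $schema);
    print defined $err ? "INVALID: $err" : "valid\n";
}
```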

    IIRC, Microsoft was told a few years ago by several governments that a “closed” format was no longer acceptable for government documents, which is a very sensible concern. ODF is also an XML-based format; see http://oasis-open.org.
