PerlMonks  

How can I read the .docx file in perl?

by prabuvos (Initiate)
on Apr 17, 2013 at 08:52 UTC (#1029093)
prabuvos has asked for the wisdom of the Perl Monks concerning the following question:

Hello guys! How can I read .docx files in Perl on the Linux platform?

Replies are listed 'Best First'.
Re: How can I read the .docx file in perl?
by roboticus (Chancellor) on Apr 17, 2013 at 12:49 UTC

    prabuvos:

    Unzip it ... it's just a collection of XML files. Then you can dig through them with your favorite XML parser. For many documents, you'll need only the items in the word directory.

    $ unzip hack.docx
    Archive:  hack.docx
      inflating: [Content_Types].xml
      inflating: _rels/.rels
      inflating: word/_rels/document.xml.rels
      inflating: word/document.xml
      inflating: word/_rels/header2.xml.rels
      inflating: word/footer2.xml
      inflating: word/footer1.xml
      inflating: word/footer3.xml
      inflating: word/header2.xml
      inflating: word/header1.xml
      inflating: word/endnotes.xml
      inflating: word/footnotes.xml
      inflating: word/header3.xml
     extracting: word/media/image1.jpeg
      inflating: word/theme/theme1.xml
      inflating: word/_rels/settings.xml.rels
      inflating: word/settings.xml
      inflating: word/styles.xml
      inflating: word/webSettings.xml
      inflating: word/numbering.xml
      inflating: docProps/app.xml
      inflating: docProps/core.xml
      inflating: word/fontTable.xml
      inflating: docProps/custom.xml
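    The same step can be done from within Perl without shelling out. The following is a minimal sketch, assuming the core module IO::Uncompress::Unzip is available; the helper name read_docx_member and its default member path are illustrative choices, not something from the post above.

```perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw($UnzipError);

# Read one member of a .docx (which is a zip archive) into a string.
# read_docx_member() is a hypothetical helper name, not part of any module.
sub read_docx_member {
    my ($docx_path, $member) = @_;
    $member = 'word/document.xml' unless defined $member;

    # Open the archive positioned at the requested member.
    my $u = IO::Uncompress::Unzip->new($docx_path, Name => $member)
        or die "Cannot read $member from $docx_path: $UnzipError\n";

    # Slurp the whole member; an empty member yields an empty string.
    my $xml = do { local $/; <$u> };
    $u->close;
    return defined $xml ? $xml : '';
}

# Usage: perl script.pl hack.docx  (prints the raw XML of the main document part)
print read_docx_member($ARGV[0]) if @ARGV;
```

    The returned string is raw XML, so it still needs an XML parser before any text can be pulled out reliably.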

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      In addition, Microsoft provides XML schemas against which the contents of the file can be validated. Some forms of extraction use them as well.

      If you use an “industrial strength” package such as XML::LibXML, which is based on the industry-standard libxml2 library, you will get all the goodies that you need.
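      As a rough illustration of the XML::LibXML route, the sketch below pulls the paragraph text out of the XML of word/document.xml. It assumes XML::LibXML is installed from CPAN; docx_paragraphs is a hypothetical helper name, and the approach (joining the w:t text runs of each w:p) is a simplification that ignores tabs, line breaks, and paragraphs nested in text boxes.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Main WordprocessingML namespace used by word/document.xml.
my $W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

# Return one string per paragraph (w:p) found in the given XML string.
# docx_paragraphs() is a hypothetical helper name for this sketch.
sub docx_paragraphs {
    my ($xml) = @_;

    my $doc = XML::LibXML->load_xml(string => $xml);
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(w => $W_NS);

    my @paragraphs;
    for my $p ($xpc->findnodes('//w:p')) {
        # Join the text runs (w:t) of this paragraph in document order.
        my @runs = map { $_->textContent } $xpc->findnodes('.//w:t', $p);
        push @paragraphs, join '', @runs;
    }
    return @paragraphs;
}
```

      Called on the contents of word/document.xml, this returns a list that can be printed with one paragraph per line, for example with print "$_\n" for docx_paragraphs($xml);.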

      IIRC, several governments told Microsoft a few years ago that a “closed” format was no longer acceptable for government documents ... a very sensible concern.   ODF is also an XML-based format.   See http://oasis-open.org.

Re: How can I read the .docx file in perl?
by Anonymous Monk on Apr 17, 2013 at 08:59 UTC
Re: How can I read the .docx file in perl?
by marto (Bishop) on Apr 17, 2013 at 09:04 UTC

    Similar to your last post, I would suggest you spend some time investigating automating LibreOffice. In theory you could start LibreOffice Writer running in headless mode, listening on a port, and then use UNO to query the document in question.

        This is a dead end, perluno/OpenOffice::UNO haven't worked for years.

        What do you mean by this?

        I wasn't suggesting it has to be done via an existing module.

        Update: I think it's worth noting that simply because this module hasn't worked for a long time, if at all, this doesn't mean that this is impossible.

        Update: Fixed typo /nothing/noting/.
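
        For what it's worth, LibreOffice can also be driven without any UNO binding by calling its command-line converter. This is a different technique from the listening-on-a-port setup described in this subthread. The sketch below assumes a soffice binary is on the PATH; the helper names soffice_to_text_cmd and docx_to_text_file are hypothetical.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Build the argument list for a headless LibreOffice conversion to plain text.
# soffice_to_text_cmd() is a hypothetical helper; "soffice" is assumed to be on PATH.
sub soffice_to_text_cmd {
    my ($docx_path, $out_dir) = @_;
    return ('soffice', '--headless', '--convert-to', 'txt:Text',
            '--outdir', $out_dir, $docx_path);
}

# Run the conversion and return the path of the text file LibreOffice writes,
# which is the input's base name with a .txt extension inside $out_dir.
sub docx_to_text_file {
    my ($docx_path, $out_dir) = @_;

    my @cmd = soffice_to_text_cmd($docx_path, $out_dir);
    # List form of system() avoids passing the file name through a shell.
    system(@cmd) == 0
        or die "soffice failed (\$? = $?)\n";

    my $base = basename($docx_path, '.docx');
    return "$out_dir/$base.txt";
}
```

        After a successful run, the returned file can be read like any other text file. Formatting, headers, and footnotes may not survive the conversion in a useful form.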

Node Type: perlquestion [id://1029093]
Front-paged by Corion