In addition, Microsoft provides XML schemas, e.g. here, by which the contents of the file can be validated also used in some forms of extraction.
If you use an “industrial strength” package such as XML::LibXML, which is based on the industry-standard libxml2 library, you will get all the goodies that you need.
IIRC, Microsoft was told a few years ago by several governments that a “closed” format was no longer acceptable for government documents ... a very sensible concern, of course. Of course, ODF is also an XML-based format. See http://oasis-open.org.