In addition, Microsoft provides XML schemas, e.g. here, by which the contents of the file can be validated – also used in some forms of extraction.
If you use an “industrial strength” package such as XML::LibXML, which is based on the industry-standard libxml2 library, you will get all the goodies that you need.
IIRC, Microsoft was told a few years ago by several governments that a “closed” format was no longer acceptable for government documents ... a very sensible concern, of course. Of course, ODF is also an XML-based format. See http://oasis-open.org.