Accessing Meta data from MS WORD

by ghouse_55 (Initiate)
Hi, I need some information on accessing the metadata of an MS Word document via Perl. Is this possible, and can you give me some guidance on how to go about it?

Re: Accessing Meta data from MS WORD
by thmsdrew (Scribe) on Aug 07, 2012 at 10:46 UTC

    Well, a .docx file is actually just a zip archive that contains, among other things, the metadata you speak of. In Perl you can open the archive, extract the metadata (which is stored as XML), and then parse that XML for what you need. Each of these tasks can be done with Perl modules available on CPAN.

      ... assuming of course that the file in question is in Microsoft's OpenXML format. Older versions of Word used the proprietary binary ".doc" format, which is still quite frequently used.
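
The two points above (check the container format first, then pull the metadata part out of the archive) could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes the CPAN module Archive::Zip is installed and that the core properties live in the package part `docProps/core.xml`, which is the usual location in Word-generated .docx files. The function names `sniff_word_format` and `read_core_properties_xml` are made up for illustration.

```perl
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);

# Guess the container format from the first bytes of the file.
# Returns 'zip' (possible .docx), 'ole' (possible legacy .doc), or 'unknown'.
# A 'zip' result only says the file is a zip archive, not that it is a valid .docx.
sub sniff_word_format {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $n = read($fh, my $magic, 8);
    close $fh;
    return 'unknown' unless defined $n && $n >= 4;
    return 'zip' if substr($magic, 0, 4) eq "PK\x03\x04";
    return 'ole' if $magic eq "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1";
    return 'unknown';
}

# Return the raw XML of the core-properties part, or undef if the part is absent.
# Assumption: the part is named docProps/core.xml.
sub read_core_properties_xml {
    my ($path) = @_;
    my $zip = Archive::Zip->new();
    $zip->read($path) == AZ_OK
        or die "Cannot read $path as a zip archive";
    return scalar $zip->contents('docProps/core.xml');
}
```

A caller might run `sniff_word_format($file)` first and only call `read_core_properties_xml($file)` when the result is `'zip'`; a legacy binary .doc would need a different approach entirely.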

Re: Accessing Meta data from MS WORD
by sundialsvc4 (Abbot) on Aug 07, 2012 at 12:58 UTC

    /me nods...

    IIRC, a docx file is a zip-compressed package of XML files with a well-known public schema. If you do not find a CPAN module that already does what you want, one approach is to write code that unzips it and then queries the XML content using XPath expressions, thus avoiding the need to write code that walks the XML's internal structure by hand. But it is extremely likely that what you are doing is “a thing already done.”
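
The unzip-then-XPath approach might look something like the sketch below. It assumes the CPAN modules Archive::Zip and XML::LibXML are installed, and that the metadata of interest is in `docProps/core.xml`. The helper names `core_properties_from_xml` and `docx_core_properties` are hypothetical, as is the choice of which properties to pull out.

```perl
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);
use XML::LibXML;
use XML::LibXML::XPathContext;

# Namespaces used by the core-properties part of an Office Open XML package.
my %NS = (
    cp      => 'http://schemas.openxmlformats.org/package/2006/metadata/core-properties',
    dc      => 'http://purl.org/dc/elements/1.1/',
    dcterms => 'http://purl.org/dc/terms/',
);

# XPath for each property we want; the hash keys are illustrative names.
my %WANTED = (
    title            => '/cp:coreProperties/dc:title',
    creator          => '/cp:coreProperties/dc:creator',
    last_modified_by => '/cp:coreProperties/cp:lastModifiedBy',
    created          => '/cp:coreProperties/dcterms:created',
    modified         => '/cp:coreProperties/dcterms:modified',
);

# Parse core-properties XML and return a hashref of the properties found.
# Properties missing from the document are simply left out of the hash.
sub core_properties_from_xml {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs($_, $NS{$_}) for keys %NS;

    my %props;
    for my $name (keys %WANTED) {
        my ($node) = $xpc->findnodes($WANTED{$name});
        $props{$name} = $node->textContent if $node;
    }
    return \%props;
}

# Open a .docx, pull out docProps/core.xml, and return its properties.
sub docx_core_properties {
    my ($path) = @_;
    my $zip = Archive::Zip->new();
    $zip->read($path) == AZ_OK
        or die "Cannot read $path as a zip archive";
    my $xml = $zip->contents('docProps/core.xml');
    die "No docProps/core.xml in $path" unless defined $xml;
    return core_properties_from_xml($xml);
}
```

Registering the namespace prefixes up front is what lets the XPath expressions stay short; without it, a namespaced element such as `dc:title` would not match.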

