How to Truncate Corrupt Document.xml Files?
by socrtwo (Sexton) on Feb 16, 2012 at 00:33 UTC
socrtwo has asked for the
wisdom of the Perl Monks concerning the following question:
Hi Monks. My apologies if I'm offending anybody by asking for help too early in the development of the solution for this issue.
I'm working with corrupt MS Word .docx (Office Open XML) files. These are files in which the document.xml part, where the document's text resides, is corrupt, often because it was truncated at a random point.
By rough experimentation I have found that I can get MS Word to open the file if I extract whatever is recoverable from the zip structure with a corruption-tolerant unzipper such as CakeCMD or no-frills unzipper, clean out the partial tag at the end of the XML file, append </w:t></w:r></w:p></w:body></w:document>, and then rezip the parts.
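As a minimal sketch of that manual fix: the helper below drops an incomplete tag at the end of the text and appends a fixed closing sequence. The name close_truncated_xml and the sample string are made up for illustration, and the fixed tail is only correct when the truncation falls inside a text run of an ordinary paragraph.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any CPAN module): strip a tag that was cut
# off mid-way at the end of $xml, then append the caller-supplied $tail.
sub close_truncated_xml {
    my ($xml, $tail) = @_;
    my $last_open  = rindex($xml, '<');
    my $last_close = rindex($xml, '>');
    # A '<' with no later '>' means the final tag is incomplete.
    $xml = substr($xml, 0, $last_open) if $last_open > $last_close;
    return $xml . $tail;
}

# Made-up sample: truncated inside a run, with a dangling '<w:' at the end.
my $broken = '<w:document><w:body><w:p><w:r><w:t>Hello wor<w:';
my $tail   = '</w:t></w:r></w:p></w:body></w:document>';
print close_truncated_xml($broken, $tail), "\n";
```

This assumes no literal '<' or '>' appears inside an attribute value near the end of the file.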
With one file, where the truncation occurred in the middle of a table, I found that I needed to add </w:t></w:r></w:p></w:tc></w:tr></w:tbl></w:body></w:document> instead. Then I rezipped the files and Word could again open the document.
So my holy grail now is a general Perl script that can process document.xml files, truncate each file just before the first XML error, and then add the appropriate closing tags so as not to offend MS Word 2007 or 2010. One way I thought of was to make a list of all non-self-closing tags in document.xml, then step backwards from the end looking for the first instance of each unclosed tag, and add the respective closing tags in the order in which the unclosed tags were encountered.
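One way to sketch the tag-closing step is a forward scan with a stack instead of the backward walk described above: push each opening tag, pop on its matching closing tag, and whatever remains on the stack at the end is what still needs closing, innermost first. The function names are hypothetical, the tag regex is a simplification (it assumes no '<' or '>' inside attribute values, and no comments or CDATA sections), and it only handles truncation at the end of the file, not an error in the middle.

```perl
use strict;
use warnings;

# Hypothetical helper: return the closing tags needed to balance $xml,
# innermost element first.
sub closing_tags_for {
    my ($xml) = @_;
    my @open;    # stack of element names that are currently open
    while ($xml =~ m{<(/?)([A-Za-z_][\w:.-]*)([^<>]*)>}g) {
        my ($is_close, $name, $rest) = ($1, $2, $3);
        next if $rest =~ m{/\s*\z};    # self-closing tag such as <w:br/>
        if ($is_close) {
            pop @open if @open && $open[-1] eq $name;
        }
        else {
            push @open, $name;
        }
    }
    return join '', map { "</$_>" } reverse @open;
}

# Hypothetical helper: strip a trailing partial tag, then append the
# computed closing tags.
sub repair_truncated_xml {
    my ($xml) = @_;
    $xml =~ s/<[^>]*\z//;    # remove a tag cut off before its '>'
    return $xml . closing_tags_for($xml);
}

# Made-up sample: truncated inside a table cell.
my $broken = '<w:document><w:body><w:tbl><w:tr><w:tc><w:p><w:r><w:t>cell te<w:';
print closing_tags_for($broken), "\n";
# prints </w:t></w:r></w:p></w:tc></w:tr></w:tbl></w:body></w:document>
```

For the table case this yields the same tail that had to be added by hand. Whether Word accepts the result for every truncation point is not something the sketch can guarantee.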
I had a look at CPAN and nothing jumped out at me about how to find the first error in an XML file, nor how to truncate an XML file, not to speak of walking back through an XML file looking for unclosed tags and closing them in the order found. I did see elsewhere that the regular expression <[^<>]+[^/]> would return those opening tags that are not self-closing. I know I'm fishing for some code from you all, but just some help as to which CPAN module to use, and maybe a better overall idea of how to approach this programmatically, would be nice.
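A quick check of that regular expression on a made-up fragment suggests it also matches closing tags such as </w:t>, since they do not end in "/>" either. The stricter pattern below is an assumption, not something from the source where the regex was found: it rejects tags that start with '/', '?' or '!' and tags whose last character before '>' is '/'.

```perl
use strict;
use warnings;

# Made-up fragment with opening, self-closing and closing tags.
my $sample = '<w:p><w:r><w:br/><w:t>Hi</w:t></w:r>';

# The pattern quoted above.
my @loose = $sample =~ m{<[^<>]+[^/]>}g;

# Stricter variant (an assumption): opening tags only, no self-closing tags.
my @strict = $sample =~ m{<(?![/?!])[^<>]*(?<!/)>}g;

print "loose:  @loose\n";     # <w:p> <w:r> <w:t> </w:t> </w:r>
print "strict: @strict\n";    # <w:p> <w:r> <w:t>
```

Either pattern still assumes that attribute values never contain '<' or '>'.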
I was looking at PHP too; however, I think I want to do this in Perl and then compile it for use both in my free MS Office service (which now just extracts text and doesn't return full Word files) and in a planned open source VB.NET Word recovery program.