Re: How to Truncate Corrupt Document.xml Files?

by educated_foo (Vicar)
in reply to How to Truncate Corrupt Document.xml Files?

I would start by using a streaming (SAX) parser and maintaining a stack of unclosed tags. Have you tried that yet?
Re^2: How to Truncate Corrupt Document.xml Files?
by socrtwo (Sexton) on Feb 16, 2012 at 02:11 UTC
    I haven't tried that yet. Thanks for the heads-up. I'm looking at streaming SAX parsing now. The Ruby gem Nokogiri looks well suited for this, but I don't know any Ruby at the moment. I do know a little Perl, and there are a lot of SAX modules for it.
      I don't parse much XML (thank God), but XML::Parser (originally written by Larry Wall) has always been pretty straightforward to use -- just define Start() and End() handlers for a start.

        I read that a SAX parser is not well suited to rebuilding the XML document, which is what I want to do, unless I use two parser instances: one SAX parser to analyze the document.xml file, and a second using XML::Parser to add the missing end tags and rebuild document.xml.

        However, is there any real benefit to using SAX here? Can't I just define a start handler with XML::Parser that pushes each non-self-closing tag onto an array, and an end handler that removes the matching tag from the same array? At the end of parsing, the array would hold only the tags the end handler never saw, and their closing tags could be appended to the end of the XML file in reverse order (last in, first out).
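
A minimal sketch of the start/end-handler idea described above, written in Python rather than Perl. It uses the standard `xml.parsers.expat` module, which wraps the same Expat library that XML::Parser is built on. The function name `close_open_tags` is hypothetical, and the sketch assumes the file is UTF-8 encoded. It also assumes that cutting the data at the byte offset where Expat reports the error is acceptable, which discards any partially written tag at the end.

```python
import xml.parsers.expat


def close_open_tags(data: bytes) -> bytes:
    """Cut `data` at the first parse error and close every element still open.

    Assumes UTF-8 input; `close_open_tags` is an illustrative name only.
    """
    open_tags = []  # stack of element names that have not been closed yet

    parser = xml.parsers.expat.ParserCreate()
    # Expat fires both handlers for a self-closing tag such as <w:br/>,
    # so those push and pop immediately and never linger on the stack.
    parser.StartElementHandler = lambda name, attrs: open_tags.append(name)
    parser.EndElementHandler = lambda name: open_tags.pop()

    try:
        parser.Parse(data, True)
        return data  # already well-formed, nothing to repair
    except xml.parsers.expat.ExpatError:
        # Byte offset at which Expat gave up; keep everything before it.
        cut = max(parser.ErrorByteIndex, 0)

    # Emit closing tags in reverse order: last opened, first closed.
    closers = "".join(f"</{name}>" for name in reversed(open_tags))
    return data[:cut] + closers.encode("utf-8")


if __name__ == "__main__":
    broken = b"<w:document><w:body><w:p><w:r><w:t>Hello wor"
    print(close_open_tags(broken).decode("utf-8"))
```

Because the handlers only track element names and the original bytes are reused up to the cut point, nothing has to be re-serialized, so a second parser instance is not needed for this particular repair. Whether the result is a valid Word document, and not just well-formed XML, is a separate question this sketch does not address.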

