
Repair malformed XML

by spoulson (Beadle)
on Feb 03, 2005 at 15:20 UTC (#427666=perlquestion)
spoulson has asked for the wisdom of the Perl Monks concerning the following question:

I'm using a program (by Microsoft, no less) that generates an extremely large (80MB) XML file. In some cases, this program is known to output malformed XML. Closing tags are missing in much of the file. There is a possibility this could be done with some tool out there, but I'd really like to see if Perl can do this quickly and easily with existing modules.

An example of the malformed XML looks like:

<cc:files>
  <o:destination><![CDATA[/Documents/some file.pdf]]></o:destination>
  <y:name><![CDATA[some file.pdf]]></y:name>
  <r:DenyAccess dt="mv.string"></r:DenyAccess>
  <o:version>
    <d:contentclass><![CDATA[urn:content-classes:baseddocument]]></d:contentclass>
    <o:source><![CDATA[\sources\some file.pdf\some file.pdf]]></o:source>
    <a:FriendlyVersionID>1.0</a:FriendlyVersionID>
    <a:owner><![CDATA[SERVER\iusr_server]]></a:owner>
    <a:CreatedTimeStamp>2/1/2005 6:41:30 PM</a:CreatedTimeStamp>
    <a:DocumentState>approved</a:DocumentState>
    <a:IsCurrentVersion>False</a:IsCurrentVersion>
</cc:files>
Here, you can see that <o:version> does not have a closing tag. While I do not have a DTD or Schema of this file, I believe I can make the assumption that <o:version> encloses everything up to </cc:files>. This pattern of missing </o:version> tags repeats over and over throughout the XML file for each file that it describes (there are 20,000+ files). But, there are some cases in the XML where the </o:version> closing tag does appear where it should. The process needs to determine if the tag is missing. Heck, if any closing tag is missing.

I know XML::LibXML won't like it, because it must be well-formed. I imagine XML::Parser could do this, but I can't really visualize how to do it. Could someone please offer some wisdom?

Re: Repair malformed XML
by Tanktalus (Canon) on Feb 03, 2005 at 15:33 UTC

    I know this isn't the solution to your problem per se (and I anxiously await any good solution since I could find many a use for it!), but I know that if my company were to have an XML output like this, we would have so many customers complaining ...

    I realise that MS's home user support is nearly non-existent. But if this is something you're doing at work (and I would bet that's the case), have you tried raising a stink internally to the point where your contact person with MS raises a stink with them? (If that's you, and if you're not a manager, I would get your manager's approval to go raise a stink - most people would be willing to let you, I would hope.) It's probably going to be cheaper than writing a fix-it tool.

Re: Repair malformed XML
by Anonymous Monk on Feb 03, 2005 at 15:46 UTC
    Well, reporting missing closing tags is trivial. For each type of element, count the number of opening tags and the number of closing tags. If they are equal, no closing tags are missing (assuming no opening tags are missing). Otherwise, the difference is the number of closing tags missing.
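
    The counting idea above can be sketched in a few lines of Perl. This is only an illustration: the helper names count_tags and missing_closers are made up for this example, the input is assumed to fit in memory as one string, and the regex assumes that no '<' or '>' characters hide inside CDATA sections or comments.

```perl
use strict;
use warnings;

# Count opening and closing tags per element name in a string of XML.
# Assumption: no '<' or '>' characters appear inside CDATA sections or
# comments (a real tokenizer would have to skip those first).
sub count_tags {
    my ($xml) = @_;
    my (%open, %close);
    while ( $xml =~ m{<(/?)([A-Za-z_][\w:.-]*)([^<>]*)>}g ) {
        my ($slash, $name, $rest) = ($1, $2, $3);
        next if $rest =~ m{/\s*$};    # self-closing tag such as <a:comment/>
        if ($slash) { $close{$name}++ } else { $open{$name}++ }
    }
    return (\%open, \%close);
}

# Return a hash of element name => number of missing closing tags.
sub missing_closers {
    my ($xml) = @_;
    my ($open, $close) = count_tags($xml);
    my %missing;
    for my $name (keys %$open) {
        my $diff = $open->{$name} - ( $close->{$name} || 0 );
        $missing{$name} = $diff if $diff > 0;
    }
    return %missing;
}

my $sample  = '<cc:files><o:version><a:owner>x</a:owner><a:empty/></cc:files>';
my %missing = missing_closers($sample);
print "$_: $missing{$_}\n" for sort keys %missing;
```

    The counts only say how many closing tags are missing per element, not where they belong.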

    As for repairing -- without a DTD, it's going to be heuristics. And I'm not going to suggest any heuristics based on a tiny sample (643 bytes out of 80 MB, about 0.00077%) of the file.

      If I reverse engineered a DTD, would my chances of earning my XML repair badge be better? What module is capable of validating against DTD to identify a dropped tag like this?
        If I reverse engineered a DTD, would my chances of earning my XML repair badge be better?
        Maybe. That will depend on the DTD. But how do you know that what you reverse engineer is correct? Or perhaps you reverse engineer a DTD (which may or may not be correct) that still does not allow unambiguous repairs. (That's not so far-fetched. Consider an HTML or XHTML document with some of the </EM> tags missing. It will not always be clear where to insert the missing tags, even if you assume they belong just before or after some other tag.)

        One disadvantage of attempting a repair without knowing how to recognize a correct document is that you may end up with a document that is well-formed, or that even conforms to the DTD you have or reverse engineered, and still not know whether you ended up with the right document.

        Consider a Perl program in which a quote is missing. You could write a "repair" program that notices a quote is missing and puts a quote back into the program. Now, if you just randomly inserted the quote in the program, you're likely to end up with a program that still doesn't compile. But for most programs that are missing a quote, there will be more than one place where the quote can be inserted and you still have a compilable program. Which one should your repair program pick? How does it know it's right?

      I agree that this is largely a guess, but there is one relatively simple heuristic that might actually help this case. Well-formed XML documents may nest tags, but can't have an inner tag close after the enclosing tag. For example:

      <document><text>Some text</text></document> <!-- Valid -->
      <document><text>Some text</document></text> <!-- INVALID -->

      So, an algorithm that makes sure nested tags are closed before the enclosing tags is a good step, and if the sample above is representative such a step will likely go a long way toward solving the problem.
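
      A minimal sketch of such a nesting check, assuming the input is a string without CDATA sections, comments, or processing instructions (the sub name first_nesting_error is invented for this example). It keeps a stack of open elements and reports the first closing tag that does not match the innermost one:

```perl
use strict;
use warnings;

# Report the first closing tag that does not match the innermost open element.
# Returns undef when every closing tag matches.
# Assumption: no CDATA sections, comments, or processing instructions.
sub first_nesting_error {
    my ($xml) = @_;
    my @stack;
    while ( $xml =~ m{<(/?)([A-Za-z_][\w:.-]*)[^<>]*?(/?)>}g ) {
        my ($closing, $name, $empty) = ($1, $2, $3);
        next if $empty;                        # <tag/> opens and closes itself
        if ( !$closing ) { push @stack, $name; next; }
        my $expected = pop @stack;
        if ( !defined $expected or $expected ne $name ) {
            my $want = defined $expected ? "</$expected>" : "no closing tag";
            return "found </$name> but expected $want";
        }
    }
    return;
}

for my $doc ( '<document><text>Some text</text></document>',
              '<document><text>Some text</document></text>' ) {
    my $problem = first_nesting_error($doc);
    print defined $problem ? "INVALID: $problem\n" : "nesting looks fine\n";
}
```

      This only detects the problem; deciding where the missing tag should go is a separate question.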

      Anima Legato
      .oO all things connect through the motion of the mind

        Yeah, but with that heuristic, one could immediately close any open tag that doesn't have a corresponding closing tag (and hence promote it to an empty element). Or, by the same token, simply remove opening tags that don't have a corresponding closing tag (eliminating the element). Or you keep a stack of elements (push on open; pop on close), and if you encounter a closing tag that doesn't belong to the element on top of your stack, keep popping and closing until you find the correct one (implicitly closing elements, like HTML's P, LI and TD elements).

        Any one could be right. Or wrong. Or right sometimes, and wrong at other times. You end up with a document that is "well-formed". It may be correct, but it may not. You don't know. If you leave the document unmodified, any parser will tell you it's incorrect. That might even be a better situation.

Re: Repair malformed XML
by rg0now (Chaplain) on Feb 03, 2005 at 16:32 UTC
    I would definitely give XML::LibXML a try. It has a nice command line tool, xmllint, which can work wonders if used correctly. On the other hand, if you want Perl, you should experiment with setting the recover flag of the XML::LibXML::Parser object to true. Although the manual states that it is for parsing HTML, as far as I can tell it serves for parsing ill-formed XML just as well.

    The quick and dirty hack below could repair your badly formatted XML snippet (after adding the missing namespace declarations):

    use XML::LibXML;

    my $parser = XML::LibXML->new();
    $parser->recover(1);
    my $doc = $parser->parse_file($ARGV[0]);
    print $doc->toString(1);

    Note, however, that I am not entirely sure that it always guesses right when adding the remaining closing tags back, so I would not rely on this feature...

    rg0now

      I was unaware of the recover property. Your code example worked great on a test xml with a missing tag.

      However, it appears I've reached a size limitation on the LibXML library. Both xmllint and the Perl code indicate problems parsing corrupt data:

      my.xml:85: parser error : expected '>'
      tentclasses>True</s:closedexpect"???O??,???"?,???O??,?O?O?"?O?

      I've toyed with XML::Parser some more. I've given it simple handlers to print the tags that are parsed, but XML::Parser croaks when it detects the missing tag, without first allowing a handler to override it.

      Is there something I'm missing?

        You don't say what version of perl you're using. My first attempt to use XML::Twig was with perl 5.6, and it died a horrible death ... simply upgrading perl to 5.8.1 was sufficient to handle the reading/writing of XML that I was doing with no other changes (same level of XML::Twig, my code unchanged). If you're not using 5.8 for XML handling, I highly suggest it.

        I am a little lost here. You told us that the only problem with your XML is that it has some unclosed tags. XML::LibXML::Parser's recover flag will handle that, as the manual says:

        "The recover mode helps to recover documents that are almost wellformed very efficiently. That is for example a document that forgets to close the document tag (or any other tag inside the document)."

        Now you seem to indicate that some tags in your XML are corrupt. Well, I do not really know how to handle that one...

        Also, I do not think that you hit some obscure size limitations of XML::LibXML (you seem to get the error at the 85th input line).

      I stand corrected about the size limitation. Upon further testing, it is not the size, but the encoding. The XML file is unicode with encoding="iso-10646-ucs-2". If I convert to ASCII and set encoding="UTF-8", LibXML parses it fine.
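
      A rough sketch of the conversion step using the core Encode module. It converts to UTF-8 rather than plain ASCII, and it assumes little-endian UCS-2/UTF-16 input unless told otherwise; the byte order of the real file is not stated above, so treat that default as a guess. The sub name ucs2_to_utf8 is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert raw UCS-2/UTF-16 bytes to UTF-8 bytes and update the XML declaration.
# Assumption: the data is little-endian unless another encoding name is given
# ('UTF-16BE' for big-endian, or 'UTF-16' to let a byte-order mark decide).
sub ucs2_to_utf8 {
    my ($bytes, $source_encoding) = @_;
    $source_encoding ||= 'UTF-16LE';
    my $text = decode($source_encoding, $bytes);
    $text =~ s/^\x{FEFF}//;    # drop a leading byte-order mark, if any
    # Rewrite encoding="..." in the XML declaration to match the new bytes.
    $text =~ s/(<\?xml[^>]*encoding=)(["'])[^"']*\2/${1}${2}UTF-8${2}/;
    return encode('UTF-8', $text);
}

my $ucs2 = encode('UTF-16LE',
    qq{\x{FEFF}<?xml version="1.0" encoding="iso-10646-ucs-2"?><a>ok</a>});
print ucs2_to_utf8($ucs2), "\n";
```

      For an 80MB file it is probably more practical to stream through PerlIO layers, for example opening the input with '<:encoding(UTF-16)' and the output with '>:encoding(UTF-8)', instead of decoding one big string.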

      Unfortunately, the output of the above script becomes mangled after a few thousand lines. It begins to output only the Text objects, with no tags, CDATA sections, etc. Strange.

      While I haven't discovered a generalized and automated method that works, I've managed to get by with a simple procedural rule of inserting </o:version> tags before </cc:files> if not already present. Then I convert back to Unicode and the XML can be parsed.
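
      The procedural rule might look roughly like the sketch below. It is not the exact code that was used: the sub name close_versions is invented, the whole document is assumed to be in one (already converted) string, and the tag names are assumed never to appear as literal text inside CDATA sections.

```perl
use strict;
use warnings;

# For every <cc:files> ... </cc:files> block, add as many </o:version> tags
# before </cc:files> as are needed to balance the <o:version> openers.
# Assumption: the tag names never occur as text inside CDATA sections.
sub close_versions {
    my ($xml) = @_;
    $xml =~ s{(<cc:files>.*?)(</cc:files>)}{
        my ($body, $end) = ($1, $2);
        my $opens   = () = $body =~ /<o:version>/g;
        my $closes  = () = $body =~ /<\/o:version>/g;
        my $missing = $opens > $closes ? $opens - $closes : 0;
        $body . ( "</o:version>\n" x $missing ) . $end;
    }gse;
    return $xml;
}

my $broken = "<cc:files>\n<o:version>\n<a:owner>x</a:owner>\n</cc:files>\n";
print close_versions($broken);
```

      Blocks that already have matching </o:version> tags pass through unchanged.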

Re: Repair malformed XML
by halley (Prior) on Feb 03, 2005 at 17:02 UTC
    Like you, I don't have the DTD, but I'm not sure of your assumption. I think the indentation gives you the sense that <o:version> should span until just before </cc:files>, but looking at how other o: tags work, I think this would be the proper closing:
    <cc:files>
      <o:destination><![CDATA[/Documents/some file.pdf]]></o:destination>
      <y:name><![CDATA[some file.pdf]]></y:name>
      <r:DenyAccess dt="mv.string"></r:DenyAccess>
      <o:version>
        <d:contentclass><![CDATA[urn:content-classes:baseddocument]]></d:contentclass>
      </o:version> <!-- <<<<<<<<<<<<< >
      <o:source><![CDATA[\sources\some file.pdf\some file.pdf]]></o:source>
      <a:FriendlyVersionID>1.0</a:FriendlyVersionID>
      <a:owner><![CDATA[SERVER\iusr_server]]></a:owner>
      <a:CreatedTimeStamp>2/1/2005 6:41:30 PM</a:CreatedTimeStamp>
      <a:DocumentState>approved</a:DocumentState>
      <a:IsCurrentVersion>False</a:IsCurrentVersion>
    </cc:files>

    --
    [ e d @ h a l l e y . c c ]

      Yes, and I believe I was not clear on all the points that led me to believe how to repair the XML. Here is one of the well-formed cc:files tags also found in the same document:

      <cc:files>
        <o:destination><![CDATA[/Documents/some file1.pdf]]></o:destination>
        <y:name><![CDATA[some file1.pdf]]></y:name>
        <r:DenyAccess dt="mv.string"></r:DenyAccess>
        <o:version>
          <d:contentclass><![CDATA[urn:content-classes:Cancelled]]></d:contentclass>
          <o:source><![CDATA[\sources\some file1.pdf\some file1(1.0).pdf]]></o:source>
          <a:FriendlyVersionID>1.0</a:FriendlyVersionID>
          <a:owner><![CDATA[SERVER\autostore]]></a:owner>
          <a:CreatedTimeStamp>1/21/2005 6:59:15 PM</a:CreatedTimeStamp>
          <a:DocumentState>approved</a:DocumentState>
          <a:IsCurrentVersion>False</a:IsCurrentVersion>
          <o:Title><![CDATA[ST. Cloud CC]]></o:Title>
          <a:bestbetcategories/>
          <a:bestbetkeywords/>
          <a:comment><![CDATA[]]></a:comment>
          <a:checkoutcomment><![CDATA[]]></a:checkoutcomment>
        </o:version>
        <o:version>
          <d:contentclass><![CDATA[urn:content-classes:basedocument]]></d:contentclass>
          <o:source><![CDATA[\sources\some file2.pdf\some file2(2.0).pdf]]></o:source>
          <a:FriendlyVersionID>2.0</a:FriendlyVersionID>
          <a:owner><![CDATA[SERVER\autostore]]></a:owner>
          <a:CreatedTimeStamp>1/21/2005 6:59:15 PM</a:CreatedTimeStamp>
          <a:DocumentState>approved</a:DocumentState>
          <a:IsCurrentVersion>False</a:IsCurrentVersion>
          <o:Author><![CDATA[Finance]]></o:Author>
          <o:Title><![CDATA[test]]></o:Title>
          <a:Categories><![CDATA[]]></a:Categories>
          <o:Description><![CDATA[ck]]></o:Description>
          <o:Keywords><![CDATA[ap;]]></o:Keywords>
          <a:bestbetcategories/>
          <a:bestbetkeywords/>
          <a:comment><![CDATA[]]></a:comment>
          <a:checkoutcomment><![CDATA[]]></a:checkoutcomment>
        </o:version>
        <o:version>
          <d:contentclass><![CDATA[urn:content-classes:basedocument]]></d:contentclass>
          <o:source><![CDATA[\sources\some file3.pdf\some file3(2.1).pdf]]></o:source>
          <a:FriendlyVersionID>2.1</a:FriendlyVersionID>
          <a:owner><![CDATA[SERVER\autostore]]></a:owner>
          <a:CreatedTimeStamp>1/21/2005 6:59:15 PM</a:CreatedTimeStamp>
          <a:DocumentState>createdcheckedin</a:DocumentState>
          <a:IsCurrentVersion>True</a:IsCurrentVersion>
          <o:Author><![CDATA[ap]]></o:Author>
          <o:Title><![CDATA[test]]></o:Title>
          <a:Categories><![CDATA[]]></a:Categories>
          <o:Description><![CDATA[ck]]></o:Description>
          <o:Keywords><![CDATA[ap;]]></o:Keywords>
          <a:bestbetcategories/>
          <a:bestbetkeywords/>
          <a:comment><![CDATA[]]></a:comment>
          <a:checkoutcomment><![CDATA[]]></a:checkoutcomment>
        </o:version>
      </cc:files>

Re: Repair malformed XML
by mirod (Canon) on Feb 03, 2005 at 17:27 UTC

    XML::LibXML is probably the way to go here, but here is an attempt using XML::Parser. The idea is just to automate the cycle "run parser - see it die - fix error" until the document passes. So the code runs XML::Parser, traps the error message, fixes the original document and retries, until no error message is found or the last error message is repeated, in which case it came across an error that it could not fix. This is probably too slow to process an 80MB file missing a lot of tags, but it is correct, as in "no XML weirdness is going to trip it", and could be extended to fix other types of errors.

    #!/usr/bin/perl -w
    use strict;
    use XML::Parser;

    my $file  = 'crap.xml';
    my $fixes = 0;

    my @tags; # stack of tags used to figure out the last non closed tag

    my $p = XML::Parser->new( Handlers     => { Start => sub { push @tags, $_[1]; },
                                                End   => sub { pop @tags; },
                                              },
                              ErrorContext => 1,
                            );

    my( $error, $last_error);

    do {
        $last_error = $error || '';
        undef $@;
        eval { $p->parsefile( $file); };
        $error = $@;   # remember the message so that a repeated error ends the loop
        #warn "error: $error => close $tags[-1]\n" if( $error && ($error ne $last_error));
        if( $error =~ m{^\s*mismatched tag at line (\d+), column (\d+)}) {
            close_tag( $file, $tags[-1], $1, $2);
            $fixes++;
        }
        # you could add other types of fixes below
    } until( !$error || ($error eq $last_error));

    if( $error) { print "could not fix the file: $error\n"; }
    else        { print "success! ($fixes tags fixed)\n"; }

    sub close_tag {
        my( $file, $tag, $line, $column) = @_;

        my $temp = "crap.new";
        open( my $in,  '<', $file) or die "cannot open file (r) '$file': $!\n";
        open( my $out, '>', $temp) or die "cannot open file (w) '$temp': $!\n";

        # print the beginning of the file (untouched)
        for (1..$line-1) { print {$out} scalar <$in>; }

        # close the tag
        my $faulty_line = <$in>;
        # the reported column seems to be off by 3, but I suspect this might
        # vary depending on the xml prefix, so this looks safer
        my $real_column = rindex( $faulty_line, '<', $column) - 1;
        substr( $faulty_line, $real_column, 0) = "</$tag>\n";
        print {$out} $faulty_line;

        # finish printing
        while( <$in>) { print {$out} $_; }

        close $in;
        close $out;

        rename $temp, $file
            or die "cannot replace file '$file' by new version in '$temp'";
    }

Re: Repair malformed XML
by graff (Chancellor) on Feb 04, 2005 at 03:23 UTC
    Given the evidence you've shown (some examples where the markup comes out right as well as a case where it's wrong because of a missing close tag), I think there's sufficient cause to assume that, even without an "official" DTD, you can figure out where to put the close tag. (The nature of the XML generation bug appears to be constrained in a way that allows a "heuristic, speculative" solution to do the right thing.)

    So here's how I would do it: read through the file one tag at a time (set the input-record-separator to ">"), maintain a stack of the open tags as they occur, pop them off the stack as their corresponding close-tags appear, and fill in missing close-tags where necessary.

    (This way of reading might be noticeably inefficient on large data sets, and there are ways to chunk through the data with fewer iterations on the diamond operator; but see if it works well enough before thinking about optimizing it.)

    #!/usr/bin/perl
    use strict;
    use warnings;

    $/ = '>';  # read up to the end of one tag at a time
    my $lineno = 0;
    my @tagstack;

    while (<>) {
        $lineno += tr/\n//;
        unless ( />$/ ) {  # must be at eof;
            print;
            next;
        }
        my ( $pre, $tag ) = ( /([^<]*)(<[^>]+?)>/ );
        if ( !defined( $tag )) {
            warn "'>' without prior '<' at or near line $lineno\n";
            print;
        }
        elsif ( $tag =~ m{^</} ) {  # close tag: look for its open tag on the stack
            unless ( @tagstack ) {
                warn "extra close tag '$tag' at or near line $lineno\n";
                print;
                next;
            }
            $tag = substr $tag, 2;
            my $stackindx = $#tagstack;
            while ( $stackindx >= 0 and $tagstack[$stackindx] ne $tag ) {
                $stackindx--;
            }
            if ( $stackindx < 0 ) {
                warn "close tag '$tag' lacks open tag at or near line $lineno\n";
                print;
                next;
            }
            print $pre;
            if ( $stackindx != $#tagstack ) {  # add close tags as needed
                while ( $stackindx < $#tagstack ) {
                    warn "added '</$tagstack[$#tagstack]>' at line $lineno\n";
                    printf "</%s>\n", pop @tagstack;
                    $lineno++;
                }
            }
            print "</$tag>";
            pop @tagstack;
        }
        elsif ( $tag =~ m{^<!} or $tag =~ m{/$} ) {  # "comment" or empty tag
            print;
        }
        else {  # this must be an open tag -- push it on the stack
            $tag =~ s/<(\S+).*/$1/s;
            push @tagstack, $tag;
            print;
        }
    }

    I tested this on the examples you gave -- the good ones came out unaltered, and the bad one had the close-tag added where needed.

      So really you want to write a quasi-XML parser. The problem is that it doesn't parse enough of XML: if you look at the data, you will see a lot of CDATA sections. This means that when you use > as the input record separator, you are likely to hit one in the middle of a CDATA section if you come across a filename that includes a '>'. A filename like /Documents/some file><.pdf will trip your code.

      So if you want your hand-rolled parser to really work you will have to take into account that case. This can be done of course, you will have to take the string you have read, remove complete CDATA sections from it, and then figure out whether you are still in a CDATA section.
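
      One possible way to handle that case is sketched below. The sub names strip_cdata and read_record are invented for this example, and only CDATA sections are handled, not comments or processing instructions. The idea is to remove complete CDATA sections from what has been read so far, and to keep reading while an unterminated one remains:

```perl
use strict;
use warnings;

# Remove complete CDATA sections from a chunk and report whether the chunk
# still ends inside an unterminated one.
sub strip_cdata {
    my ($chunk) = @_;
    $chunk =~ s/<!\[CDATA\[.*?\]\]>//gs;             # drop complete sections
    my $inside = $chunk =~ /<!\[CDATA\[/ ? 1 : 0;    # opener left without its ]]>
    return ($chunk, $inside);
}

# Read one '>'-terminated record, extending it while it ends inside CDATA.
sub read_record {
    my ($fh) = @_;
    local $/ = '>';
    my $record = <$fh>;
    return unless defined $record;
    while ( (strip_cdata($record))[1] ) {
        my $more = <$fh>;
        last unless defined $more;    # unterminated CDATA at end of input
        $record .= $more;
    }
    return $record;
}

my $xml = '<y:name><![CDATA[some file><.pdf]]></y:name>';
open( my $fh, '<', \$xml ) or die "cannot open in-memory file: $!\n";
while ( defined( my $record = read_record($fh) ) ) {
    print "[$record]\n";
}
```

      With this reader, the '>' inside the filename no longer splits the CDATA section into two records.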

      My point is that it is not easy to deal with even that rather simple case. You end up having to write something that is closer to a real XML parser. Actually something more tricky than a real XML parser, as the XML spec clearly states that parsers can die after they find any error in the XML. So you are now trying to write a recovering XML parser... or you could just use libxml's one, I am sure Daniel Veillard has spent more time working on this than any one here would ;--)

        No, he wants to write an XML tokenizer. Which would do the trick - that is, it will implement his algorithm. (An algorithm of which no guarantees can be made to be correct.)
      I don't think your algorithm works. Yes, it will create a well-formed XML document, but that's not the same as repairing the document. Consider the following piece of (X)HTML:
      <P> foo <SPAN> bar baz <EM> qux </EM> <EM> quux </EM> </P>
      The </SPAN> tag is missing. Your algorithm will place it right in front of the </P>. It will repair the document to well-formedness (and in the case of (X)HTML, even to a valid document). But you don't know whether the </SPAN> really belongs there. Perhaps only the 'bar' was supposed to be inside the SPAN. Or maybe the first, but not the second, EM element belonged. Or perhaps it was a special DTD, that doesn't allow EM to appear inside SPAN. Then placing </SPAN> before </P> would be very wrong.

      If you have no way of verifying the result is correct - heck, you can't even verify whether the resulting document is syntactically valid - I'd advise you to leave the document as is. Then even the most basic check (for well-formedness) will flag the document as incorrect. Otherwise, you end up with a document that appears to be correct, but you've no way of knowing. Of course, that raises the question: if you don't have the DTD, how useful is the document, and why is it being considered for "repair"?
