This is not an easy problem. It's quite easy to get partial solutions, and really hard to get a perfect one. The good news is that as long as you are conservative in what you fix, the XML parser will tell you about what you missed, and no one will be hurt in the process ;--)
Also I would assume that what you get is not pathological, designed to trip the parser, but more like "XML by dummies", written by people who don't know the spec, or what a parser is. So probably no CDATA sections, no comments, no '>' in attribute values.
My first attempt would look like this:
If we find 2 successive '>' without a '<' in between, then the second '>' should be turned into an entity, &gt; (the first one closes a tag, but not the second one). Same with 2 successive '<' without a '>' in between: the first '<' is not part of the markup (the second one opens a tag, but not the first one), so it becomes &lt;. For &, if it doesn't look like an entity, &name; or &#...;, then turn it into &amp;
#!/usr/bin/perl
use strict;
use warnings;
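while( <DATA>)
  { # repeat until nothing changes: a single s///g pass skips a stray
    # '>' or '<' that sits inside the text consumed by an earlier match
    1 while s{>([^<]*)>}{>$1&gt;}g;   # 2nd '>' with no '<' in between is text
    1 while s{<([^>]*)<}{&lt;$1<}g;   # 1st '<' with no '>' in between is text
    s{&(?!\w+;|#)}{&amp;}g;           # '&' that doesn't start an entity
    print;
  }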
__DATA__
<doc><data>></data><data>if( 1 < 2 && 2 < 3)</data></doc>
This doesn't catch the case of a < / > pair that's not part of a tag, as in 'if( $a<$b || $a > $c)'. You can improve this by handling the <'s separately first. They are easier than the >'s: a < that is not followed by /?\w+ can't be markup (once again a simplification, as the first character of a tag name can't be a digit).
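As a rough sketch of that "< first" pass (the sub name escape_stray_lt is made up for this example, and it assumes ASCII tag names and no comments, CDATA sections, processing instructions or DOCTYPE in the input), something like this might do:

```perl
use strict;
use warnings;

# Hypothetical helper: escape every '<' that cannot start a tag.
# Simplification: a tag starts with '<' or '</' followed by an ASCII
# letter, '_' or ':'. Anything else after '<' means it is plain text.
sub escape_stray_lt {
    my ($text) = @_;
    $text =~ s{<(?!/?[A-Za-z_:])}{&lt;}g;
    return $text;
}

print escape_stray_lt('<data>if( $a<$b || $a > $c)</data>'), "\n";
```

Once the stray <'s are gone, the 'if( $a<$b || $a > $c)' case no longer has a '<' between the '>' that closes <data> and the '>' in the text, so the "2 successive '>'" rule from the first attempt should then catch it.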
Also, some constructs look like entities but are not, like '&#foo', so you could improve the regexp there too. But we are getting to the limits of what's reasonable here.
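For what it's worth, a stricter version of the & test could look like the sketch below. The sub name escape_stray_amp is invented for the example, and the pattern is an ASCII-only simplification of the real grammar: it accepts &name;, &#123; and &#x7B; and escapes everything else.

```perl
use strict;
use warnings;

# Hypothetical helper: escape every '&' that does not start a complete
# reference. Accepted forms (ASCII-only simplification):
#   &name;   named entity (name starts with a letter, '_' or ':')
#   &#123;   decimal character reference
#   &#x7B;   hexadecimal character reference
sub escape_stray_amp {
    my ($text) = @_;
    $text =~ s{&(?!(?:[A-Za-z_:][\w.:-]*|#[0-9]+|#x[0-9A-Fa-f]+);)}{&amp;}g;
    return $text;
}

print escape_stray_amp('a && b, &#foo; but &amp; and &#38; stay'), "\n";
```

Note that this only checks the shape of the reference, not whether a named entity is actually declared. XML itself predefines only lt, gt, amp, apos and quot, so something like &nbsp; would pass here and still be rejected by the parser unless a DTD declares it.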
It all depends on what you want: to limit the number of cases where you have to fix the data manually, or to never encounter any well-formedness error.
Edited: improved explanations (hopefully!)