Re^3: Repair malformed XML

Yeah, but with that heuristics, one could immediately close any open tag that doesn't have a corresponding opening tag (and hence promoting them to empty elements). Or, by the same token, simply remove openings tag that don't have a corresponding closing tag (eliminating the element). Or you keep a stack of elements (push on open; pop on close), and if you encounter a closing tag that doesn't belong to the element on top of your stack, keep popping and closing till you find a correct one (implicite closing elements, like HTML's P, LI and TD elements).

Any one could be right. Or wrong. Or right sometimes, and wrong at other times. You end up with a document that is "well-formed". It may be correct, but it may not. You don't know. If you leave the document unmodified, any parser will tell you it's incorrect. That might even be a better situation.

Comment on Re^3: Repair malformed XML

Replies are listed 'Best First'.
Re^4: Repair malformed XML by Tanktalus (Canon) on Feb 03, 2005 at 17:22 UTC
Not quite. If the DTD tells you, for example, that element a may contain elements b, c, or d, and that b can contain e and f, then if it looks like element a contains one b, and two e's, you can be pretty sure that the b was close improperly (if at all), and the e's should be in b. There are still many possibilities for confusion. But a heuristic that started with the DTD could do quite a good job. I'm not going to pretend it would be easy and/or fun ... but in theory the information may be there that could do a good job - and, if the DTD does not allow overlaps (such as a and b both allowing d's, so that the d can either be a child or a grandchild of a), you may even be able to do a perfect job.	[reply]
Re^5: Repair malformed XML by Anonymous Monk on Feb 03, 2005 at 17:38 UTC
Ah, I see. You mean (X)HTML, and 'a' is `P`, 'b' is `SPAN`, 'c' is `CODE`, 'd' is `SMALL`, 'e' is `EM` and 'f' is `STRONG`. So, now you encounter: `<P>foo <SPAN> bar baz <EM> qux </EM> <EM> quux </EM> </P>` [download] Now, assuming tags aren't to be inserted inside words, I still can find five places to put in `</SPAN>`: before 'bar', between 'bar' and 'baz', after 'baz', between the two `EM` elements, and before the `</P>` tag. Now, if you have a DTD that says that the only possible content of a 'b' is exactly two 'e's, you know where the missing closing tag should have been. Note also that if you have a DTD where you can always unambigiously deduce where a missing closing tag should have gone, the closing tag is redundant - and if it were an SGML DTD instead of an XML DTD, the closing tag would have been optional. (And that would have solved the problem instantly - the document would be conforming).	[reply] [d/l]


We don't bite newbies here... much
	PerlMonks