Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Re^3: Repair malformed XML

by Anonymous Monk
on Feb 03, 2005 at 16:49 UTC ( [id://427705]=note: print w/replies, xml ) Need Help??


in reply to Re^2: Repair malformed XML
in thread Repair malformed XML

Yeah, but with that heuristics, one could immediately close any open tag that doesn't have a corresponding opening tag (and hence promoting them to empty elements). Or, by the same token, simply remove openings tag that don't have a corresponding closing tag (eliminating the element). Or you keep a stack of elements (push on open; pop on close), and if you encounter a closing tag that doesn't belong to the element on top of your stack, keep popping and closing till you find a correct one (implicite closing elements, like HTML's P, LI and TD elements).

Any one could be right. Or wrong. Or right sometimes, and wrong at other times. You end up with a document that is "well-formed". It may be correct, but it may not. You don't know. If you leave the document unmodified, any parser will tell you it's incorrect. That might even be a better situation.

Replies are listed 'Best First'.
Re^4: Repair malformed XML
by Tanktalus (Canon) on Feb 03, 2005 at 17:22 UTC

    Not quite. If the DTD tells you, for example, that element a may contain elements b, c, or d, and that b can contain e and f, then if it looks like element a contains one b, and two e's, you can be pretty sure that the b was close improperly (if at all), and the e's should be in b.

    There are still many possibilities for confusion. But a heuristic that started with the DTD could do quite a good job. I'm not going to pretend it would be easy and/or fun ... but in theory the information may be there that could do a good job - and, if the DTD does not allow overlaps (such as a and b both allowing d's, so that the d can either be a child or a grandchild of a), you may even be able to do a perfect job.

      Ah, I see. You mean (X)HTML, and 'a' is P, 'b' is SPAN, 'c' is CODE, 'd' is SMALL, 'e' is EM and 'f' is STRONG. So, now you encounter:
      <P>foo <SPAN> bar baz <EM> qux </EM> <EM> quux </EM> </P>
      Now, assuming tags aren't to be inserted inside words, I still can find five places to put in </SPAN>: before 'bar', between 'bar' and 'baz', after 'baz', between the two EM elements, and before the </P> tag.

      Now, if you have a DTD that says that the only possible content of a 'b' is exactly two 'e's, you know where the missing closing tag should have been.

      Note also that if you have a DTD where you can always unambigiously deduce where a missing closing tag should have gone, the closing tag is redundant - and if it were an SGML DTD instead of an XML DTD, the closing tag would have been optional. (And that would have solved the problem instantly - the document would be conforming).

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://427705]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others contemplating the Monastery: (3)
As of 2024-04-24 04:55 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found