Re^2: Repair malformed XML

So really you want to write a quasi-XML parser. The problem is that it doesn't parse enough of XML: if you look at the data, you will see a lot of CDATA sections. This means that when you use > as the input record separator, you are likely to hit one in the middle of a CDATA section as soon as you come across a filename that includes a '>'. A filename like /Documents/some file><.pdf will trip your code.

So if you want your hand-rolled parser to really work, you will have to take that case into account. It can be done, of course: take the string you have read so far, remove the complete CDATA sections from it, and then check whether you are still inside a CDATA section.
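To make that concrete, here is a minimal sketch of the check, assuming the input is read with > as the record separator. The helper names in_open_cdata and read_tokens are made up for illustration, and the sample string is built around the filename above; this is not the original poster's code. It only looks at CDATA, so a comment or processing instruction containing a '>' or the text <![CDATA[ would still fool it.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $text ends inside an unterminated CDATA section.
sub in_open_cdata {
    my ($text) = @_;
    # Drop every complete CDATA section first.
    $text =~ s/<!\[CDATA\[.*?\]\]>//gs;
    # Any CDATA opener still left has not been closed yet.
    return index($text, '<![CDATA[') >= 0 ? 1 : 0;
}

# Hypothetical helper: read '>'-terminated records from a filehandle and
# glue together the records that were split in the middle of a CDATA section.
sub read_tokens {
    my ($fh) = @_;
    local $/ = '>';
    my @tokens;
    my $buffer = '';
    while (defined(my $chunk = <$fh>)) {
        $buffer .= $chunk;
        next if in_open_cdata($buffer);    # keep reading until the CDATA closes
        push @tokens, $buffer;
        $buffer = '';
    }
    # Trailing text, or a CDATA section that never closed.
    push @tokens, $buffer if length $buffer;
    return @tokens;
}

# Sample data (an assumption, not taken from the original file).
my $xml = '<file><name><![CDATA[/Documents/some file><.pdf]]></name></file>';
open my $fh, '<', \$xml or die "cannot open in-memory handle: $!";
my @tokens = read_tokens($fh);
close $fh;
print "$_\n" for @tokens;
```

With the sample string, the CDATA section comes back as a single token instead of being cut at the '>' inside the filename. Note that the buffer is rescanned from the start on every record, which is fine for a sketch but wasteful on a file with very long CDATA sections.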

My point is that it is not easy to deal with even that rather simple case. You end up having to write something that is closer to a real XML parser. Actually something trickier than a real XML parser, as the XML spec clearly states that a parser is supposed to stop as soon as it finds a well-formedness error. So you are now trying to write a recovering XML parser... or you could just use libxml's one. I am sure Daniel Veillard has spent more time working on this than anyone here would ;--)


Re^3: Repair malformed XML by Anonymous Monk on Feb 04, 2005 at 11:27 UTC
No, he wants to write an XML tokenizer. Which would do the trick: that is, it would implement his algorithm. (An algorithm which cannot be guaranteed to be correct.)

