PerlMonks  

Re^2: Repair malformed XML

by mirod (Canon)
on Feb 04, 2005 at 09:11 UTC ( [id://427993]=note )


in reply to Re: Repair malformed XML
in thread Repair malformed XML

So really you want to write a quasi-XML parser. The problem is that it doesn't parse enough of XML: if you look at the data, you will see a lot of CDATA sections. This means that when you use > as the input record separator, you are likely to hit one in the middle of a CDATA section if you come across a filename that includes a '>'. A filename like /Documents/some file><.pdf will trip your code.
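To make the failure concrete, here is a minimal sketch in Python (not the OP's Perl) of what reading with > as the record separator does to such a filename. The sample string and the helper name split_on_gt are made up for illustration; they are not taken from the original data.

```python
# Hypothetical sample: the CDATA payload contains both '>' and '<'.
sample = "<file><name><![CDATA[/Documents/some file><.pdf]]></name></file>"


def split_on_gt(text):
    """Mimic reading with '>' as the record separator: every chunk ends at a '>'."""
    chunks = []
    start = 0
    while True:
        end = text.find(">", start)
        if end == -1:
            # Trailing data with no closing '>' becomes a final chunk.
            if start < len(text):
                chunks.append(text[start:])
            break
        chunks.append(text[start:end + 1])
        start = end + 1
    return chunks


chunks = split_on_gt(sample)
for chunk in chunks:
    print(repr(chunk))
```

The third chunk starts a CDATA section that it never closes, and the fourth chunk looks like a tag but is really the tail of the filename.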

So if you want your hand-rolled parser to really work, you will have to take that case into account. This can be done, of course: you will have to take the string you have read, remove the complete CDATA sections from it, and then figure out whether you are still inside a CDATA section.
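The check described above could be sketched roughly like this, again in Python rather than Perl. The names in_open_cdata and read_records and the sample string are hypothetical, and the sketch only covers CDATA: a '>' inside a comment, a processing instruction or an attribute value would still break it.

```python
import re

# A complete CDATA section: CDATA does not nest, so the first "]]>" closes it.
CDATA_COMPLETE = re.compile(r"<!\[CDATA\[.*?\]\]>", re.DOTALL)


def in_open_cdata(buffer):
    """True if, once complete CDATA sections are removed, an opener is left over."""
    stripped = CDATA_COMPLETE.sub("", buffer)
    return "<![CDATA[" in stripped


def read_records(text):
    """Split on '>' but keep accumulating while the buffer ends inside CDATA."""
    records = []
    buffer = ""
    # Each piece ends at a '>' (or is the trailing remainder of the text).
    for piece in re.findall(r"[^>]*>|[^>]+$", text):
        buffer += piece
        if in_open_cdata(buffer):
            continue  # the '>' we stopped at was inside CDATA, keep reading
        records.append(buffer)
        buffer = ""
    if buffer:
        records.append(buffer)  # unterminated CDATA at end of input
    return records


# Hypothetical sample, not taken from the original data.
sample = "<file><name><![CDATA[/Documents/some file><.pdf]]></name></file>"
records = read_records(sample)
```

With this, the whole CDATA section comes back as one record instead of being cut at the '>' in the filename.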

My point is that it is not easy to deal with even that rather simple case. You end up having to write something that is closer to a real XML parser. Actually, something trickier than a real XML parser, as the XML spec clearly states that parsers can die as soon as they find any error in the XML. So you are now trying to write a recovering XML parser... or you could just use libxml's. I am sure Daniel Veillard has spent more time working on this than anyone here would ;--)

Replies are listed 'Best First'.
Re^3: Repair malformed XML
by Anonymous Monk on Feb 04, 2005 at 11:27 UTC
    No, he wants to write an XML tokenizer. Which would do the trick, that is, it would implement his algorithm. (An algorithm that cannot be guaranteed to be correct.)
