Re: Ö, Ä, Ü and XML::Simple
by ChemBoy (Priest) on May 03, 2002 at 07:21 UTC
I've dealt with various aspects of this problem at different times, so let me take a stab here...
The first option that comes to mind is this: if XML::Simple can handle your character set, and your character set is an acceptable one for web browsers (such as ISO-Latin-1), why not just use the raw characters? Most browsers that can display the characters correctly at all can handle that character set, as far as I know.
However, I'll try to answer the opposite question as well (can't hurt, and might just be helpful).
The problem you're having is that XML::Simple does not recognise the entities you're passing it in your XML source. This is entirely appropriate--as far as I know, XML::Simple only understands the basic XML entities, of which (again, as far as I know) there are very few: only &amp;amp;, &amp;lt; and &amp;gt; (&, < and >) spring to mind. Therefore, when it encounters something like &amp;auml;, which is unquestionably an entity but not one it's familiar with, it does what every good XML parser does when it finds something unexpected: it dies.
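To see the failure in isolation, here is a minimal sketch. It assumes XML::Simple is installed with a conforming parser backend such as XML::Parser; the element names and sample text are made up for illustration, and the exact error message depends on the backend.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# &auml; is an HTML entity; it is not declared anywhere in this XML document.
my $xml = '<doc><name>K&auml;se</name></doc>';

# XMLin dies on a parse error, so trap it with eval.
my $data;
my $ok  = eval { $data = XMLin($xml); 1 };
my $err = $ok ? '' : $@;

print $ok ? "parsed fine\n" : "parse failed: $err\n";
```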
The obvious solution to this is to tell the parser to recognize your entities, but there are two objections:
Why this last? Well, by the time XML::Simple spits out your parsed data, it has already translated the entities in its input to the corresponding character data (much as the web browser will with the HTML entities). Which leaves us right where we started, really--if you can handle outputting ü to the browser, then just put it in your XML source to begin with.
However, this suggests the solution that I personally have used for this problem the few times I've encountered it: double-escape the data going into your XML source. That is, if you want to parse your XML and have the result contain the string "&amp;eacute;", arrange for your XML source file to contain the string "&amp;amp;eacute;". The alternative is to enclose the relevant sections in CDATA sections, which is acceptable for some things (including wholesale HTML markup in XML files) but generally overkill, in my opinion.
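A small round trip makes the double-escaping concrete. This is only a sketch, assuming XML::Simple with its default options; the document structure and variable names are invented for the example.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# Double-escaped: the parser turns &amp; back into &, leaving an HTML entity.
my $escaped = XMLin('<doc><name>caf&amp;eacute;</name></doc>');

# CDATA alternative: the parser passes the section through untouched.
my $cdata = XMLin('<doc><name><![CDATA[caf&eacute;]]></name></doc>');

print "$escaped->{name}\n";   # caf&eacute;
print "$cdata->{name}\n";     # caf&eacute;
```

Either way, the parsed value still carries the HTML entity and can go straight to the browser.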
To actually do this programmatically (assuming you're dealing with input that includes the literal characters you're trying to escape), you're probably best off with HTML::Entities, as mentioned above: it's distributed with HTML::Parser but does not partake of the weightiness of that module (or its need for compilation). If you have it installed, then something along these general lines should do the trick:
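A minimal sketch of that idea, assuming HTML::Entities is installed. The variable names and the `<city>` element are illustrative only.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

my $text = "M\x{FC}nchen";    # "München", with a literal u-umlaut

# Encode twice ON PURPOSE: the XML parser will undo one level of escaping,
# and what is left must still be a valid HTML entity for the browser.
my $once  = encode_entities($text);    # M&uuml;nchen
my $twice = encode_entities($once);    # M&amp;uuml;nchen

print "<city>$twice</city>\n";         # <city>M&amp;uuml;nchen</city>
```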
Possibly the lamest code example I've ever posted, that... I do suggest that comment, though, for the benefit of your associates and successors. If that doesn't encode all the characters you need encoded, check out the other parameters to that function--it can do what you need done.
Update: added print line to snippet, in a possibly doomed attempt to make it resemble actual code.
Update: doh! Working too hard and thinking too little--XMLout does, of course, escape XML entities, so only one round of HTML escaping is called for (if you're using XMLout). Thanks to ajt for the catch!
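For the XMLout case, a sketch along these lines shows why one round is enough. It assumes XML::Simple and HTML::Entities are both installed; the hash key and the `RootName`/`NoAttr` settings are just choices made for this example.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);
use XML::Simple qw(XMLout);

# One round of HTML escaping only; XMLout escapes the resulting & itself.
my %record = ( name => encode_entities("M\x{FC}nchen") );    # M&uuml;nchen

my $xml = XMLout( \%record, RootName => 'doc', NoAttr => 1 );
print $xml;    # the name element should contain M&amp;uuml;nchen
```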
If God had meant us to fly, he would *never* have given us the railroads.