
Re: , , and XML::Simple

by ChemBoy (Priest)
on May 03, 2002 at 07:21 UTC

in reply to , , and XML::Simple

I've dealt with various aspects of this problem at different times, so let me take a stab here...

The first option that comes to mind is this: if XML::Simple can handle your character set, and your character set is an acceptable one for web browsers (such as ISO-Latin-1), why not just use the raw characters? Most browsers that can display the characters correctly at all can handle that character set, as far as I know.

However, I'll try to answer the opposite question as well (can't hurt, and might just be helpful).

The problem you're having is that XML::Simple does not recognise the entities you're passing it in your XML source. This is entirely appropriate--XML itself predefines only five entities: &amp; &lt; &gt; &apos; and &quot; (for & < > ' and "). Anything else has to be declared in a DTD. Therefore, when the parser encounters something like &auml;, which is unquestionably an entity but not one it's familiar with, it does what every good XML parser does when it finds something unexpected: die.
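To make the failure concrete, here is a minimal sketch. It assumes XML::Simple is installed along with a parser backend (typically the expat-based XML::Parser); the exact error text depends on the backend, so the code only checks that the parse failed. The element names are made up for illustration.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# A predefined entity such as &amp; parses fine and comes back as a plain "&".
my $ok = XMLin('<doc><word>a &amp; b</word></doc>');
print "parsed: $ok->{word}\n";

# &auml; is an HTML entity, not an XML one, and no DTD declares it here,
# so the parser dies. eval traps the exception and leaves the message in $@.
my $bad = eval { XMLin('<doc><word>M&auml;dchen</word></doc>') };
my $err = $@;
print "parse failed: $err" unless defined $bad;
```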

The obvious solution to this is to tell the parser to recognize your entities, but there are two objections:

  1. that could easily get rather un-simple
  2. that would definitely defeat your original purpose

Why this last? Well, by the time XML::Simple spits out your parsed data, it has already translated the entities in its input to the corresponding character data (much as the web browser will with the HTML entities). Which leaves us right where we started, really--if you can handle outputting those characters to the browser, then just put them in your XML source to begin with.

However, this suggests the solution that I personally have used for this problem the few times I've encountered it: double-escape the data going into your XML source. That is, if you want to parse your XML and have the result contain the string "&eacute;", arrange for your XML source file to contain the string "&amp;eacute;". The alternative is to enclose the relevant sections in CDATA sections, which is acceptable for some things (including wholesale HTML markup in XML files) but generally overkill, in my opinion.
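Both approaches can be sketched as follows, again assuming XML::Simple and a parser backend are installed; the document and element names are invented for the example. In each case the parsed value should be the literal entity text, ready to hand to a browser.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# Double-escaped: the XML parser decodes &amp; to "&" and leaves "auml;" alone.
my $escaped = XMLin('<doc><word>M&amp;auml;dchen</word></doc>');

# CDATA section: the parser passes the contents through without decoding.
my $cdata = XMLin('<doc><word><![CDATA[M&auml;dchen]]></word></doc>');

print "double-escaped: $escaped->{word}\n";
print "CDATA:          $cdata->{word}\n";
```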

To actually do this programmatically (assuming you're dealing with input that includes the literal characters you're trying to escape), you're probably best off with HTML::Entities, as mentioned above: it's distributed with HTML::Parser but does not partake of the weightiness of that module (or its need for compilation). If you have it installed, then something along these general lines should do the trick:

    use HTML::Entities;
    use XML::Simple;

    while (<TEXT_FILE>) {
        encode_entities $_;    # in void context this modifies $_ in place
        encode_entities $_;    # yes, really twice
        do_stuff($_);
    }
    print XMLout($foo);        # the data structure built by do_stuff()

Possibly the lamest code example I've ever posted, that... I do suggest that comment, though, for the benefit of your associates and successors. If that doesn't encode all the characters you need encoded, check out the other parameters to that function--it can do what you need done.
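As a rough illustration of those other parameters: encode_entities takes an optional second argument naming the characters to treat as unsafe, written in regex character-class syntax. The sketch below assumes HTML::Entities is installed and uses a made-up sample string, with the umlauts given as Latin-1 code points so the source file's own encoding doesn't matter.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# "\xFC" and "\xF6" are u-umlaut and o-umlaut in Latin-1.
my $raw = "M\xFCller & S\xF6hne";

# Default set: high-bit and control characters plus < > & " '
my $all = encode_entities($raw);

# Restricted set: only high-bit characters, so the bare "&" is left alone.
my $high_only = encode_entities($raw, '\x80-\xFF');

print "$all\n";
print "$high_only\n";
```

Called in non-void context like this, encode_entities returns a new string and leaves $raw untouched.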

Good luck!

Update: added print line to snippet, in a possibly doomed attempt to make it resemble actual code.

Update: doh! Working too hard and thinking too little--XMLout does, of course, escape XML entities, so only one round of HTML escaping is called for (if you're using XMLout). Thanks to ajt for the catch!
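A small sketch of that corrected flow, assuming HTML::Entities and XML::Simple (with a parser backend for the read-back step) are installed. The hash, root name, and sample word are invented for the example; NoAttr and RootName are only there to get element-style output.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);
use XML::Simple qw(XMLin XMLout);

# One round of HTML escaping: a-umlaut (Latin-1 \xE4) becomes &auml;
my $word = encode_entities("M\xE4dchen");

# XMLout supplies the second round by escaping the "&" as &amp;
my $xml = XMLout({ word => $word }, RootName => 'doc', NoAttr => 1);
print $xml;

# Reading it back undoes only the XML layer, leaving the HTML entity intact.
my $back = XMLin($xml);
print "$back->{word}\n";
```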

If God had meant us to fly, he would *never* have given us the railroads.
    --Michael Flanders

Replies are listed 'Best First'.
Re: Re: , , and XML::Simple
by BioHazard (Pilgrim) on May 03, 2002 at 14:33 UTC
    Thank you very much!

    What I needed was a combination of mojotoad's and ChemBoy's suggestions. I had not realized that UTF-8 does not take these characters. With ISO-8859-2 the script does not die. And with this double encoding, like "&amp;auml;", the browser prints out the string I actually wanted.

    Again, thank you for helping me!

