UTF-8 entities in XML/HTML?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

To start with an overview: I have a big file of Western-encoded messages, you could call it some sorta non-standard blog, marked up with HTML. I am now trying to clean things up, storing the messages as XML (more specifically: as RSS 2.0) and displaying them as HTML. The encoding should change to UTF-8.

Most things work, but the UTF-8 encoding of special entities for the XML drives me nuts. I have been using the XML::RSS Perl module, which might or might not be a good idea after testing it. E.g. sometimes the encode_output switch is ignored, depending on which server I execute the script on.

It also seems XML::RSS does not correctly support the common way of encoding/decoding UTF-8 entities. In those cases where the mentioned "encode_output" of XML::RSS does work, it produces something like this for the lower-case 'a' with two dots on top:

&#xC3;&#xA4;

When XML::RSS reads in entities like this, they get decoded correctly. But some common RSS readers do not swallow them. It _seems_ that the encoding for the above example should have been:

&#xC3A4;

When XML::RSS reads entities like this, something goes wrong and I see funny characters as a result. After some frustration I wrote the encoding myself to produce the second version, which solves the encoding part, but XML::RSS does not like it.

Are both ways above correct when encoding UTF-8 in XML?
Is using XML::RSS a bad idea? Any alternatives?
How to best encode those entities in HTML for output?
Side topic: Could it be that RSS readers better support decimal encoding, e.g. &#50084;, than the hexadecimal equivalent &#xC3A4;?

I would be really glad if someone could help with any part of it.

Jot

20080909 Janitored by Corion: Added closing list tag, as per Writeup Formatting Tips


Replies are listed 'Best First'.
Re: UTF-8 entities in XML/HTML? by moritz (Cardinal) on Sep 03, 2008 at 14:00 UTC
`&#xC3;&#xA4;` This encodes two characters, not one, so it's certainly not what you want. To avoid getting something like that, decode your strings with Encode::decode and then apply `utf8::upgrade` on the string. (That last step may not be necessary if all you want is to entity-encode the string.) See also Perl and Character Encodings, Encode, perluniintro, perlunicode, perlunifaq.
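A minimal sketch of the decode-then-entity-encode approach described above, using only core modules. It assumes the legacy data is ISO-8859-1 (for Windows "Western" data, `cp1252` may be the better guess), and `entity_encode` is a hypothetical helper name, not part of Encode or XML::RSS. The point is that each character becomes exactly one numeric reference built from its code point.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the input is ISO-8859-1; swap in 'cp1252' if that fits the data better.
my $bytes = "B\xE4r";                       # raw Latin-1 octets for "Baer" with a-umlaut
my $text  = decode('ISO-8859-1', $bytes);   # now a string of characters, not octets

# Hypothetical helper: replace every non-ASCII character with a single
# hexadecimal character reference derived from its code point.
sub entity_encode {
    my ($str) = @_;
    $str =~ s/([^\x00-\x7F])/sprintf('&#x%X;', ord($1))/ge;
    return $str;
}

print entity_encode($text), "\n";   # prints B&#xE4;r
```

This sketch does not escape `&`, `<` or `>`, which XML output would still need.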
Re^2: UTF-8 entities in XML/HTML? by Anonymous Monk on Sep 03, 2008 at 15:37 UTC
Juerd wrote:
> Are you sure your data is properly decoded when
> you read it from file/socket/database?

Thanks for answering, Juerd. The script reads it from an RSS file, and I have just double-checked: if the bytes are separately encoded, like `&#xC3;&#xA4;`, XML::RSS decodes it correctly. How do I know it's correct? Well, I am able to read the character displayed in the HTML output (in the HTML source it's unencoded, but since Firefox thinks the encoding is UTF-8, actually set via the HTML header, it displays the character as expected).

Perhaps interesting regarding the test I just ran: if I replace one or all instances of separate-byte entities with the (supposedly correct) single-code version, I get this error in Apache's log: Wide character in print. And what I see in the browser are little squares that contain tiny hex numbers, e.g. C3 and A4. If all entities are separate-bytes encoded, there is no error.

--

moritz wrote:
> ...This encodes two characters, not one,
> so it's certainly not what you want.

Thanks for your answer! Great, this confirms my finding. I will try out what you suggested (`Encode::decode + utf8::upgrade`).

Jot
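Perl emits "Wide character in print" when a string containing a character above 255 is printed to a filehandle that has no encoding layer. The sketch below shows the two usual remedies with core Perl: encoding explicitly, or attaching an encoding layer to the handle. The sample string is made up for illustration. It is probably also why the two-entity version produced no warning: both of its characters are below 256.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string (not octets): "B", a-umlaut, "r", space, euro sign.
my $text = "B\x{E4}r \x{20AC}";

# Option 1: encode explicitly, giving a string of UTF-8 octets
# that is safe to print to a raw filehandle.
my $octets = encode('UTF-8', $text);
printf "%d characters, %d octets\n", length($text), length($octets);

# Option 2: attach an encoding layer, then print character strings directly.
# Do not also print pre-encoded octets through this layer (double encoding).
binmode(STDOUT, ':encoding(UTF-8)');
print $text, "\n";
```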
Re: UTF-8 entities in XML/HTML? by pat_mc (Pilgrim) on Sep 03, 2008 at 15:08 UTC
Hi, Jot - I have to admit I am not familiar with most of the modules you mention, so my advice below may be completely off the mark. I have, however, also worked with large UTF-8 encoded XML files that contained German language entities, some of which were sensitive to character encoding (especially the umlauts). The solution that worked best for me was to use the Linux command `recode encoding_old..encoding_new file_name`, where `encoding_old` and `encoding_new` are the character encodings between which you want to re-encode your XML file `file_name`. I found it easiest to convert the whole XML file into the appropriate character encoding before processing it and then re-convert it once you are done. To find out which character sets your system has available for re-encoding, use `recode -l`. Again, I am not sure this really helps. Apologies if it doesn't. Cheers - Pat
Re^2: UTF-8 entities in XML/HTML? by Anonymous Monk on Sep 03, 2008 at 16:01 UTC
It does help, thanks! Your answer is relevant for my data migration phase, which I haven't tackled yet. At the moment I am trying to get the script working in a clean way. Phase two will be converting the old messages to UTF-8, where your hints will come in handy. Since you have touched German entities before: I have heard there are three ways to encode them (at least in HTML). One is via a named entity, like `&auml;`, another via a one-byte numeric code in the range 128..255, and the third is the two-byte one I am trying to achieve (as I thought it's a more generic way to encode things). Any idea what should be the preference? Jot
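For comparison, a small sketch of how numeric references relate to code points. A numeric character reference names a Unicode code point, not the bytes of any encoding, so the decimal and hexadecimal spellings of the same code point are equivalent. Named entities such as `&auml;` are defined by HTML; XML itself predefines only `lt`, `gt`, `amp`, `apos` and `quot`. `decode_numeric_refs` is a hypothetical helper written for this illustration.

```perl
use strict;
use warnings;

my $cp  = 0xE4;                       # code point of the lower-case a-umlaut
my $dec = sprintf('&#%d;',  $cp);     # decimal reference
my $hex = sprintf('&#x%X;', $cp);     # hexadecimal reference
print "$dec $hex\n";                  # prints &#228; &#xE4;

# Hypothetical helper: turn numeric references back into characters.
sub decode_numeric_refs {
    my ($s) = @_;
    $s =~ s/&#x([0-9A-Fa-f]+);/chr(hex($1))/ge;   # hexadecimal form
    $s =~ s/&#([0-9]+);/chr($1)/ge;               # decimal form
    return $s;
}

# Both spellings yield the same single character.
print "same\n" if decode_numeric_refs($dec) eq decode_numeric_refs($hex);

# Two references always mean two characters (here U+00C3 and U+00A4) ...
print length(decode_numeric_refs('&#xC3;&#xA4;')), "\n";       # prints 2

# ... and &#xC3A4; is code point 0xC3A4 (decimal 50084), not the a-umlaut.
printf "U+%04X\n", ord(decode_numeric_refs('&#xC3A4;'));       # prints U+C3A4
```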
Re^3: UTF-8 entities in XML/HTML? by pat_mc (Pilgrim) on Sep 04, 2008 at 16:29 UTC
Hi, Jot - I was working with the SALSA corpus of syntactically and semantically annotated German newspaper sentences. The corpus follows the TIGER annotation standards. In the corpus, a UTF-8 encoded lowercase German a-umlaut ('ä') would, for example, be rendered in ISO-8859-1 as `Ã¤`. I am not sure, however, which encoding variant of those you mention this corresponds to. Hope this helps anyway. Pat
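The rendering described above can be reproduced with the core Encode module. This sketch assumes nothing about the corpus itself. It only shows that the two UTF-8 octets of the a-umlaut, when misread as ISO-8859-1, become the two characters U+00C3 and U+00A4, which is the pair that the two-entity form spells out.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $utf8_octets = encode('UTF-8', "\x{E4}");           # a-umlaut as UTF-8: octets C3 A4
my $misread     = decode('ISO-8859-1', $utf8_octets);  # each octet taken as one character

# List the code points of the misread string.
print join(' ', map { sprintf('U+%04X', ord) } split(//, $misread)), "\n";
# prints U+00C3 U+00A4

# Decoding with the right encoding recovers the single original character.
my $correct = decode('UTF-8', $utf8_octets);
printf "U+%04X\n", ord($correct);                      # prints U+00E4
```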
Re: UTF-8 entities in XML/HTML? by Juerd (Abbot) on Sep 03, 2008 at 13:55 UTC
Are you sure your data is properly decoded when you read it from file/socket/database?
