http://www.perlmonks.org?node_id=708751

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

To start with an overview: I have a big file of Western-encoded messages (you could call it a sort of non-standard blog), marked up with HTML. I am now trying to clean things up by storing the messages as XML (more specifically, as RSS 2.0) and displaying them as HTML. The encoding should change to UTF-8.

Most things work; only the UTF-8 encoding of special entities for the XML drives me nuts. I have been using the XML::RSS Perl module, which, after testing it, might or might not be a good idea. For example, the encode_output switch is sometimes ignored, depending on which server I run the script on.

It also seems XML::RSS does not correctly support the common way of encoding/decoding UTF-8 entities. In those cases where the mentioned "encode_output" of XML::RSS does work it produces something like this for the lower-case 'a' with two dots on top:

ä

When XML::RSS reads in entities like this it gets correctly decoded. But some common RSS readers are not swallowing it. It _seems_ that the encoding for the above example should have been:

쎤

When XML::RSS reads entities like this, something goes wrong and I see funny characters as a result. After some frustration I wrote the encoding myself, producing the second version, which solves the encoding part, but XML::RSS does not like it.

    My questions are:
  1. Are both ways above correct when encoding UTF-8 in XML?
  2. Is using XML::RSS a bad idea? Any alternatives?
  3. How to best encode those entities in HTML for output?
  4. Side topic: Could it be that RSS readers better support decimal encoding, e.g. &#50084;, than the hexadecimal equivalent, &#xC3A4;?

I would be really glad if someone could help with any part of it.

Jot

20080909 Janitored by Corion: Added closing list tag, as per Writeup Formatting Tips

Re: UTF-8 entities in XML/HTML?
by moritz (Cardinal) on Sep 03, 2008 at 14:00 UTC
    ä

    This encodes two characters, not one, so it's certainly not what you want.

    To avoid getting something like that, decode your strings with Encode::decode and then apply utf8::upgrade to the string. (That last step may not be necessary if all you want is to entity-encode the string.)

    See also Perl and Character Encodings, Encode, perluniintro, perlunicode, perlunifaq.
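
    A minimal sketch of that advice, assuming the input arrives as UTF-8 bytes; the sample word and the variable names are made up for illustration. Numeric character references name Unicode code points rather than UTF-8 bytes, so once the string is decoded, "ä" (U+00E4) comes out as a single reference:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as they might come from a UTF-8 file: "ä" is the two bytes C3 A4.
my $bytes = "M\xC3\xA4rchen";

# Decode the bytes into a Perl character string (one character per code point).
my $text = decode('UTF-8', $bytes);

# Replace each non-ASCII character with a hexadecimal character reference.
(my $escaped = $text) =~ s/([^\x00-\x7F])/sprintf('&#x%X;', ord($1))/ge;
print "$escaped\n";    # M&#xE4;rchen

# For comparison: the same substitution applied to the undecoded bytes.
(my $wrong = $bytes) =~ s/([^\x00-\x7F])/sprintf('&#x%X;', ord($1))/ge;
print "$wrong\n";      # M&#xC3;&#xA4;rchen
```

    Without the decode step each byte is escaped on its own, which gives the two-character result discussed above.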

      Juerd wrote:
      > Are you sure your data is properly *decoded* when
      > you read it from file/socket/database?

      Thanks for answering, Juerd. The script reads it from an RSS file, and I have just double-checked: if the bytes are separately encoded, like ä, XML::RSS decodes it correctly. How do I know it's correct? Well, I am able to read the character displayed in the HTML output (in the HTML source it's unencoded, but since Firefox thinks the encoding is UTF-8, which is actually set via the HTML header, it displays the character as expected).

      Perhaps interesting regarding the test I just ran: if I replace one or all instances of the separate-bytes entities with the (supposedly correct) single-code version, I get this error in Apache's log: Wide character in print. And what I see in the browser are little squares that contain tiny hex numbers, e.g. C3 and A4.

      If all entities are separate-bytes encoded, there is no error.
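
      One plausible reading of that error, sketched under the assumption that the output filehandle has no encoding layer: Perl warns with "Wide character in print" when a printed string contains a code point above 255, which is the case for U+C3A4 but not for the two separate characters U+00C3 and U+00A4. Encoding on output, either explicitly or through a layer on the handle, avoids the warning. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string holding U+C3A4, a single code point above 255.
my $text = "\x{C3A4}";

# Option 1: encode explicitly and work with the resulting bytes.
my $octets = encode('UTF-8', $text);
printf "%d character -> %d bytes\n", length($text), length($octets);

# Option 2: let the filehandle encode everything printed to it.
binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```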

      --
      moritz wrote:
      > ...This encodes two characters, not one,
      > so it's certainly not what you want.

      Thanks for your answer! Great, this confirms my finding.

      I will try out what you suggested (Encode::decode + utf8::upgrade).

      Jot

Re: UTF-8 entities in XML/HTML?
by pat_mc (Pilgrim) on Sep 03, 2008 at 15:08 UTC
    Hi, Jot -

    I have to admit I am not familiar with most of the modules you mention so my advice below may be completely off the mark. I have, however, also worked with large UTF-8 encoded XML files before that contained German language entities, some of which were sensitive to character encoding (especially the umlauts).

    The solution that worked best for me was the Linux command recode encoding_old..encoding_new file_name, where encoding_old and encoding_new are the character encodings you want to convert between and file_name is your XML file. I found it easiest to convert the whole XML file into the appropriate character encoding before processing it and then convert it back once done. To find out which character sets your system has available for re-encoding, use recode -l.
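
    The same kind of conversion can also be done from inside a Perl script. The sketch below uses from_to from the Encode module and assumes the "Western" source encoding is ISO-8859-1, which the thread does not confirm; the sample word is invented.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# "Märchen" as ISO-8859-1 bytes: "ä" is the single byte E4.
my $data = "M\xE4rchen";

# Convert the byte string in place from ISO-8859-1 to UTF-8.
from_to($data, 'ISO-8859-1', 'UTF-8');

# Show the resulting bytes in hex; "ä" is now the two bytes C3 A4.
print join(' ', map { sprintf('%02X', ord($_)) } split(//, $data)), "\n";
```

    If the old data turns out to be Windows-1252 instead, the encoding name passed to from_to would need to change accordingly.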

    Again, I am not sure this really helps. Apologies if it doesn't.

    Cheers -

    Pat

      It does help, thanks! Your answer is relevant for my data migration phase, which I haven't tackled yet. At the moment I am trying to get the script working in a clean way. Phase two will be converting the old messages to UTF-8, where your hints will come in handy.

      Since you have worked with German entities before: I have heard there are three ways to encode them (at least in HTML). One is a named entity, like &auml;; another is a one-byte numeric code in the range 128..255; and the third is the two-byte one I am trying to achieve (I thought it was a more generic way to encode things). Any idea which should be preferred?
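
      The named and numeric forms can be sketched as follows, assuming "ä" is the target character. Numeric references use its code point, U+00E4, whether written in decimal or hexadecimal. Named entities such as &auml; come from HTML, while XML predefines only &lt;, &gt;, &amp;, &apos; and &quot;, so in an RSS file the numeric forms are the safer choice. The variable names are illustrative.

```perl
use strict;
use warnings;

# Code point of "ä" (U+00E4).
my $cp = ord("\x{E4}");

my $decimal = sprintf('&#%d;',  $cp);   # decimal numeric reference
my $hex     = sprintf('&#x%X;', $cp);   # hexadecimal numeric reference
my $named   = '&auml;';                 # HTML named entity, not predefined in XML

print "$decimal $hex $named\n";         # &#228; &#xE4; &auml;
```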

      Jot

        Hi, Jot -

        I was working with the SALSA corpus of syntactically and semantically annotated German newspaper sentences. The corpus follows the TIGER annotation standards.

        In the corpus, a UTF-8 encoded lowercase German a-umlaut ('ä'), e.g., would be rendered in ISO-8859-1 as Ã¤. I am not sure, however, which encoding variant of those you mention this corresponds to.

        Hope this helps anyway.

        Pat
Re: UTF-8 entities in XML/HTML?
by Juerd (Abbot) on Sep 03, 2008 at 13:55 UTC
    Are you sure your data is properly *decoded* when you read it from file/socket/database?