i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode'
by bcrowell2 (Friar) on Feb 25, 2008 at 01:25 UTC
bcrowell2 has asked for the wisdom of the Perl Monks concerning the following question:
I have a program that reads utf8 from a file, and writes utf8 to stdout. It's internationalized in a bunch of languages. The relevant section of the code seems to be the following:
As you can see from the length of the comments, it hasn't been as straightforward as I would have liked to make this Just Work for my users.
The latest problem has to do with the line 'binmode STDOUT, ":utf8";'. I added it to avoid a "Wide character in print" warning in Czech. However, adding that line seems to have broken the program for a Danish-speaking user. If he uses a utf8-encoded input file containing a ligatured ae character (bytes c3 a6), he gets errors like 'utf8 "\xF8" does not map to Unicode at ./when line 1389, <FILE> line 29.' I do not get the same error with the same input file on my own machine. He's running Debian Etch with LANG=en_US.ISO-8859-15 LC_CTYPE=C and a US keyboard layout. I'm running Ubuntu Gutsy with a US setup. It sounds as though the utf8 codes that perl is complaining about are different from the ones that are actually in his input file: they all have F and E in the LSB. (I'm checking back with him on this, since there's some confusion in the emails.)
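For reference, here is a minimal sketch of one way to get this kind of message. It rests on an assumption that is not confirmed for the Danish user's setup: that somewhere a byte string in ISO-8859-1/15 (where 0xF8 is the o-slash) is being decoded as UTF-8. The sample string "r\xF8d" is made up for illustration and is not taken from his input file.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical sample: the Danish word for "red" in Latin-1 bytes.
# 0xF8 on its own is not a valid UTF-8 sequence.
my $latin1_bytes = "r\xF8d";

# Strict UTF-8 decoding that dies on malformed input instead of
# silently substituting a replacement character.
my $decoded = eval { decode('UTF-8', $latin1_bytes, FB_CROAK() | LEAVE_SRC()) };
my $error   = $@;

if (defined $decoded) {
    print "decoded fine\n";
}
else {
    print "decode failed: $error";
}

# The same bytes decode cleanly when treated as ISO-8859-1.
my $as_latin1 = decode('ISO-8859-1', $latin1_bytes);
printf "as Latin-1, middle character is U+%04X\n", ord(substr($as_latin1, 1, 1));
```

If the real failure works this way, the bytes perl complains about would be Latin-1 bytes from some other source (or from a differently encoded copy of the file), not the c3 a6 pair itself.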
The Wikipedia article on the ae character, http://en.wikipedia.org/wiki/%C3%86 , says it's Unicode e6 (U+00E6). Maybe this is a character that can be encoded in Unicode in two different ways? If I display c3a6 in a Unicode-aware terminal like mlterm, it does display as a ligatured ae. Maybe perl is trying to convert it to the single-character version, or something??
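A quick way to check how c3a6 and e6 relate is to decode the two bytes with the core Encode module and look at the resulting code point. This is only a sketch for checking the numbers; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# The two bytes as they appear in the utf8-encoded input file.
my $bytes = "\xC3\xA6";

# Decode the byte string into a Perl character string.
my $chars = decode('UTF-8', $bytes, FB_CROAK() | LEAVE_SRC());

printf "%d byte(s) decode to %d character(s), code point U+%04X\n",
    length($bytes), length($chars), ord($chars);

# Encoding the character again gives back the original bytes.
my $round_trip = encode('UTF-8', $chars);
printf "re-encoded: %s\n",
    join ' ', map { sprintf '%02x', ord } split //, $round_trip;
```

This prints "2 byte(s) decode to 1 character(s), code point U+00E6" and "re-encoded: c3 a6", so c3a6 is the UTF-8 byte encoding of the single code point e6 rather than a second, different character.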
Does anyone have any clue what might be happening here?