i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode'

by bcrowell2 (Friar)
on Feb 25, 2008 at 01:25 UTC ( #669902=perlquestion )
bcrowell2 has asked for the wisdom of the Perl Monks concerning the following question:

O Monks,

I have a program that reads utf8 from a file, and writes utf8 to stdout. It's internationalized in a bunch of languages. The relevant section of the code seems to be the following:

binmode STDOUT, ":utf8"; # eliminates "Wide character in print" error in Czech
use open ":encoding(utf8)"; # otherwise utf8 in input files is read as if 1 character==1 byte
# The combination of two lines above is needed in order to get the following to work:
#   - Czech characters coded into the source print without the "Wide character in print" error.
#   - Accented characters and Greek characters in the input file are read properly and printed back out properly.
# When testing this, make sure to use a terminal such as mlterm that can handle accented characters,
# and make sure that the --nofilter_accents_on_output has not been set automatically based on the
# value of the $TERM variable. (Using mlterm prevents this.)
# See "man perlunicode".
# An example of the confusing way all of this works:
#   perl -e 'binmode STDOUT,":utf8"; print "\x{11b}\x{e9}"'
#   perl -e 'binmode STDOUT,":utf8"; print "\x{11b}\x{e9}"' >a.a
#   perl -e 'binmode STDOUT,":utf8"; open(F,"<a.a"); $x=<F>; close F; print $x'
#   perl -e 'binmode STDOUT,":utf8"; open(F,"<a.a"); $x=<F>; close F; print length $x'
#   perl -e 'use open ":encoding(utf8)"; binmode STDOUT,":utf8"; open(F,"<a.a"); $x=<F>; close F; print $x'
#   perl -e 'use open ":encoding(utf8)"; binmode STDOUT,":utf8"; open(F,"<a.a"); $x=<F>; close F; print length $x'
use utf8; # Indicates that source can contain utf8, which we use for the Greek translation.
use locale;

As you can see from the length of the comments, it hasn't been as straightforward as I would have liked to make this Just Work for my users.
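The effect those one-liners are probing can be reproduced without a terminal or a scratch file. The following is only a sketch, assuming in-memory filehandles are an acceptable stand-in for a.a; the helper name length_read_with_layer is made up for illustration. It reads the same four UTF-8 bytes once with no decoding layer and once through :encoding(UTF-8).

```perl
use strict;
use warnings;
use Encode qw(encode);

# The same two characters as in the one-liners: U+011B and U+00E9.
# Encoded as UTF-8 they take two bytes each, four bytes in total.
my $utf8_bytes = encode('UTF-8', "\x{11b}\x{e9}");

# Hypothetical helper: read one line from an in-memory file through the
# given PerlIO layer and return how many characters Perl sees in it.
sub length_read_with_layer {
    my ($layer, $data) = @_;
    open(my $fh, "<$layer", \$data) or die "cannot open in-memory file: $!";
    my $line = <$fh>;
    close $fh;
    return length $line;
}

printf "raw bytes:        %d\n", length_read_with_layer(':raw', $utf8_bytes);
printf "decoded as UTF-8: %d\n", length_read_with_layer(':encoding(UTF-8)', $utf8_bytes);
```

Without a decoding layer each byte counts as one character, so the raw read should report 4; with the layer the same data should come back as 2 characters.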

The latest problem has to do with the line 'binmode STDOUT, ":utf8";'. This was needed in order to avoid a "Wide character in print" error in Czech. However, adding that line seems to have broken the program for a Danish-speaking user. If he uses a utf8-encoded input file with a ligatured ae character (c3a6), he gets errors like 'utf8 "\xF8" does not map to Unicode at ./when line 1389, <FILE> line 29.' I do not get the same error on the same input file on my own machine. He's running Debian Etch with LANG=en_US.ISO-8859-15 LC_CTYPE=C, and a US keyboard layout. I'm running Ubuntu Gutsy with a US setup. I need to check back with him, but it sounds as though the utf8 codes that perl is complaining about are different than the ones that are actually in his input file -- they all have F and E in the LSB. (I'm checking back with him on this, since there's some confusion in the emails.)

The Wikipedia article on the ae character, http://en.wikipedia.org/wiki/%C3%86 , says it's unicode e6. Maybe this is a character that can be encoded in unicode in two different ways? If I display c3a6 in a unicode-aware terminal like mlterm, it does display as a ligatured ae. Maybe perl is trying to convert it to the single-character version, or something??
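For what it's worth, e6 and c3a6 are not two competing encodings of the character: U+00E6 is the code point, and c3a6 is that code point written out as UTF-8 bytes. A small sketch to make the distinction visible, using Encode's encode(); the helper name hex_bytes is an assumption for illustration only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of a character string after encoding it
# in the named encoding.
sub hex_bytes {
    my ($encoding, $chars) = @_;
    return unpack('H*', encode($encoding, $chars));
}

# U+00E6 is the ae ligature, U+00F8 is the o with stroke.
printf "U+00E6 as UTF-8:       %s\n", hex_bytes('UTF-8',       "\x{e6}");
printf "U+00E6 as ISO-8859-15: %s\n", hex_bytes('ISO-8859-15', "\x{e6}");
printf "U+00F8 as UTF-8:       %s\n", hex_bytes('UTF-8',       "\x{f8}");
printf "U+00F8 as ISO-8859-15: %s\n", hex_bytes('ISO-8859-15', "\x{f8}");
```

In UTF-8 both characters take two bytes (c3a6 and c3b8), while in ISO-8859-15 each is the single byte whose value equals the code point (e6 and f8). A lone f8 byte in the input is therefore a hint that the file is not UTF-8.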

Does anyone have any clue what might be happening here?

TIA!

Re: i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode'
by Juerd (Abbot) on Feb 25, 2008 at 02:06 UTC

    How are you READING the UTF-8 data? Outputting is hard to do wrong. Indeed you just set an :encoding or :utf8 layer on the output handle.

    However, if you use :utf8 for input, you're in for trouble (malfunction and security bugs). Always use :encoding for text input.

    The error message about 0xF8 (which is the Danish character ø, not æ, which is indeed 0xE6) suggests to me that the input is NOT UTF-8, but instead ISO-8859-1 or ISO-8859-15, and the :utf8 was used. Update: I meant :encoding(utf8) here. ":utf8" should of course not be used for input.

    If the input is ISO-8859, and the input layer is :utf8, you get lots of errors and you should be happy if any part of your program works correctly. Probably not the case here.

    If the input is ISO-8859, and the input layer is :encoding(utf8), you get substitution characters for practically all non-ASCII characters.

    The only correct way to read an ISO-8859-15 text file or stream is to use :encoding(ISO-8859-15). This can be done automatically based on the locale with "use open"; see its documentation. Note that doing so is likely to introduce problems for other users, especially those who don't have any locale set but do have a UTF-8 capable terminal. This, however, is not a Perl problem.

    If you haven't already done so, please forget everything you've ever read and learned about Perl unicode support, and read perlunitut.

    Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }
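The diagnosis above can be checked on a byte string directly, without any file handles. This is a minimal sketch, assuming Encode's decode() with FB_CROAK as a stand-in for what the :encoding layer does; the sample word and the helper name strict_decode are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample data: the Danish word "søde" as ISO-8859-15 bytes,
# where the letter ø is the single byte 0xF8.
my $latin9_bytes = "s\xF8de";

# Hypothetical helper: decode $bytes as $encoding and return the text,
# or undef if the bytes are not valid in that encoding.
sub strict_decode {
    my ($encoding, $bytes) = @_;
    my $text = eval {
        decode($encoding, $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text;
}

my $as_utf8   = strict_decode('UTF-8',       $latin9_bytes);
my $as_latin9 = strict_decode('ISO-8859-15', $latin9_bytes);

print defined $as_utf8 ? "valid UTF-8\n" : "not valid UTF-8\n";
printf "as ISO-8859-15: %d characters, second is U+%04X\n",
    length $as_latin9, ord substr($as_latin9, 1, 1);
```

Decoding as UTF-8 fails because 0xF8 cannot start a valid sequence there, whereas decoding as ISO-8859-15 yields four characters with U+00F8 in second place.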

      Thanks, Juerd, for your reply.

      >How are you READING the UTF-8 data? ... However, if you use :utf8 for input, you're in for trouble (malfunction and security bugs). Always use :encoding for text input.

      I think that's what I did -- see the second line of the code:

      use open ":encoding(utf8)";

      >The error message about 0xF8 (which is the Danish character ø, not æ, which is indeed 0xE6) suggests to me that the input is NOT UTF-8, but instead ISO-8859-1 or ISO-8859-15, and the :utf8 was used.

      Aha. I checked the file the user sent me:

      $ file a.a
      a.a: ISO-8859 text

      My program requires utf8 input, but the user was giving it iso-8859. I think when I cut and pasted it in a utf8-aware editor, it got changed into utf8.

      Although my documentation states that the input file has to be utf8, is there any way I can make an explicit check for a bogus encoding? I suppose the crudest thing I could do would be to look at the output of the unix "file" command, but I wonder if there's something more elegant.

        use open ":encoding(utf8)";

        Good.

        My program requires utf8 input, but the user was giving it iso-8859.

        There's no easy way to fix the user. :)

        Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

      Okay, here's what I came up with to test whether a file is valid utf8. I'm sure there's also some way to do this using a cpan module.

      sub file_is_valid_utf8 {
        my $f = shift;
        open(F,"<:raw",$f) or return 0;
        local $/;
        my $x=<F>;
        close F;
        return is_valid_utf8($x);
      }

      # What's passed to this routine has to be a stream of bytes, not a utf8 string in which the characters are complete utf8 characters.
      # That's why you typically want to call file_is_valid_utf8 rather than calling this directly.
      sub is_valid_utf8 {
        my $x = shift;
        my $leading0 = '[\x{0}-\x{7f}]';
        my $leading10 = '[\x{80}-\x{bf}]';
        my $leading110 = '[\x{c0}-\x{df}]';
        my $leading1110 = '[\x{e0}-\x{ef}]';
        my $leading11110 = '[\x{f0}-\x{f7}]';
        my $utf8 = "($leading0|($leading110$leading10)|($leading1110$leading10$leading10)|($leading11110$leading10$leading10$leading10))*";
        return ($x=~/^$utf8$/);
      }

        If you have the raw bytestring, the easiest way to see if it's valid UTF-8 is to decode it to a unicode string. If that fails, it wasn't utf8 enough :)

        utf8::decode($string) or die "Input is not valid UTF-8";
        or
        utf8::decode(my $text = $binary) or die "Input is not valid UTF-8";
        If you leave out the "or die" clause, any invalid UTF-8 will just be seen as ISO-8859-1.

        Update: changed the examples as per ikegami's sound response.

        Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

Re: i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode'
by bcrowell2 (Friar) on Feb 25, 2008 at 23:13 UTC
    Thanks, Juerd and ikegami, for your help! The utf8::decode solution is obviously cleaner (and probably faster) than my hand-coded version.
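Putting the pieces of the thread together, the file check could be rebuilt on top of utf8::decode roughly as follows. This is a sketch rather than the code that ended up in the program; file_is_valid_utf8 keeps the name used earlier in the thread, while bytes_are_valid_utf8 is a made-up helper name.

```perl
use strict;
use warnings;

# True if the byte string is well-formed UTF-8 as far as utf8::decode is
# concerned. Works on a copy, because utf8::decode converts in place.
sub bytes_are_valid_utf8 {
    my ($bytes) = @_;
    return utf8::decode($bytes) ? 1 : 0;
}

# Slurp a file as raw bytes and check it. Returns 0 if it cannot be read.
sub file_is_valid_utf8 {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or return 0;
    local $/;                       # slurp mode: read the whole file at once
    my $bytes = <$fh>;
    close $fh;
    return 0 unless defined $bytes;
    return bytes_are_valid_utf8($bytes);
}
```

One caveat: utf8::decode applies Perl's own, fairly permissive notion of UTF-8. If strict conformance matters, Encode's decode('UTF-8', $bytes, Encode::FB_CROAK) wrapped in an eval is the stricter alternative.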
