PerlMonks  

Re: i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode'

by Juerd (Abbot)
on Feb 25, 2008 at 02:06 UTC (#669909)


in reply to i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode'

How are you READING the UTF-8 data? Output is hard to get wrong: you just set an :encoding or :utf8 layer on the output handle.

However, if you use :utf8 for input, you're in for trouble (malfunction and security bugs). Always use :encoding for text input.
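A minimal sketch of the two directions described above, assuming a throwaway temporary file as the data source. The thread writes :encoding(utf8); this sketch uses :encoding(UTF-8), which Encode documents as the stricter spelling, and the sample word is only an illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample file, created only for this sketch.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Output: set an :encoding layer on the handle and print characters.
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
print {$out} "bl\x{E5}b\x{E6}rgr\x{F8}d\n";   # a word containing a-ring, ae, o-slash
close $out or die "Cannot close $path: $!";

# Input: use a checking :encoding layer, not a bare :utf8 layer.
open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot read $path: $!";
my $line = <$in>;
close $in;
chomp $line;

# length() counts characters here, not bytes.
my $chars = length $line;
print "read $chars characters\n";
```

The same layer string can be applied to an already open handle with binmode.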

The error message about 0xF8 (which is the Danish character ø, not æ, which is indeed 0xE6) suggests to me that the input is NOT UTF-8, but instead ISO-8859-1 or ISO-8859-15, and that :utf8 was used. Update: I meant :encoding(utf8) here. ":utf8" should of course not be used for input.

If the input is ISO-8859, and the input layer is :utf8, you get lots of errors and you should be happy if any part of your program works correctly. Probably not the case here.

If the input is ISO-8859, and the input layer is :encoding(utf8), you get substitution characters for practically all non-ASCII characters.
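To illustrate the mismatch, here is a small sketch that decodes the same ISO-8859-1 bytes two ways. It calls Encode::decode directly rather than going through a PerlIO layer, so the exact substitution an :encoding layer produces may differ; the sample bytes are an assumption.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "r\xF8d" is the ISO-8859-1 byte sequence for r, o-slash, d.
# A lone 0xF8 byte is not well-formed UTF-8.
my $latin1_bytes = "r\xF8d";

# With no CHECK argument, Encode substitutes U+FFFD for malformed input.
my $as_utf8 = decode('UTF-8', $latin1_bytes);

# Decoding with the encoding the data really uses recovers U+00F8.
my $as_latin1 = decode('ISO-8859-1', $latin1_bytes);

printf "decoded as UTF-8:      U+%04X\n", ord substr($as_utf8, 1, 1);
printf "decoded as ISO-8859-1: U+%04X\n", ord substr($as_latin1, 1, 1);
```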

The only correct way to read an ISO-8859-15 text file or stream is to use :encoding(ISO-8859-15). This can be done automatically based on the locale, with "use open"; see its documentation. Note that relying on the locale is likely to introduce problems for other users, especially those who don't have any locale set, but do have a UTF-8 capable terminal. This, however, is not a Perl problem.
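A sketch of reading ISO-8859-15 data with the matching layer. An in-memory filehandle stands in for the real file or stream, and the sample bytes are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample: in ISO-8859-15, 0xA4 is the euro sign and 0xF8 is o-slash.
my $bytes = "\xA4 5, r\xF8d\n";

# Open the byte string as a file, decoding it as ISO-8859-15.
open(my $in, '<:encoding(ISO-8859-15)', \$bytes)
    or die "Cannot open in-memory file: $!";
my $text = <$in>;
close $in;
chomp $text;

# Encode on the way out so a UTF-8 terminal shows the decoded characters.
binmode(STDOUT, ':encoding(UTF-8)');
print "$text\n";
```

The euro sign is the reason to name ISO-8859-15 explicitly: the same byte decodes to a different character under ISO-8859-1.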

If you haven't already done so, please forget everything you've ever read and learned about Perl unicode support, and read perlunitut.

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }


Re^2: i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode'
by bcrowell2 (Friar) on Feb 25, 2008 at 02:29 UTC

    Thanks, Juerd, for your reply.

    >How are you READING the UTF-8 data? ... However, if you use :utf8 for input, you're in for trouble (malfunction and security bugs). Always use :encoding for text input.

    I think that's what I did -- see the second line of the code:

    use open ":encoding(utf8)";

    >The error message about 0xF8 (which is the Danish character ø, not æ, which is indeed 0xE6) suggests to me that the input is NOT UTF-8, but instead ISO-8859-1 or ISO-8859-15, and the :utf8 was used.

    Aha. I checked the file the user sent me:

    $ file a.a
    a.a: ISO-8859 text

    My program requires utf8 input, but the user was giving it iso-8859. I think when I cut and pasted it in a utf8-aware editor, it got changed into utf8.

    Although my documentation states that the input file has to be utf8, is there any way I can make an explicit check for a bogus encoding? I suppose the crudest thing I could do would be to look at the output of the unix "file" command, but I wonder if there's something more elegant.

      use open ":encoding(utf8)";

      Good.

      My program requires utf8 input, but the user was giving it iso-8859.

      There's no easy way to fix the user. :)

      Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

Re^2: i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode'
by bcrowell2 (Friar) on Feb 25, 2008 at 04:43 UTC

    Okay, here's what I came up with to test whether a file is valid utf8. I'm sure there's also some way to do this using a CPAN module.

    sub file_is_valid_utf8 {
      my $f = shift;
      open(F,"<:raw",$f) or return 0;
      local $/;
      my $x=<F>;
      close F;
      return is_valid_utf8($x);
    }

    # What's passed to this routine has to be a stream of bytes, not a utf8 string in which the characters are complete utf8 characters.
    # That's why you typically want to call file_is_valid_utf8 rather than calling this directly.
    sub is_valid_utf8 {
      my $x = shift;
      my $leading0 = '[\x{0}-\x{7f}]';
      my $leading10 = '[\x{80}-\x{bf}]';
      my $leading110 = '[\x{c0}-\x{df}]';
      my $leading1110 = '[\x{e0}-\x{ef}]';
      my $leading11110 = '[\x{f0}-\x{f7}]';
      my $utf8 = "($leading0|($leading110$leading10)|($leading1110$leading10$leading10)|($leading11110$leading10$leading10$leading10))*";
      return ($x=~/^$utf8$/);
    }

      If you have the raw bytestring, the easiest way to see if it's valid UTF-8 is to decode it to a unicode string. If that fails, it wasn't utf8 enough :)

      utf8::decode($string) or die "Input is not valid UTF-8";
      or
      utf8::decode(my $text = $binary) or die "Input is not valid UTF-8";
      If you leave out the "or die" clause, any invalid UTF-8 will just be seen as ISO-8859-1.

      Update: changed the examples as per ikegami's sound response.

      Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

        That should be
        utf8::decode(my $text = $binary) or die "Input is not valid UTF-8";

        It works in-place.
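The in-place behaviour and the validity check discussed in this thread can be sketched as follows. The helper name is_valid_utf8_bytes and the sample byte strings are hypothetical. The helper uses Encode::decode with FB_CROAK as a stricter alternative to utf8::decode, which is a swapped-in technique and not something the posts above use.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns 1 if $bytes is well-formed UTF-8, else 0.
sub is_valid_utf8_bytes {
    my ($bytes) = @_;
    # FB_CROAK makes decode die on malformed input;
    # LEAVE_SRC keeps decode from modifying $bytes.
    my $ok = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

my $binary = "r\xC3\xB8d";   # r, o-slash, d encoded as UTF-8 (4 bytes)
my $latin1 = "r\xF8d";       # the same word as ISO-8859-1 bytes

# utf8::decode works in place, so decode a copy to keep the original bytes.
my $decoded_ok = utf8::decode(my $text = $binary);

# On invalid input utf8::decode returns false, which is what "or die" relies on.
my $latin1_ok = utf8::decode(my $copy = $latin1);

printf "UTF-8 bytes valid: %d, ISO-8859-1 bytes valid: %d\n",
    is_valid_utf8_bytes($binary), is_valid_utf8_bytes($latin1);
```

After this runs, $binary still holds the four original bytes while $text holds three characters.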
