Re: i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode'

in reply to i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode'

How are you READING the UTF-8 data? Outputting is hard to do wrong. Indeed you just set an :encoding or :utf8 layer on the output handle.
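As a minimal sketch of the output side, the snippet below attaches an `:encoding(UTF-8)` layer to an in-memory handle so that the resulting bytes can be inspected. The sample word and the variable names are only illustrative.

```perl
use strict;
use warnings;

# A character string (not bytes): "blåbærgrød", written with escapes so the
# encoding of this source file does not matter.
my $text = "bl\x{e5}b\x{e6}rgr\x{f8}d";

# Write through an :encoding layer into an in-memory buffer.
my $bytes = '';
open my $out, '>:encoding(UTF-8)', \$bytes
    or die "Cannot open in-memory handle: $!";
print {$out} $text;
close $out or die "Cannot close handle: $!";

# Each of the three non-ASCII characters takes two bytes in UTF-8.
printf "%d characters were written as %d bytes\n",
    length($text), length($bytes);
```

The same layer can be set on a real file handle or on STDOUT with `binmode`.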

However, if you use :utf8 for input, you're in for trouble (malfunction and security bugs). Always use :encoding for text input.

The error message about 0xF8 (which is the Danish ø character, not æ, which is indeed 0xE6) suggests to me that the input is NOT UTF-8, but instead ISO-8859-1 or ISO-8859-15, and the :utf8 was used. Update: I meant :encoding(utf8) here. ":utf8" should of course not be used for input.

If the input is ISO-8859, and the input layer is :utf8, you get lots of errors and you should be happy if any part of your program works correctly. Probably not the case here.

If the input is ISO-8859, and the input layer is :encoding(utf8), you get substitution characters for practically all non-ASCII characters.
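The mismatch can be reproduced at the byte level. This is a rough sketch that calls the core Encode module directly instead of going through an I/O layer, so the exact substitution behavior of a layer may differ from what is shown; the sample bytes are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "rød" as ISO-8859-1 bytes: the single byte 0xF8 is not valid UTF-8.
my $latin1_bytes = "r\xF8d";

# Default (lenient) decoding replaces malformed input with U+FFFD.
my $lossy    = decode('UTF-8', $latin1_bytes);
my $replaced = () = $lossy =~ /\x{FFFD}/g;
print "Lenient decode produced $replaced substitution character(s)\n";

# Strict decoding dies instead; LEAVE_SRC keeps the input string intact.
my $strict = eval {
    decode('UTF-8', $latin1_bytes,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
print "Strict decode failed: $@" unless defined $strict;
```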

The only correct way to read an ISO-8859-15 text file or stream is to use :encoding(ISO-8859-15). This can be done automatically based on the locale, with "use open", see its documentation. Note that using that is likely to introduce problems for other users, especially those who don't have any locale, but do have a UTF-8 capable terminal. This, however, is not a Perl problem.
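A small sketch of the explicit approach, assuming the input really is ISO-8859-15. An in-memory handle with made-up sample bytes stands in for the user's file here.

```perl
use strict;
use warnings;

# Sample ISO-8859-15 bytes: 0xF8 is "ø" and 0xA4 is the euro sign.
my $bytes = "r\xF8d gr\xF8d: 5 \xA4\n";

# Reading through the matching layer yields a proper character string.
open my $in, '<:encoding(ISO-8859-15)', \$bytes
    or die "Cannot open in-memory handle: $!";
my $line = <$in>;
close $in;

# Encode again on the way out, as for any text output.
binmode STDOUT, ':encoding(UTF-8)';
print $line;
```

For a real file, replace `\$bytes` with the file name; the layer is the part that matters.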

If you haven't already done so, please forget everything you've ever read and learned about Perl unicode support, and read perlunitut.

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

Re^2: i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode' by bcrowell2 (Friar) on Feb 25, 2008 at 02:29 UTC
Thanks, Juerd, for your reply.

> How are you READING the UTF-8 data? ... However, if you use :utf8 for input, you're in for trouble (malfunction and security bugs). Always use :encoding for text input.

I think that's what I did -- see the second line of the code: `use open ":encoding(utf8)";`

> The error message about 0xF8 (which is the Danish ø character, not æ, which is indeed 0xE6) suggests to me that the input is NOT UTF-8, but instead ISO-8859-1 or ISO-8859-15, and the :utf8 was used.

Aha. I checked the file the user sent me:

    $ file a.a
    a.a: ISO-8859 text

My program requires utf8 input, but the user was giving it iso-8859. I think when I cut and pasted it in a utf8-aware editor, it got changed into utf8. Although my documentation states that the input file has to be utf8, is there any way I can make an explicit check for a bogus encoding? I suppose the crudest thing I could do would be to look at the output of the unix "file" command, but I wonder if there's something more elegant.
Re^3: i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode' by Juerd (Abbot) on Feb 25, 2008 at 10:23 UTC
> use open ":encoding(utf8)";

Good.

> My program requires utf8 input, but the user was giving it iso-8859.

There's no easy way to fix the user. :)
Re^2: i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode' by bcrowell2 (Friar) on Feb 25, 2008 at 04:43 UTC
Okay, here's what I came up with to test whether a file is valid utf8. I'm sure there's also some way to do this using a cpan module.

    sub file_is_valid_utf8 {
        my $f = shift;
        open(my $fh, "<:raw", $f) or return 0;
        local $/;            # slurp mode: read the whole file at once
        my $x = <$fh>;
        close $fh;
        return is_valid_utf8($x);
    }

    # What's passed to this routine has to be a stream of bytes, not a utf8
    # string in which the characters are complete utf8 characters.
    # That's why you typically want to call file_is_valid_utf8 rather than
    # calling this directly.
    # Note: this checks the byte structure only; it does not reject
    # overlong forms or surrogates.
    sub is_valid_utf8 {
        my $x = shift;
        my $leading0     = '[\x{0}-\x{7f}]';    # 0xxxxxxx: ASCII
        my $leading10    = '[\x{80}-\x{bf}]';   # 10xxxxxx: continuation byte
        my $leading110   = '[\x{c0}-\x{df}]';   # 110xxxxx: starts 2-byte sequence
        my $leading1110  = '[\x{e0}-\x{ef}]';   # 1110xxxx: starts 3-byte sequence
        my $leading11110 = '[\x{f0}-\x{f7}]';   # 11110xxx: starts 4-byte sequence
        my $utf8 = "(?:$leading0|(?:$leading110$leading10)|(?:$leading1110$leading10$leading10)|(?:$leading11110$leading10$leading10$leading10))*";
        return ($x =~ /\A$utf8\z/);
    }
Re^3: i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode' by Juerd (Abbot) on Feb 25, 2008 at 15:04 UTC
If you have the raw bytestring, the easiest way to see if it's valid UTF-8 is to decode it to a unicode string. If that fails, it wasn't utf8 enough :)

`utf8::decode($string) or die "Input is not valid UTF-8";`

or

`utf8::decode(my $text = $binary) or die "Input is not valid UTF-8";`

If you leave out the "or die" clause, any invalid UTF-8 will just be seen as ISO-8859-1.

Update: changed the examples as per ikegami's sound response.
Re^4: i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode' by ikegami (Patriarch) on Feb 25, 2008 at 17:40 UTC
That should be

`utf8::decode(my $text = $binary) or die "Input is not valid UTF-8";`

It works in-place.
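The in-place behavior can be wrapped so that the caller's bytes stay untouched. The following is only a sketch: the helper name `decode_utf8_or_undef` is hypothetical, and `utf8::decode` accepts Perl's lax, extended notion of UTF-8. For a strict check, `Encode::decode('UTF-8', $bytes, Encode::FB_CROAK)` is an alternative.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the decoded text, or undef if the bytes are
# not well-formed UTF-8. The caller's byte string is left untouched.
sub decode_utf8_or_undef {
    my ($bytes) = @_;
    my $text = $bytes;    # copy, because utf8::decode works in place
    return utf8::decode($text) ? $text : undef;
}

my $utf8_bytes   = "r\xC3\xB8d";    # "rød" encoded as UTF-8
my $latin1_bytes = "r\xF8d";        # "rød" encoded as ISO-8859-1

for my $input ($utf8_bytes, $latin1_bytes) {
    my $text = decode_utf8_or_undef($input);
    print defined($text)
        ? "valid UTF-8, " . length($text) . " characters\n"
        : "not valid UTF-8\n";
}
```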
Re^2: i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode' by shagbark (Acolyte) on Oct 22, 2014 at 01:31 UTC
~~question about why not to use :utf8 that was answered above~~ ... anybody know how I can delete this comment?

In Section Seekers of Perl Wisdom