Re: The Queensrÿche Situation
by aitap (Curate) on Oct 19, 2014 at 19:00 UTC
|
You didn't use binmode to apply an IO layer that encodes the Unicode characters you print to STDOUT, nor did you encode them manually. When Perl encounters characters where it expects bytes (on any IO handle), it applies a heuristic to turn the former into the latter: if every character in the string fits in latin1, the string is (silently!) written as latin1; otherwise it is written as utf8 (with a warning):
$ perl -w -Mutf8 -E'say "ы"; say "ÿ";'
Wide character in say at -e line 1.
ы
�
(my terminal is utf-8)
And when you use utf8, Perl decodes the utf8 bytes of your string literals into characters for you. Encode::decode does the same for byte strings at run time.
Does adding binmode STDOUT, ":utf8"; fix your problem? You can also use :encoding(...) IOLayers to encode into other encodings.
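For the "encode them manually" route, here is a minimal sketch using Encode::encode. The variable names are only for illustration, and "\x{FF}" stands in for a literal ÿ so the snippet does not depend on how the source file is saved.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character: U+00FF (LATIN SMALL LETTER Y WITH DIAERESIS).
my $char = "\x{FF}";

# Encode the character into the bytes a UTF-8 terminal expects.
my $bytes = encode('UTF-8', $char);

# Hex dump of the encoded bytes.
my $hex = join ' ', map { sprintf '%02x', ord } split //, $bytes;

printf "characters: %d, bytes: %d (%s)\n", length($char), length($bytes), $hex;
# prints "characters: 1, bytes: 2 (c3 bf)"
```

Printing $bytes to a handle without an encoding layer sends exactly those two bytes. With binmode STDOUT, ":utf8" you would print $char instead and let the layer do the encoding.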
|
#!/usr/bin/perl
use strict;
use Encode;
use Text::Unaccent::PurePerl;
binmode STDOUT, ":utf8";
use utf8;
my $string = "Queensrÿche";
no utf8;
chars($string);
(Encode::is_utf8($string))? print "this is utf8\n" : print "this is NOT utf8\n";
print "$string\n";
print "unaccented: " . Text::Unaccent::PurePerl::unac_string($string) . "\n";
exit;

sub chars {
    my $k = shift;
    my @chars = split("",$k);
    foreach (@chars) {
        my $dec = ord($_);
        my $chr = chr(ord($_));
        my $q = qquote($_);
        print "\t$dec\t$chr\t$q\n";
    }
}

sub qquote {
    local($_) = shift;
    s/([\\\"\@\$])/\\$1/g;
    my $bytes; { use bytes; $bytes = length }
    s/([[:^ascii:]])/'\x{'.sprintf("%x",ord($1)).'}'/ge if $bytes > length;
    return $_;
}
Why does that produce this:
81 Q Q
117 u u
101 e e
101 e e
110 n n
115 s s
114 r r
255 ÿ \x{ff}
99 c c
104 h h
101 e e
this is utf8
Queensrÿche
unaccented: Queensryche
Is that actually valid utf-8? Shouldn't the ÿ be two bytes (decimal 195 191)? Like this:
81 Q Q
117 u u
101 e e
101 e e
110 n n
115 s s
114 r r
195 - \x{c3}
191 - \x{bf}
99 c c
104 h h
101 e e
|
When you work with Unicode, you should see character codes (code points, which can be greater than 255), not byte sequences, because Perl hides the encoding from you. For example,
use utf8;
binmode STDOUT, ":utf8";
my $string = "Queensrÿche ы";
printf "%x\t%s\n", ord($_), $_ for split "", $string;
__END__
51 Q
75 u
65 e
65 e
6e n
73 s
72 r
ff ÿ
63 c
68 h
65 e
20
44b ы
If you need to work with utf-8 bytes, encode them back:
use utf8;
use Encode 'encode';
binmode STDOUT, ":utf8";
my $string = "Queensrÿche ы";
printf "%x\t%s\n", ord($_), $_ for split "", encode utf8 => $string;
__END__
51 Q
75 u
65 e
65 e
6e n
73 s
72 r
c3 Ã
bf ¿
63 c
68 h
65 e
20
d1 Ñ
8b
But there would be no point in using utf8 and Encode in this case.
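Going the other way, if what you have is UTF-8 bytes (read from a raw file, say), Encode::decode turns them into characters. A minimal sketch, with the bytes written out as escapes and variable names chosen only for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The UTF-8 bytes for "Queensrÿche": ÿ is the byte pair C3 BF.
my $bytes = "Queensr\xC3\xBFche";

# Decode the bytes into a string of characters (code points).
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
printf "8th character: U+%04X\n", ord(substr($chars, 7, 1));

# Encoding the characters again gives back the original bytes.
my $round_trip = encode('UTF-8', $chars);
print "round trip ok\n" if $round_trip eq $bytes;
```

The byte string has 12 elements and the decoded string has 11, because the two bytes C3 BF collapse into the single code point U+00FF.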
|
"Yes! That fixes the printing problem in my terminal!"
That's nice. But just to add a little bit of confusion, please see this:
A One-Liner prints it out as expected:
karl$ perl -e 'print qq(Queensrÿche\n)'
Queensrÿche
But please see what happens when I put the stuff into a script (in the same terminal session):
#!/usr/bin/env perl
use strict;
use warnings;
binmode STDOUT, ":utf8";
my $string = qq(Queensrÿche);
print qq($string\n);
my $y_with_trema = qq(\N{LATIN SMALL LETTER Y WITH DIAERESIS});
print qq($y_with_trema\n);
$string = qq(Queensr) . $y_with_trema . qq(che);
print qq($string\n);
__END__
karls-mac-mini:monks karl$ ./roadster001.pl
Queensrÿche
ÿ
Queensrÿche
Seems like things are getting weird. I wonder when I will ever understand this crap.
N.B.: I came in a bit late and didn't read all the posts yet.
Best regards, Karl
«The Crux of the Biscuit is the Apostrophe»
|
I figured it out, sort of. The first is actually extended ASCII, i.e. Latin-1 (255 maps to "ÿ"): http://www.ascii-code.com
So, when I take the string "Queensrÿche" (which IS actually encoded as utf-8) for example:
Decimal Char escaped
81 Q Q
117 u u
101 e e
101 e e
110 n n
115 s s
114 r r
195 - \x{c3}
191 - \x{bf}
99 c c
104 h h
101 e e
It is now printing on my terminal like this:
Queensrÿche
This makes sense, in a way, now because 195 maps to "Ã" and 191 maps to "¿". So, now my question is, why isn't this mapping using a utf-8 table (instead of ascii)? Encode thinks the string is utf-8 (which I assume means the utf-8 flag is on).
Re: The Queensrÿche Situation
by Jim (Curate) on Oct 19, 2014 at 20:39 UTC
|
If you only have to deal with Unicode—and you properly should only have to deal with Unicode in this millennium—then use the Unicode collation algorithm instead of something non-standard. In Perl, this means using Unicode::Collate. Both the Unicode collation algorithm and the Perl CPAN module Unicode::Collate are customizable.
use strict;
use warnings;
# This Perl script is Unicode UTF-8
use utf8;
# Proper Unicode collation
use Unicode::Collate;
# The output of this Perl script is Unicode UTF-8
binmode STDOUT, ':encoding(UTF-8)';
my $fancy = 'Queensrÿche';
my $plain = 'Queensryche';
my $collator = Unicode::Collate->new(
    level         => 1,
    normalization => undef,
);
# This prints "Queensrÿche and Queensryche are the same word."
printf "$fancy and $plain %s the same word.\n",
    $collator->eq($fancy, $plain) ? "are" : "aren't";
exit 0;
As it says in the script, this correctly prints "Queensrÿche and Queensryche are the same word." Whether or not this is exactly what's displayed in your terminal window is another matter altogether—one that's not related to the Perl script.
See Perl Unicode Cookbook: Case- and Accent-insensitive Comparison by Tom Christiansen (tchrist).
Update: By the way, in this same configuration of Unicode::Collate, the strings "QUEENSRŸCHE" and "Queensryche" will compare equal as well.
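A small sketch to check both claims with the same collator configuration. The pair list and variable names are only for illustration, and the non-ASCII letters are written as escapes so the snippet does not depend on the file's encoding:

```perl
use strict;
use warnings;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# Level 1 compares base letters only, ignoring accents and case.
my $collator = Unicode::Collate->new(level => 1, normalization => undef);

my @pairs = (
    [ "Queensr\x{FF}che",  'Queensryche' ],   # y with diaeresis vs plain y
    [ "QUEENSR\x{178}CHE", 'Queensryche' ],   # upper case, with diaeresis
    [ 'Queensryche',       'Queensreich' ],   # different base letters
);

# 1 where the collator considers the two strings equal, 0 otherwise.
my @results = map { $collator->eq(@$_) ? 1 : 0 } @pairs;

for my $i (0 .. $#pairs) {
    printf "%s vs %s: %s\n", @{ $pairs[$i] }, $results[$i] ? 'equal' : 'different';
}
```

The first two pairs should compare equal and the third should not, since at level 1 only a difference in base letters counts.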
Re: The Queensrÿche Situation
by ikegami (Patriarch) on Oct 19, 2014 at 22:49 UTC
|
Why is the "ÿ" not printing correctly here in my terminal?
Your terminal expects UTF-8. You printed chr(0xFF), which is not the UTF-8 encoding of "ÿ".
You can encode it yourself, or you can ask Perl to do it using the following:
use open ':std', ':encoding(UTF-8)';
ord() returns 255 for ÿ, a single byte. Encode thinks this is utf-8, but isn't this actually utf-16?
It's not UTF-8 (which would be C3 BF). is_utf8($string) does not indicate whether $string contains UTF-8.
It's not UTF-16 (which would be 00 FF or FF 00 depending on endianness).
Decoding a string (as use utf8; does for literals) results in a string of Unicode Code Points ("ÿ" is U+00FF).
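To see the difference between the code point and its various encodings, here is a minimal sketch that encodes U+00FF four ways with Encode::encode. The hash name and the choice of encodings are only for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{FF}";   # the code point U+00FF

# Hex dump of the encoded bytes, keyed by encoding name.
my %hex_for;
for my $encoding (qw(ISO-8859-1 UTF-8 UTF-16BE UTF-16LE)) {
    my $bytes = encode($encoding, $char);
    $hex_for{$encoding} = join ' ', map { sprintf '%02X', ord } split //, $bytes;
    printf "%-10s %s\n", $encoding, $hex_for{$encoding};
}
```

This prints FF for ISO-8859-1, C3 BF for UTF-8, 00 FF for UTF-16BE and FF 00 for UTF-16LE. The value 255 you got from ord() is the code point itself, not any of these byte sequences.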
This actually looks like valid UTF-8 to me and Encode agrees. Is that correct?
That is the UTF-8 encoding of "Queensrÿche", though it is incorrect to say that is_utf8 signifies that Encode agrees.
Text::Unaccent::PurePerl does not "unaccent" it properly. Why not?
Tools that work with text (such as regular expressions and Text::Unaccent::PurePerl) usually expect the text to be provided as strings of Unicode Code Points, not encoded using UTF-8.
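A minimal sketch of why that matters, using a plain regular expression rather than Text::Unaccent::PurePerl. The pattern and variable names are only for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "Queensr\xC3\xBFche";       # UTF-8 encoded, as read from a raw handle
my $chars = decode('UTF-8', $bytes);    # decoded: one element per character

# "." matches exactly one element of the string.
my $pattern = qr/^Queensr.che$/;

my $bytes_match = $bytes =~ $pattern ? 1 : 0;   # the accented letter is two elements here
my $chars_match = $chars =~ $pattern ? 1 : 0;   # the accented letter is one element here

print "encoded bytes match: $bytes_match\n";    # prints 0
print "decoded chars match: $chars_match\n";    # prints 1
```

The same pattern fails on the encoded string because the regex engine sees C3 and BF as two separate characters.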
Is there a way to safely convert them to the same encoding?
The aforementioned
use open ':std', ':encoding(UTF-8)';
will also tell Perl to decode bytes read from file handles.
use utf8;
use open ':std', ':encoding(UTF-8)';
use JSON::XS qw( decode_json encode_json );
my $s = "Queensrÿche";
printf("U+%v04X %s\n", $s, $s);
{
    # Uses encoding specified by "use open".
    open(my $fh, '>', 'foo.txt') or die $!;
    print($fh "$s\n");
}
{
    # Uses encoding specified by "use open".
    open(my $fh, '<', 'foo.txt') or die $!;
    chomp( my $got = <$fh> );
    printf("U+%v04X %s\n", $got, $got);
}
{
    # :raw overrides default encoding specified above
    # since encode_json already encodes using UTF-8
    open(my $fh, '>:raw', 'foo.json') or die $!;
    print($fh encode_json( { text => $s } ));
}
{
    my $json = do {
        # Similarly, decode_json expects UTF-8.
        open(my $fh, '<:raw', 'foo.json') or die $!;
        local $/;
        <$fh>
    };
    my $got = decode_json($json)->{text};
    printf("U+%v04X %s\n", $got, $got);
}
|
Got it. So, "is_utf8" just tells us that the utf-8 flag is set?
|
Exactly. It merely says which internal storage format is used. It's only useful for debugging XS modules, if at all.
(Added plain text example to the program in my earlier post.)
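A minimal sketch of that point: the same one-character string stored both ways. The variable names are only for illustration; utf8::upgrade, utf8::downgrade and utf8::is_utf8 are built in and need no use line.

```perl
use strict;
use warnings;

my $downgraded = "\x{FF}";
utf8::downgrade($downgraded);   # stored internally as the single byte FF, flag off

my $upgraded = "\x{FF}";
utf8::upgrade($upgraded);       # same character, stored internally as UTF-8, flag on

printf "flag:   %d vs %d\n", utf8::is_utf8($downgraded) ? 1 : 0,
                             utf8::is_utf8($upgraded)   ? 1 : 0;
printf "length: %d vs %d\n", length($downgraded), length($upgraded);
print  "strings are equal\n" if $downgraded eq $upgraded;
```

The flag differs (0 vs 1), but both strings have length 1 and compare equal, so the flag says nothing about what the string contains.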
Re: The Queensrÿche Situation
by LanX (Saint) on Oct 19, 2014 at 18:06 UTC
|
Many questions, but I'd be surprised if the default font of your terminal supported a fictitious° character like ÿ.
See also Metal Umlaut! :)
Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
°) well maybe not fictitious but very rare. But the Latin 1 code is 255 which answers another question.
update
Btw, it's not an umlaut!
In German it's a medieval handwriting ligature of ij, a diphthong still found in Dutch (see rijk); those sounds are written ei in modern German (see Reich).
In French trema accents are used to pronounce adjacent vowels separately (see Citroën or naïve). English imported some of them.
|
The Dutch ĳ is still regarded as a single syllable, but written as ij. Even in official documents the ĳ has been banned. I bet, however, that every Dutch person will have no trouble reading the ĳ when ij was meant and vice versa.
I think that many of you won't even see the difference in your browser (unless of course ĳ is not represented in your font).
Enjoy, Have FUN! H.Merijn
|
Sorry for the confusing nature of this post. I suppose it really just comes down to this. Which of these are utf8?
81 Q Q
117 u u
101 e e
101 e e
110 n n
115 s s
114 r r
195 {c3}
191 {bf}
99 c c
104 h h
101 e e

81 Q Q
117 u u
101 e e
101 e e
110 n n
115 s s
114 r r
255 {ff}
99 c c
104 h h
101 e e
|
Re: The Queensrÿche Situation
by Jim (Curate) on Oct 19, 2014 at 22:17 UTC
|
I highly recommend using these two companion applications when working with Unicode text as well as text in other vendor and national character sets (so-called "legacy" character encodings): BabelMap (Unicode character map for Windows) and BabelPad (Unicode text editor for Windows). They're both extraordinarily helpful when getting down 'n' dirty with Unicode.