PerlMonks  

The Queensrÿche Situation

by Rodster001 (Pilgrim)
on Oct 19, 2014 at 17:46 UTC (#1104325=perlquestion)
Rodster001 has asked for the wisdom of the Perl Monks concerning the following question:

It seems I have read every page on character encoding I can find. But I am missing something. I still have a bit of confusion which I hope can get cleared up here.

#!/usr/bin/perl
use strict;
use Encode;
use Text::Unaccent::PurePerl qw(unac_string);

use utf8;
my $string = "Queensrÿche";
no utf8;

chars($string);
(Encode::is_utf8($string)) ? print " - this is utf8\n" : print " - this is NOT utf8\n";
print "unaccented: " . Text::Unaccent::PurePerl::unac_string($string) . "\n";
print $string;
exit;

sub chars {
    my $k = shift;
    my @chars = split("", $k);
    foreach (@chars) {
        my $dec    = ord($_);
        my $chr    = chr(ord($_));
        my $escape = qquote($_);
        print "\t$dec\t$chr\t$escape\n";
    }
}

sub qquote {
    local($_) = shift;
    s/([\\\"\@\$])/\\$1/g;
    my $bytes; { use bytes; $bytes = length }
    s/([[:^ascii:]])/'\x{'.sprintf("%x",ord($1)).'}'/ge if $bytes > length;
    return $_;
}

This is what I am seeing in my terminal (I am using SecureCRT, with Terminal > Appearance > Character encoding: UTF-8):

    81      Q       Q
    117     u       u
    101     e       e
    101     e       e
    110     n       n
    115     s       s
    114     r       r
    255             {ff}
    99      c       c
    104     h       h
    101     e       e
 - this is utf8
unaccented: Queensryche
Queensr

Here are my questions about this:
  1. Why is the "ÿ" not printing correctly here in my terminal?
  2. ord() returns 255 for ÿ, a single byte. Encode thinks this is utf-8, but isn't this actually utf-16?
    Utf-16 table: http://asecuritysite.com/coding/asc2
I have another version of "Queensrÿche" (in a JSON file). When I parse that and run it through the same thing, this is what I get:

    81      Q       Q
    117     u       u
    101     e       e
    101     e       e
    110     n       n
    115     s       s
    114     r       r
    195             {c3}
    191             {bf}
    99      c       c
    104     h       h
    101     e       e
 - this is utf8
unaccented: QueensrA
Queensrÿche

This is where the deep confusion is for me.
  1. This actually looks like valid UTF-8 to me and Encode agrees. Is that correct?
    ord() returns two bytes for ÿ, 195 and 191, which matches up with this table:
    Utf-8 table: http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec
  2. Text::Unaccent::PurePerl does not "unaccent" it properly. Why not?
Finally. Since these two strings cannot be compared and matched as being the same (which I understand why) I need to "normalize" them.
  1. Is #1 Queensrÿche or #2 Queensrÿche actually utf-8? (Or are they both actually utf-8 as Encode believes?)
  2. Is there a way to safely convert them to the same encoding? I would like to preserve the ÿ but I would also like to be able to use Text::Unaccent::PurePerl.
Thanks!

Update #1:
--------------------------------------

Taking out the "use utf8" and "no utf8":

#use utf8;
my $string = "Queensrÿche";
#no utf8;

And then running it again:

    81      Q       Q
    117     u       u
    101     e       e
    101     e       e
    110     n       n
    115     s       s
    114     r       r
    195
    191
    99      c       c
    104     h       h
    101     e       e
 - this is NOT utf8
unaccented: QueensrA
Queensrÿche

This confuses me even more. I understand the utf8 flag is not set now, so Encode doesn't see it as utf8. But I see the two utf-8 bytes for the "ÿ" are there (195 191) instead of 255 when using "use utf8". It prints correctly (and displays in my terminal properly) but does not unaccent correctly. Much confusion.

Replies are listed 'Best First'.
Re: The Queensrÿche Situation
by aitap (Deacon) on Oct 19, 2014 at 19:00 UTC

    You didn't use binmode to apply an I/O layer that encodes the Unicode characters you print to STDOUT, nor did you encode them manually. When Perl encounters characters where it expects bytes (in any I/O), it applies some heuristics to translate the former to the latter. Usually this means that whatever can be translated to Latin-1 gets (silently!) translated, and everything else is printed as UTF-8 (with a warning):

    $ perl -w -Mutf8 -E'say "ы"; say "ÿ";'
    Wide character in say at -e line 1.
    ы
    �
    
    (my terminal is utf-8)

    And when you use utf8, Perl decodes utf8 byte string literals into characters for you. The same is done by Encode::decode.

    Does adding binmode STDOUT, ":utf8"; fix your problem? You can also use :encoding(...) IOLayers to encode into other encodings.
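
    To make the effect of an output layer visible, here is a minimal sketch that writes the same one-character string through two in-memory filehandles and dumps the bytes that arrive. The helper name bytes_written is made up for this illustration; only core Perl is assumed.

```perl
use strict;
use warnings;

# One character: U+00FF LATIN SMALL LETTER Y WITH DIAERESIS.
my $char = "\x{FF}";

# Hypothetical helper: print $text through a handle opened with the given
# layer, then return the bytes that were written as space-separated hex.
sub bytes_written {
    my ($layer, $text) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer) or die "Cannot open in-memory handle: $!";
    print {$fh} $text;
    close($fh) or die "Cannot close in-memory handle: $!";
    return join ' ', map { sprintf '%02X', ord($_) } split //, $buffer;
}

my $without_layer = bytes_written('', $char);                  # no encoding layer
my $with_layer    = bytes_written(':encoding(UTF-8)', $char);  # encode on output

print "no layer:          $without_layer\n";   # FF
print ":encoding(UTF-8):  $with_layer\n";      # C3 BF
```

    A UTF-8 terminal only understands the second form, which is why the unlayered print shows up as a replacement character.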

      Yes! That fixes the printing problem in my terminal. And this makes complete sense now. Thank you for clearing this up!

      One problem remains that I still don't quite understand.

      #!/usr/bin/perl
      use strict;
      use Encode;
      use Text::Unaccent::PurePerl;

      binmode STDOUT, ":utf8";

      use utf8;
      my $string = "Queensrÿche";
      no utf8;

      chars($string);
      (Encode::is_utf8($string)) ? print "this is utf8\n" : print "this is NOT utf8\n";
      print "$string\n";
      print "unaccented: " . Text::Unaccent::PurePerl::unac_string($string) . "\n";
      exit;

      sub chars {
          my $k = shift;
          my @chars = split("", $k);
          foreach (@chars) {
              my $dec = ord($_);
              my $chr = chr(ord($_));
              my $q   = qquote($_);
              print "\t$dec\t$chr\t$q\n";
          }
      }

      sub qquote {
          local($_) = shift;
          s/([\\\"\@\$])/\\$1/g;
          my $bytes; { use bytes; $bytes = length }
          s/([[:^ascii:]])/'\x{'.sprintf("%x",ord($1)).'}'/ge if $bytes > length;
          return $_;
      }

      Why does that produce this:

          81      Q       Q
          117     u       u
          101     e       e
          101     e       e
          110     n       n
          115     s       s
          114     r       r
          255     ÿ       \x{ff}
          99      c       c
          104     h       h
          101     e       e
      this is utf8
      Queensrÿche
      unaccented: Queensryche

      Is that actually valid utf-8? Shouldn't the ÿ be two bytes (decimal 195 191)? Like this:

          81      Q       Q
          117     u       u
          101     e       e
          101     e       e
          110     n       n
          115     s       s
          114     r       r
          195     -       \x{c3}
          191     -       \x{bf}
          99      c       c
          104     h       h
          101     e       e

        When you work with Unicode, you should get character codes (which can be 255 or greater), not byte sequences, because Perl encapsulates encodings for you. For example,

        use utf8;
        binmode STDOUT, ":utf8";
        my $string = "Queensrÿche ы";
        printf "%x\t%s\n", ord($_), $_ for split "", $string;
        __END__
        51      Q
        75      u
        65      e
        65      e
        6e      n
        73      s
        72      r
        ff      ÿ
        63      c
        68      h
        65      e
        20       
        44b     ы
        

        If you need to work with utf-8 bytes, encode them back:

        use utf8;
        use Encode 'encode';
        binmode STDOUT, ":utf8";
        my $string = "Queensrÿche ы";
        printf "%x\t%s\n", ord($_), $_ for split "", encode utf8 => $string;
        __END__
        51      Q
        75      u
        65      e
        65      e
        6e      n
        73      s
        72      r
        c3      Ã
        bf      ¿
        63      c
        68      h
        65      e
        20       
        d1      Ñ
        8b
        
        But there would be no point in using utf8 and Encode in this case.

        "Yes! That fixes the printing problem in my terminal!"

        That's nice. But just to add a little bit of confusion, please see this:

        A One-Liner prints it out as expected:

        karl$ perl -e 'print qq(Queensrÿche\n)'
        Queensrÿche

        But please see what happens when I put the stuff into a script (in the same terminal session):

        #!/usr/bin/env perl

        use strict;
        use warnings;

        binmode STDOUT, ":utf8";

        my $string = qq(Queensrÿche);
        print qq($string\n);

        my $y_with_trema = qq(\N{LATIN SMALL LETTER Y WITH DIAERESIS});
        print qq($y_with_trema\n);

        $string = qq(Queensr) . $y_with_trema . qq(che);
        print qq($string\n);

        __END__

        karls-mac-mini:monks karl$ ./roadster001.pl
        QueensrÃ¿che
        ÿ
        Queensrÿche

        Seems like things are getting weird. I wonder when I will ever understand this crap.

        N.B.: I came in a bit late and didn't read all the posts yet.

        Best regards, Karl

        «The Crux of the Biscuit is the Apostrophe»

        I figured it out, sort of. The first is actually ascii (255 maps to "ÿ"): http://www.ascii-code.com

        So, when I take the string "Queensrÿche" (which IS actually encoded as utf-8) for example:

            Decimal Char    escaped
            81      Q       Q
            117     u       u
            101     e       e
            101     e       e
            110     n       n
            115     s       s
            114     r       r
            195     -       \x{c3}
            191     -       \x{bf}
            99      c       c
            104     h       h
            101     e       e

        It is now printing on my terminal like this:

            QueensrÃ¿che

        This makes sense, in a way, now because 195 maps to "Ã" and 191 maps to "¿". So, now my question is, why isn't this mapping using a utf-8 table (instead of ascii)? Encode thinks the string is utf-8 (which I assume means the utf-8 flag is on).
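
        One way to see where the two extra glyphs come from is to encode the undecoded bytes a second time, which is roughly what an output layer does to a string that was never decoded. This is only a sketch under that assumption; the variable names are made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The two UTF-8 bytes for U+00FF, as they arrive from a file that was
# read without any decoding.
my $bytes = "\xC3\xBF";

# Encoding the undecoded string treats each byte as a character
# (U+00C3 and U+00BF), so each one becomes two bytes on output.
my $double_encoded = encode('UTF-8', $bytes);

# Decoding first yields the single character U+00FF, which encodes
# back to the original two bytes.
my $decoded    = decode('UTF-8', $bytes);
my $re_encoded = encode('UTF-8', $decoded);

printf "undecoded, then encoded: %vX\n", $double_encoded;   # C3.83.C2.BF
printf "decoded: %d character, U+%04X\n", length($decoded), ord($decoded);
printf "decoded, then encoded:   %vX\n", $re_encoded;       # C3.BF
```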
Re: The Queensrÿche Situation
by Jim (Curate) on Oct 19, 2014 at 20:39 UTC

    If you only have to deal with Unicode—and you properly should only have to deal with Unicode in this millennium—then use the Unicode collation algorithm instead of something non-standard. In Perl, this means using Unicode::Collate. Both the Unicode collation algorithm and the Perl CPAN module Unicode::Collate are customizable.

    use strict;
    use warnings;

    # This Perl script is Unicode UTF-8
    use utf8;

    # Proper Unicode collation
    use Unicode::Collate;

    # The output of this Perl script is Unicode UTF-8
    binmode STDOUT, ':encoding(UTF-8)';

    my $fancy = 'Queensrÿche';
    my $plain = 'Queensryche';

    my $collator = Unicode::Collate->new(
        level         => 1,
        normalization => undef,
    );

    # This prints "Queensrÿche and Queensryche are the same word."
    printf "$fancy and $plain %s the same word.\n",
        $collator->eq($fancy, $plain) ? "are" : "aren't";

    exit 0;

    As it says in the script, this correctly prints "Queensrÿche and Queensryche are the same word." Whether or not this is exactly what's displayed in your terminal window is another matter altogether—one that's not related to the Perl script.

    See Perl Unicode Cookbook: Case- and Accent-insensitive Comparison by Tom Christiansen (tchrist).

    Update:  By the way, in this same configuration of Unicode::Collate, the strings "QUEENSRŸCHE" and "Queensryche" will compare equal as well.
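
    A collator compares characters, so a string that arrives as UTF-8 bytes (for example from a JSON file) would need decoding first. The following is a rough sketch of that, not a definitive recipe. As an assumption for the illustration it leaves normalization at the module default rather than undef, and the variable names are invented.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Collate;

# Same word from two sources: a decoded literal, and raw UTF-8 bytes.
my $from_source = "Queensr\x{FF}che";
my $raw_bytes   = "Queensr\xC3\xBFche";
my $from_file   = decode('UTF-8', $raw_bytes);

# Level 1 compares base letters only (accents and case are ignored).
my $collator = Unicode::Collate->new(level => 1);

my $same_accented     = $collator->eq($from_source, $from_file);
my $same_plain        = $collator->eq($from_file, 'Queensryche');
my $undecoded_matches = $collator->eq($raw_bytes, 'Queensryche');

printf "decoded vs literal:   %s\n", $same_accented     ? 'equal' : 'different';
printf "decoded vs plain:     %s\n", $same_plain        ? 'equal' : 'different';
printf "undecoded vs plain:   %s\n", $undecoded_matches ? 'equal' : 'different';
```

    The undecoded bytes are seen as the two characters U+00C3 and U+00BF, so they do not collate as a "y".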

Re: The Queensrÿche Situation
by ikegami (Pope) on Oct 19, 2014 at 22:49 UTC

    Why is the "ÿ" not printing correctly here in my terminal?

    Your terminal expects UTF-8. You printed chr(0xFF), which is not the UTF-8 encoding of "ÿ".

    You can encode it yourself, or you can ask Perl to do it using the following:

    use open ':std', ':encoding(UTF-8)';

    ord() returns 255 for ÿ, a single byte. Encode thinks this is utf-8, but isn't this actually utf-16?

    It's not UTF-8 (which would be C3 BF). is_utf8($string) does not indicate whether $string contains UTF-8.

    It's not UTF-16 (which would be 00 FF or FF 00 depending on endianness).
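
    The byte sequences mentioned above can be checked with Encode. This is a small sketch; the %hex hash is just a name chosen for the illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The code point in question: U+00FF.
my $char = "\x{FF}";

# Encode the same character several ways and record the bytes as hex.
my %hex;
for my $encoding (qw(UTF-8 UTF-16BE UTF-16LE ISO-8859-1)) {
    my $bytes = encode($encoding, $char);
    $hex{$encoding} = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

printf "%-10s %s\n", $_, $hex{$_} for sort keys %hex;
```

    Only the ISO-8859-1 (Latin-1) encoding is the single byte FF, which is why ord() reporting 255 says nothing about UTF-8 or UTF-16.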

    Decoding a string (as use utf8; does for literals) results in Unicode Code Points ("ÿ" is U+00FF).

    This actually looks like valid UTF-8 to me and Encode agrees. Is that correct?

    That is the UTF-8 encoding of "Queensrÿche", though it is incorrect to say that is_utf8 signifies that Encode agrees.

    Text::Unaccent::PurePerl does not "unaccent" it properly. Why not?

    Tools that work with text (such as regular expressions and Text::Unaccent::PurePerl) usually expect the text to be provided as strings of Unicode Code Points, not encoded using UTF-8.
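
    To illustrate why the input form matters, here is a sketch that uses a stand-in for unac_string built on Unicode::Normalize (decompose, then drop combining marks). It is not how Text::Unaccent::PurePerl is implemented; strip_marks is a hypothetical helper that only shows the same failure mode with core modules.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFD);

# Hypothetical stand-in for an unaccenting routine: canonical
# decomposition followed by removal of nonspacing marks.
sub strip_marks {
    my ($text) = @_;
    (my $stripped = NFD($text)) =~ s/\p{Mn}+//g;
    return $stripped;
}

# UTF-8 bytes, e.g. as read raw from a JSON file.
my $json_bytes = "Queensr\xC3\xBFche";

# Undecoded: the two bytes are treated as the characters U+00C3 and U+00BF.
my $wrong = strip_marks($json_bytes);

# Decoded first: the two bytes become the single character U+00FF.
my $right = strip_marks(decode('UTF-8', $json_bytes));

printf "undecoded: U+%vX\n", $wrong;
printf "decoded:   %s\n",    $right;   # Queensryche
```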

    Is there a way to safely convert them to the same encoding?

    The aforementioned

    use open ':std', ':encoding(UTF-8)';

    will also tell Perl to decode bytes read from file handles.

    use utf8;
    use open ':std', ':encoding(UTF-8)';
    use JSON::XS qw( decode_json encode_json );

    my $s = "Queensrÿche";
    printf("U+%v04X %s\n", $s, $s);

    {
        # Uses encoding specified by "use open".
        open(my $fh, '>', 'foo.txt') or die $!;
        print($fh "$s\n");
    }

    {
        # Uses encoding specified by "use open".
        open(my $fh, '<', 'foo.txt') or die $!;
        chomp( my $got = <$fh> );
        printf("U+%v04X %s\n", $got, $got);
    }

    {
        # :raw overrides default encoding specified above
        # since encode_json already encodes using UTF-8
        open(my $fh, '>:raw', 'foo.json') or die $!;
        print($fh encode_json( { text => $s } ));
    }

    {
        my $json = do {
            # Similarly, decode_json expects UTF-8.
            open(my $fh, '<:raw', 'foo.json') or die $!;
            local $/;
            <$fh>
        };
        my $got = decode_json($json)->{text};
        printf("U+%v04X %s\n", $got, $got);
    }

      Got it. So, "is_utf8" just tells us that the utf-8 flag is set?

        Exactly. It merely says which internal storage format is used. It's only useful for debugging XS modules, if at all.

        (Added plain text example to the program in my earlier post.)
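
        A short sketch of that point: the flag can differ between two strings that Perl considers identical. utf8::upgrade and utf8::is_utf8 are core functions; the variable names are only for illustration.

```perl
use strict;
use warnings;

# Eleven characters, all below U+0100, so Perl can store them one byte each.
my $downgraded = "Queensr\x{FF}che";

# Same characters, but forced into the internal UTF-8 storage format.
my $upgraded = $downgraded;
utf8::upgrade($upgraded);

printf "flag on original: %d\n", utf8::is_utf8($downgraded) ? 1 : 0;   # 0
printf "flag on copy:     %d\n", utf8::is_utf8($upgraded)   ? 1 : 0;   # 1
printf "equal as strings: %s\n", $downgraded eq $upgraded ? 'yes' : 'no';
printf "same length:      %s\n", length($downgraded) == length($upgraded) ? 'yes' : 'no';
```

        Both strings hold the code point U+00FF; neither is "UTF-8" or "not UTF-8" in the sense of the original question.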

Re: The Queensrÿche Situation
by LanX (Bishop) on Oct 19, 2014 at 18:06 UTC
    Many questions, but I'd be surprised if the default font of your terminal supported a fictitious° character like ÿ.

    See also Metal Umlaut! :)

    Cheers Rolf

    (addicted to the Perl Programming Language and ☆☆☆☆ :)

    °) well, maybe not fictitious but very rare. But the Latin-1 code is 255, which answers another question.

    update

    Btw it's not an umlaut!

    In German it's a medieval handwriting ligature of ij, a diphthong still found in Dutch (see rijk); those sounds are written ei in modern German (see Reich).

    In French, trema accents are used to pronounce adjacent vowels separately (see Citroën or naïve). English imported some of them.

      The Dutch ĳ is still regarded as a single syllable, but written as ij. Even in official documents the ĳ has been banned. I however bet that every Dutch person will have no trouble reading the ij when ĳ was meant and vice versa.

      I think that many of you won't even see the difference in your browser (unless of course ĳ is not represented in your font).


      Enjoy, Have FUN! H.Merijn
        A single letter, really?

        Interesting IJ_(digraph)

        In standard German single vowels are always monophthongs.

        At least I know now where the Swiss canton of Schwyz got its y from :)

        Cheers Rolf

        (addicted to the Perl Programming Language and ☆☆☆☆ :)

      Sorry for the confusing nature of this post. I suppose it really just comes down to this. Which of these are utf8?
          81      Q       Q
          117     u       u
          101     e       e
          101     e       e
          110     n       n
          115     s       s
          114     r       r
          195             {c3}
          191             {bf}
          99      c       c
          104     h       h
          101     e       e

          81      Q       Q
          117     u       u
          101     e       e
          101     e       e
          110     n       n
          115     s       s
          114     r       r
          255             {ff}
          99      c       c
          104     h       h
          101     e       e
Re: The Queensrÿche Situation
by Jim (Curate) on Oct 19, 2014 at 22:17 UTC

    I highly recommend using these two companion applications when working with Unicode text as well as text in other vendor and national character sets (so-called "legacy" character encodings):  BabelMap (Unicode character map for Windows) and BabelPad (Unicode text editor for Windows). They're both extraordinarily helpful when getting down 'n' dirty with Unicode.
