PerlMonks  

Re^4: DWIM with non ASCII characters

by Hue-Bond (Priest)
on May 07, 2010 at 08:23 UTC ( [id://838887]=note )


in reply to Re^3: DWIM with non ASCII characters
in thread DWIM with non ASCII characters

More importantly, use utf8; allows you to do
my $foo = 'ñ';

Hmm, then I must have configured something in my system, since I can do that without use'ing utf8:

    $ xxd ñ.pl
    0000000: 7072 696e 7420 27c3 b127 0a              print '..'.
    $ env -i /usr/bin/perl -Mstrict -wl ñ.pl
    ñ

--
 David Serrano
 (Please treat my English text just like Perl code, i.e. feel free to notify me of any syntax, grammar, style and/or spelling errors. Thank you!).

Re^5: DWIM with non ASCII characters
by almut (Canon) on May 07, 2010 at 15:10 UTC

    This only works because you have a UTF-8 terminal, but haven't told Perl about it.  In other words, Perl is treating the UTF-8 encoded byte sequence in the source code - which represents the Unicode char U+00F1 (ñ) - as two separate bytes, and passes them on as is (i.e. UTF-8 encoded) to the terminal, which consequently displays the character correctly.

    Internally, however, Perl doesn't have a character string, so you cannot properly match, etc.:

        #!/usr/local/bin/perl -l

        use strict;
        use warnings;
        use Encode;

        my $bytes = 'ñ';  # UTF-8 encoded source (c3 b1 = ñ)
                          # displays as two latin1 chars here (c3 = Ã, b1 = ±),
                          # because PM doesn't handle UTF-8

        my $chars = decode('UTF-8', $bytes);

        print '$bytes eq \x{f1} ? ', $bytes eq "\x{f1}" ? "match":"no match";
        print '$chars eq \x{f1} ? ', $chars eq "\x{f1}" ? "match":"no match";

        print '$bytes: ', $bytes;
        print '$chars: ', $chars;

        binmode STDOUT, "utf8";

        print '$bytes (STDOUT is UTF-8): ', $bytes;
        print '$chars (STDOUT is UTF-8): ', $chars;

    The string comparison outputs:

        $bytes eq \x{f1} ? no match
        $chars eq \x{f1} ? match

    and the byte/char values print as (in a UTF-8 terminal):

        $bytes: ñ
        $chars:
        $bytes (STDOUT is UTF-8): Ã±
        $chars (STDOUT is UTF-8): ñ

    Note that as soon as you tell Perl that your terminal is UTF-8 (with binmode), the byte string stops printing correctly, because Perl is now converting the two byte/latin1 chars c3 and b1 to the respective UTF-8 sequences c3 83 and c2 b1, which display as two separate characters...
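The double encoding described above can be checked without a terminal at all. What follows is only a sketch: it uses Encode's encode() as a stand-in for what a UTF-8 output layer does on print, and hex_bytes is a hypothetical helper that is not part of the thread's code.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper (not from the thread): hex dump of a byte string.
sub hex_bytes { join ' ', map { sprintf '%02x', ord } split //, shift }

my $bytes = "\xc3\xb1";               # the two UTF-8 bytes of ñ, as Perl sees them without "use utf8"
my $chars = decode('UTF-8', $bytes);  # one character, U+00F1

# Encode each string to UTF-8, as an output layer would on print.
my $double = hex_bytes(encode('UTF-8', $bytes));  # each byte is treated as a latin1 char
my $single = hex_bytes(encode('UTF-8', $chars));

print "bytes through UTF-8 encoding: $double\n";  # c3 83 c2 b1
print "chars through UTF-8 encoding: $single\n";  # c3 b1
```

The first line shows the four bytes that a UTF-8 terminal renders as two characters; the second shows the two bytes that render as a single ñ.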

      My output doesn't match yours, even when I do have the terminal in UTF-8. But that doesn't bug me since, as ikegami points out, the string lengths I get are wrong when I use non-ASCII characters:

          $ perl -l
          use warnings;
          use strict;
          print length 'ñ';
          __END__
          2

      Now I wonder how it is possible that I've never encountered any problems with this :^). Thanks!


Re^5: DWIM with non ASCII characters
by ikegami (Patriarch) on May 07, 2010 at 16:09 UTC

    That example demonstrates the use of an optimisation: You skipped specifying use utf8; (which decodes the source) by also skipping encoding the output. In the common case, it won't work. You'll find the length of the string is wrong. In turn, that means you'll have problems with regex, etc.
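A minimal sketch of the length and regex problem. To keep the script independent of how its own file is saved, it spells the string with escapes instead of a literal ñ. The assumption is that "\xc3\xb1" stands for what 'ñ' is in a UTF-8 source file without use utf8;, and the decoded string for what it is with the pragma.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xc3\xb1";               # 'ñ' from a UTF-8 source file, without "use utf8"
my $chars = decode('UTF-8', $bytes);  # the same literal with "use utf8": one char, U+00F1

my $len_bytes = length $bytes;        # counts bytes
my $len_chars = length $chars;        # counts characters

# A pattern meant to match exactly one character.
my $one_bytes = $bytes =~ /\A.\z/ ? 1 : 0;
my $one_chars = $chars =~ /\A.\z/ ? 1 : 0;

print "length: bytes=$len_bytes chars=$len_chars\n";                 # length: bytes=2 chars=1
print "single-char match: bytes=$one_bytes chars=$one_chars\n";      # single-char match: bytes=0 chars=1
```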
