
Re^3: DWIM with non ASCII characters

by ikegami (Pope)
on May 07, 2010 at 07:48 UTC (#838884)


in reply to Re^2: DWIM with non ASCII characters
in thread DWIM with non ASCII characters

More importantly, use utf8; allows you to do

my $foo = '';

So far, I've stuck to ASCII in my sources, so use utf8; wouldn't do anything for me.

I thought the preferred way to decode/encode the program's input/output was by using Encode.

No way. Why encode and decode everything yourself when you can let PerlIO do it? At least, that's the way I see it.
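For illustration, here is a minimal sketch of letting PerlIO do the work, assuming the data is UTF-8. The helper name roundtrip_via_perlio is made up for this example, and an in-memory filehandle stands in for a real file or terminal:

```perl
use strict;
use warnings;

# Hypothetical helper: round-trips text through an in-memory file,
# letting the :encoding(UTF-8) layer do the encoding and decoding.
sub roundtrip_via_perlio {
    my ($text) = @_;
    my $buffer = '';

    # Writing: the layer encodes characters to UTF-8 bytes.
    open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open: $!";
    print {$out} $text;
    close($out) or die "close: $!";

    # Reading: the layer decodes UTF-8 bytes back to characters.
    open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open: $!";
    my $decoded = do { local $/; <$in> };
    close($in);

    return ($buffer, $decoded);
}

my ($bytes, $chars) = roundtrip_via_perlio("\x{f1}");

# Two bytes were written (c3 b1), one character comes back.
printf "bytes written: %s\n", join ' ', map { sprintf '%02x', ord } split //, $bytes;
printf "chars read back: %d\n", length $chars;
```

The program itself only ever handles the character string; the same layer can be put on STDIN/STDOUT with binmode or the open pragma.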


Re^4: DWIM with non ASCII characters
by Hue-Bond (Priest) on May 07, 2010 at 08:23 UTC
    More importantly, use utf8; allows you to do
    my $foo = '';

    Hmm, then I must have configured something in my system, since I can do that without use'ing utf8:

    $ xxd .pl
    0000000: 7072 696e 7420 27c3 b127 0a              print '..'.
    $ env -i /usr/bin/perl -Mstrict -wl .pl

    --
     David Serrano
     (Please treat my english text just like Perl code, i.e. feel free to notify me of any syntax, grammar, style and/or spelling errors. Thank you!).

      This only works because you have a UTF-8 terminal, but haven't told Perl about it. In other words, Perl is treating the UTF-8 encoded byte sequence in the source code - which represents the Unicode char U+00F1 (ñ) - as two separate bytes, and passes them on as is (i.e. UTF-8 encoded) to the terminal, which consequently displays the character correctly.

      Internally, however, Perl doesn't have a character string, so you cannot properly match, etc.:

      #!/usr/local/bin/perl -l
      use strict;
      use warnings;
      use Encode;

      my $bytes = 'ñ';  # UTF-8 encoded source (c3 b1 = ñ)
                        # displays as two latin1 chars here (c3 = Ã, b1 = ±),
                        # because PM doesn't handle UTF-8
      my $chars = decode('UTF-8', $bytes);

      print '$bytes eq \x{f1} ? ', $bytes eq "\x{f1}" ? "match":"no match";
      print '$chars eq \x{f1} ? ', $chars eq "\x{f1}" ? "match":"no match";

      print '$bytes: ', $bytes;
      print '$chars: ', $chars;

      binmode STDOUT, "utf8";

      print '$bytes (STDOUT is UTF-8): ', $bytes;
      print '$chars (STDOUT is UTF-8): ', $chars;

      The string comparison outputs:

      $bytes eq \x{f1} ? no match
      $chars eq \x{f1} ? match

      and the byte/char values print as (in a UTF-8 terminal):

      $bytes: $chars: $bytes (STDOUT is UTF-8): ñ $chars (STDOUT is UTF-8):

      Note that as soon as you tell Perl that your terminal is UTF-8 (with binmode), the byte string stops printing correctly, because Perl is now converting the two byte/latin1 chars c3 and b1 to the respective UTF-8 sequences c3 83 and c2 b1, which display as two separate characters...
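      To make the byte sequences above visible, here is a small sketch that uses Encode's encode() to mimic what a UTF-8 output layer does to each string. The hex_bytes helper is made up for this example:

      ```perl
      use strict;
      use warnings;
      use Encode qw(encode);

      # Hypothetical helper: hex dump of the bytes in a string.
      sub hex_bytes {
          my ($string) = @_;
          return join ' ', map { sprintf '%02x', ord } split //, $string;
      }

      my $bytes = "\xc3\xb1";   # undecoded UTF-8 source bytes for U+00F1
      my $chars = "\x{f1}";     # the decoded character

      # The byte string gets encoded a second time, one latin1 char at a time.
      print hex_bytes(encode('UTF-8', $bytes)), "\n";   # c3 83 c2 b1

      # The character string is encoded exactly once.
      print hex_bytes(encode('UTF-8', $chars)), "\n";   # c3 b1
      ```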

        My output doesn't match yours, even when I do have the terminal in UTF-8. But that doesn't bug me since, as ikegami points out, the string lengths I get are wrong when I use non-ASCII characters:

        $ perl -l
        use warnings;
        use strict;
        print length 'ñ';
        __END__
        2

        Now I wonder how it is possible that I've never encountered any problems with this :^). Thanks!


      That example demonstrates the use of an optimisation: You skipped specifying use utf8; by also skipping encoding. In the common case, it won't work. You'll find the length of the string is wrong. In turn, that means you'll have problems with regex, etc.
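      A short sketch of both symptoms, assuming the source bytes are UTF-8. The describe helper is made up for this example, and decode() stands in for what use utf8; does to a literal:

      ```perl
      use strict;
      use warnings;
      use Encode qw(decode);

      # Hypothetical helper: reports the length of a string and whether
      # a regex asking for exactly one character matches it.
      sub describe {
          my ($string) = @_;
          my $single = $string =~ /^.$/ ? 'yes' : 'no';
          return sprintf 'length %d, matches /^.$/: %s', length($string), $single;
      }

      my $bytes = "\xc3\xb1";                # the literal as seen without "use utf8;"
      my $chars = decode('UTF-8', $bytes);   # the literal as seen with it

      print describe($bytes), "\n";   # length 2, matches /^.$/: no
      print describe($chars), "\n";   # length 1, matches /^.$/: yes
      ```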
