PerlMonks  

Re^2: Default encoding rules leave me puzzled...

by ikegami (Pope)
on Jun 20, 2014 at 18:56 UTC (#1090671)


in reply to Re: Default encoding rules leave me puzzled...
in thread Default encoding rules leave me puzzled...

It appears, when Perl prints to binary STDOUT, it tries to encode some strings as Latin-1

No. When you don't specify an encoding, print expects bytes and outputs the bytes it is given, without encoding them.

$ perl -e'print pack "C*", 0..255;' | od -t x1
0000000 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
0000020 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f
0000040 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f
0000060 30 31 32 33 34 35 36 37 38 39 3a 3b 3c 3d 3e 3f
0000100 40 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f
0000120 50 51 52 53 54 55 56 57 58 59 5a 5b 5c 5d 5e 5f
0000140 60 61 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f
0000160 70 71 72 73 74 75 76 77 78 79 7a 7b 7c 7d 7e 7f
0000200 80 81 82 83 84 85 86 87 88 89 8a 8b 8c 8d 8e 8f
0000220 90 91 92 93 94 95 96 97 98 99 9a 9b 9c 9d 9e 9f
0000240 a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 aa ab ac ad ae af
0000260 b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf
0000300 c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce cf
0000320 d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 da db dc dd de df
0000340 e0 e1 e2 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed ee ef
0000360 f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 fa fb fc fd fe ff
0000400

That means,

  • If you provide Unicode code points, you will get Unicode code points.
  • If you provide latin-1, you will get latin-1.
  • If you provide latin-2, you will get latin-2.
  • If you provide gzipped data, you will get gzipped data.
  • etc

In your example, 70:114:97:110:231:97:105:115 are the Unicode code points that formed "Français". It's just that the latin-1 encoding of the first 256 code points is itself.

$ perl -MEncode=encode -E'
   $_ = pack "C*", 0..255;
   say $_ eq encode("iso-latin-1", $_) ? "same" : "diff";
'
same

Exception: If any of the characters in the string are not bytes (that is, larger than 255), print will assume you forgot to specify :utf8. It will warn ("Wide character") and encode the string as UTF-8.


Re^3: Default encoding rules leave me puzzled...
by Anonymous Monk on Jun 20, 2014 at 20:28 UTC
    That means, If you provide Unicode code points, you will get Unicode code points.
    How can I "get Unicode code points"? Code points is an abstraction, it's an internal Perl thing. It must produce a bunch of bytes. Yes, some codepoints can be packed into a single byte. And this is what Perl does. Call it what you will.
    perl -e'print pack "C*", 0..255;'
    Or even
    perl -E 'say "Français"'
    That prints bytes as it received them from bash: 0x46.0x72.0x61.0x6e.0xC3.0xA7.0x61.0x69.0x73. On the other hand
    perl -E 'use utf8; say "Français"'
    That prints garbage instead of 'ç'. The bytes are 0x46.0x72.0x61.0x6e.0XE7.0x61.0x69.0x73 and my terminal cannot display 0XE7.
    It's just that the latin-1 encoding of the first 256 code points is itself.
    Yes, encoding.

      Code points is an abstraction, it's an internal Perl thing.

      What are you talking about? It has nothing to do with Perl. "e" is formed from the code point U+0065, "é" is formed from code point U+00E9 or from code points U+0065 + U+0301, etc. This is defined by The Unicode Consortium, not by Perl.

      It must produce a bunch of bytes.

      No, the input must be a string of integers in 0..255, which it is. print has no problem storing those as bytes. iso-latin-1 doesn't factor into it.

      In which of the following does print use iso-latin-1?

      use utf8;
      my $s1 = inet_aton('195.169.195.171');  print($s1);
      my $s2 = encode_utf8("éë");             print($s2);
      my $s3 = "Ã©Ã«";                        print($s3);
      my $s4 = "\xC3\xA9\xC3\xAB";            print($s4);

      The only two possible answers are "all of them" or "none of them", since print can't tell the difference between those strings.

      If you claim that iso-latin-1 is used, then you claim that use utf8; produces iso-latin-1. It doesn't. It produces Unicode code points.

      That prints garbage instead of 'ç'.

      Because the terminal expects bytes of UTF-8, but it got bytes of Unicode code points.

        What are you talking about? It has nothing to do with Perl. "e" is formed from the code point U+0065, "é" is formed from code point U+00E9 or from code points U+0065 + U+0301, etc. This is defined by The Unicode Consortium, not by Perl.
        And the idea that it's OK to treat OCTET 0xE7 as a substitute for code point U+00E7 is totally not defined by the consortium.
        No, the input must be a string of integers in 0..255, which it is. print has no problem storing those as bytes. iso-latin-1 doesn't factor into it.
        OMG. Who cares what print expects. Even Perl (in other parts) thinks that that's ridiculous.
        perl -wE 'say "ç" + "ç"'
        The operator plus expects numbers, just like print, right?
        If you claim that iso-latin-1 is used, then you claim that use utf8; produces iso-latin-1. It doesn't. It produces Unicode code points.
        Printing UNICODE STRINGS (and Perl CAN tell the difference between binary and unicode) on binary STDOUT produces a sequence of octets ENCODED as Latin-1 for code points 0 - 255. The Consortium totally wouldn't approve of that. And that's it. It appears you just don't like the word 'encoding'. Most people would still call Perl's behavior 'encoding', that word is certainly good enough for me. You (MAYBE) would've had a point if Perl actually stored unicode codepoint U+00E7 as an octet 0xE7 internally. But we know that it doesn't anyway. Have a nice day.

        Your Perl script doesn't compile.

        C:\>chcp
        Active code page: 437
        
        C:\>type 1090732.pl
        use utf8;
        my $s1 = inet_aton('195.169.195.171');  print($s1);
        my $s2 = encode_utf8("éë");             print($s2);
        my $s3 = "Ã©Ã«";                        print($s3);
        my $s4 = "\xC3\xA9\xC3\xAB";            print($s4);
        
        C:\>cat 1090732.pl
        use utf8;
        my $s1 = inet_aton('195.169.195.171');  print($s1);
        my $s2 = encode_utf8("éë");             print($s2);
        my $s3 = "Ã©Ã«";                        print($s3);
        my $s4 = "\xC3\xA9\xC3\xAB";            print($s4);
        
        C:\>perl 1090732.pl
        Undefined subroutine &main::inet_aton called at 1090732.pl line 2.
        
        C:\>
        
