PerlMonks  

Re^3: Default encoding rules leave me puzzled...

by Anonymous Monk
on Jun 20, 2014 at 20:28 UTC (#1090689)


in reply to Re^2: Default encoding rules leave me puzzled...
in thread Default encoding rules leave me puzzled...

That means, If you provide Unicode code points, you will get Unicode code points.
How can I "get Unicode code points"? Code points is an abstraction, it's an internal Perl thing. It must produce a bunch of bytes. Yes, some codepoints can be packed into a single byte. And this is what Perl does. Call it what you will.
perl -e'print pack "C*", 0..255;'
Or even
perl -E 'say "Français"'
That prints the bytes as it received them from bash: 0x46.0x72.0x61.0x6e.0xC3.0xA7.0x61.0x69.0x73. On the other hand
perl -E 'use utf8; say "Français"'
That prints garbage instead of 'ç'. The bytes are 0x46.0x72.0x61.0x6e.0xE7.0x61.0x69.0x73 and my terminal cannot display 0xE7.
It's just that the latin-1 encoding of the first 256 code points is itself.
Yes, encoding.
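To make the two one-liners easier to compare, here is a minimal sketch that spells out what each literal holds, assuming a UTF-8 terminal as above. The variable names and the hex_dump helper are only for illustration; the literals are written with escapes so the result does not depend on how the file is saved.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Without "use utf8": the literal is the nine UTF-8 bytes typed in the shell.
my $without_utf8 = "Fran\xC3\xA7ais";

# With "use utf8": the literal is eight characters, one of them U+00E7.
my $with_utf8 = "Fran\xE7ais";

# Hypothetical helper: show every element of the string as a hex number.
sub hex_dump {
    my ($str) = @_;
    return join '.', map { sprintf '0x%02x', ord($_) } split //, $str;
}

print length($without_utf8), ' ', hex_dump($without_utf8), "\n";
print length($with_utf8),    ' ', hex_dump($with_utf8),    "\n";

# Encoding the character string explicitly gives back the bytes a UTF-8
# terminal expects.
print encode('UTF-8', $with_utf8) eq $without_utf8
    ? "same bytes\n"
    : "different bytes\n";
```

The first line reports 9 elements and the second 8; the explicit encode step is what turns the second string into the first.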


Re^4: Default encoding rules leave me puzzled...
by ikegami (Pope) on Jun 21, 2014 at 07:17 UTC

    Code points is an abstraction, it's an internal Perl thing.

    What are you talking about? It has nothing to do with Perl. "e" is formed from the code point U+0065, "é" is formed from code point U+00E9 or from code points U+0065 + U+0301, etc. This is defined by The Unicode Consortium, not by Perl.

    It must produce a bunch of bytes.

    No, the input must be a string of integers in 0..255, which it is. print has no problem storing those as bytes. iso-latin-1 doesn't factor into it.

    In which of the following does print use iso-latin-1?

    use utf8;
    my $s1 = inet_aton('195.169.195.171');  print($s1);
    my $s2 = encode_utf8("éë");             print($s2);
    my $s3 = "Ã©Ã«";                        print($s3);
    my $s4 = "\xC3\xA9\xC3\xAB";            print($s4);

    The only two possible answers are "all of them" or "none of them", since print can't tell the difference between those strings.

    If you claim that iso-latin-1 is used, then you claim that use utf8; produces iso-latin-1. It doesn't. It produces Unicode code points.

    That prints garbage instead of 'ç'.

    Because the terminal expects bytes of UTF-8, but it got bytes of Unicode code points.

      What are you talking about? It has nothing to do with Perl. "e" is formed from the code point U+0065, "é" is formed from code point U+00E9 or from code points U+0065 + U+0301, etc. This is defined by The Unicode Consortium, not by Perl.
      And the idea that it's OK to treat OCTET 0xE7 as a substitute for code point U+00E7 is totally not defined by the consortium.
      No, the input must be a string of integers in 0..255, which it is. print has no problem storing those as bytes. iso-latin-1 doesn't factor into it.
      OMG. Who cares what print expects. Even Perl (in other parts) thinks that that's ridiculous.
      perl -wE 'say "" + ""'
      The operator plus expects numbers, just like print, right?
      If you claim that iso-latin-1 is used, then you claim that use utf8; produces iso-latin-1. It doesn't. It produces Unicode code points.
      Printing UNICODE STRINGS (and Perl CAN tell the difference between binary and unicode) on binary STDOUT produces a sequence of octets ENCODED as Latin-1 for code points 0 - 255. The Consortium totally wouldn't approve of that. And that's it. It appears you just don't like the word 'encoding'. Most people would still call Perl's behavior 'encoding', that word is certainly good enough for me. You (MAYBE) would've had a point if Perl actually stored unicode codepoint U+00E7 as an octet 0xE7 internally. But we know that it doesn't anyway. Have a nice day.
        I remembered something.
        perl -MScalar::Util=looks_like_number -wE 'use utf8; say looks_like_number("")? "yes" : "no"'

        produces a sequence of octets ENCODED as Latin-1 for code points 0 - 255

        It gives the same result, yes, but only by virtue of Unicode code points being rather similar to iso-latin-1, not because print does any encoding.

        print does this:

        - If any of the elements of the string is larger than 255,
            - Warn "wide character".
            - Encode the string using utf8.
        - For each element of the string,
            - Print that number as a byte.
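The steps above can be sketched in plain Perl. This is only an illustration of the description, not Perl's actual implementation; bytes_print_would_write is a hypothetical helper name, and it models a handle with no encoding layer.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Perl built-in): returns the byte values that
# print would write for $str on a handle without an encoding layer,
# following the steps described above.
sub bytes_print_would_write {
    my ($str) = @_;                  # work on a copy of the argument

    # If any element of the string is larger than 255 ...
    if ($str =~ /[^\x00-\xFF]/) {
        warn "Wide character\n";     # ... warn,
        utf8::encode($str);          # ... and encode the string using utf8.
    }

    # Every element is now in 0..255 and is written as one byte.
    return map { ord($_) } split //, $str;
}

my @narrow = bytes_print_would_write("\xE7");      # U+00E7 fits in one byte
my @wide   = bytes_print_would_write("\x{263A}");  # U+263A does not

print join(' ', map { sprintf '%02X', $_ } @narrow), "\n";   # E7
print join(' ', map { sprintf '%02X', $_ } @wide),   "\n";   # E2 98 BA
```

Note that the first call involves no encoding step at all: the element 0xE7 is simply written as the byte 0xE7.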

        The operator plus expects numbers, just like print, right?

        Two individual numbers, yes. print takes two strings of them. The bitwise operators accept either.

        $ perl -E'say "ABC" | "   "'
        abc

      Your Perl script doesn't compile.

      C:\>chcp
      Active code page: 437
      
      C:\>type 1090732.pl
      use utf8;
      my $s1 = inet_aton('195.169.195.171');  print($s1);
      my $s2 = encode_utf8("├⌐├½");             print($s2);
      my $s3 = "├â┬⌐├â┬½";                        print($s3);
      my $s4 = "\xC3\xA9\xC3\xAB";            print($s4);
      
      C:\>cat 1090732.pl
      use utf8;
      my $s1 = inet_aton('195.169.195.171');  print($s1);
      my $s2 = encode_utf8("éë");             print($s2);
      my $s3 = "Ã©Ã«";                        print($s3);
      my $s4 = "\xC3\xA9\xC3\xAB";            print($s4);
      
      C:\>perl 1090732.pl
      Undefined subroutine &main::inet_aton called at 1090732.pl line 2.
      
      C:\>
      

        inet_aton is provided by Socket, and encode_utf8 is provided by Encode. I left a few obvious headers out since they weren't relevant.

        In all four cases, print outputs the four bytes C3 A9 C3 AB because in all four cases, the string passed to print was "\xC3\xA9\xC3\xAB".
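For anyone who wants to run it, here is a version of the script with the omitted headers restored, as a sketch. It assumes the file is saved as UTF-8 (required by use utf8 for the two non-ASCII literals), and it prints each string as hex instead of raw bytes so the result does not depend on the terminal.

```perl
use strict;
use warnings;
use utf8;                        # source literals are decoded from UTF-8
use Socket qw(inet_aton);        # provides inet_aton
use Encode qw(encode_utf8);      # provides encode_utf8

my $s1 = inet_aton('195.169.195.171');   # packed IPv4 address, 4 bytes
my $s2 = encode_utf8("éë");              # 2 characters encoded to 4 bytes
my $s3 = "Ã©Ã«";                         # 4 characters, all below 256
my $s4 = "\xC3\xA9\xC3\xAB";             # 4 elements given as escapes

for my $s ($s1, $s2, $s3, $s4) {
    # Show each element of the string as a two-digit hex number.
    print join(' ', map { sprintf '%02X', ord($_) } split //, $s), "\n";
}

# All four strings compare equal, so print cannot tell them apart.
print "all equal\n"
    if $s1 eq $s4 && $s2 eq $s4 && $s3 eq $s4;
```

Each of the four lines of output reads C3 A9 C3 AB.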
