http://www.perlmonks.org?node_id=1090761


in reply to Re^4: Default encoding rules leave me puzzled...
in thread Default encoding rules leave me puzzled...

What are you talking about? It has nothing to do with Perl. "e" is formed from the code point U+0065, "é" is formed from code point U+00E9 or from code points U+0065 + U+0301, etc. This is defined by The Unicode Consortium, not by Perl.
And the idea that it's OK to treat OCTET 0xE9 as a substitute for code point U+00E9 is totally not defined by the consortium.
No, the input must be a string of integers in 0..255, which it is. print has no problem storing those as bytes. iso-latin-1 doesn't factor into it.
OMG. Who cares what print expects. Even Perl (in other parts) thinks that that's ridiculous.
perl -wE 'say "ç" + "ç"'
The operator plus expects numbers, just like print, right?
If you claim that iso-latin-1 is used, then you claim that use utf8; produces iso-latin-1. It doesn't. It produces Unicode code points.
Printing UNICODE STRINGS (and Perl CAN tell the difference between binary and unicode) on binary STDOUT produces a sequence of octets ENCODED as Latin-1 for code points 0 - 255. The Consortium totally wouldn't approve of that. And that's it. It appears you just don't like the word 'encoding'. Most people would still call Perl's behavior 'encoding'; that word is certainly good enough for me. You (MAYBE) would've had a point if Perl actually stored unicode codepoint U+00E7 as an octet 0xE7 internally. But we know that it doesn't anyway. Have a nice day.
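For readers following along, the two spellings of "é" mentioned at the top of this post can be checked with the core Unicode::Normalize module. This is only an illustrative sketch; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\x{E9}";      # U+00E9
my $decomposed  = "e\x{301}";    # U+0065 followed by U+0301

# The two strings differ in length but are canonically equivalent.
printf "lengths: %d and %d\n", length($precomposed), length($decomposed);

my $same_nfc = NFC($decomposed) eq $precomposed;   # compose, then compare
my $same_nfd = NFD($precomposed) eq $decomposed;   # decompose, then compare
print "equal after normalization\n" if $same_nfc && $same_nfd;
```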

Replies are listed 'Best First'.
Re^6: Default encoding rules leave me puzzled...
by ikegami (Patriarch) on Jun 22, 2014 at 00:18 UTC

    produces a sequence of octets ENCODED as Latin-1 for code points 0 - 255

    It gives the same result, yes, but only by virtue of Unicode code points being rather similar to iso-latin-1, not because print does any encoding.

    print does this:

    - If any of the elements of the string is larger than 255,
        - Warn "wide character".
        - Encode the string using utf8.
    - For each element of the string,
        - Print that number as a byte.
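The steps listed above can be mimicked in a few lines of Perl. This is a rough sketch of the described behavior, not print's actual implementation, and bytes_as_printed is a hypothetical helper invented for the example. It returns the bytes as hex so the result is easy to read.

```perl
use strict;
use warnings;

# Hypothetical helper: hex dump of the bytes that the steps above would
# write for $str on a handle with no encoding layer.
sub bytes_as_printed {
    my ($str) = @_;
    if ($str =~ /[^\x00-\xFF]/) {     # any element larger than 255?
        warn "Wide character\n";
        utf8::encode($str);           # encodes our local copy only
    }
    # Every element is now in 0..255 and is emitted as one byte.
    return join ' ', map { sprintf '%02X', ord } split //, $str;
}

print bytes_as_printed("\xE7"), "\n";           # E7
print bytes_as_printed("\xE7\x{263A}"), "\n";   # C3 A7 E2 98 BA, plus a warning
```

Note that the second call changes the bytes written for U+00E7 as well, because the whole string is encoded once any element exceeds 255.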

    The operator plus expects numbers, just like print, right?

    Two individual numbers, yes. print takes two strings of them. The bitwise operators accept either.

    $ perl -E'say "ABC" | "   "'
    abc
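A small sketch of the "accept either" point, assuming the bitwise feature is not enabled. That feature is switched on by use v5.28 or later (and by -E on such perls), and with it | always treats its operands as numbers, so the script below deliberately enables no features.

```perl
use strict;
use warnings;

# Two string operands: OR is applied to each pair of character values.
my $lower = "ABC" | "   ";   # 0x41|0x20, 0x42|0x20, 0x43|0x20
print "$lower\n";            # abc

# Two numeric operands: OR is applied to the two integers.
my $num = 5 | 2;
print "$num\n";              # 7
```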
Re^6: Default encoding rules leave me puzzled...
by Anonymous Monk on Jun 21, 2014 at 12:46 UTC
    I remembered something.
    perl -MScalar::Util=looks_like_number -wE 'use utf8; say looks_like_number("ç")? "yes" : "no"'

      What output does that Perl command-line script produce?

      C:\>chcp
      Active code page: 437

      C:\>perl -MScalar::Util=looks_like_number -wE "use utf8; say looks_like_number('ç')? 'yes' : 'no'"
      no

      C:\>bash
      $ perl -MScalar::Util=looks_like_number -wE 'use utf8; say looks_like_number("ç") ? "yes" : "no"'
      Malformed UTF-8 character (1 byte, need 3, after start byte 0xe7) at -e line 1.
      no
      $ exit

      C:\>

      By posting a command-line script and then not posting the output it produces, you've made no useful point—at least not one that's immediately understandable.

        If you saw ç in a console set to cp437, you didn't actually have ç in the script because the code is treated as being UTF-8.

        Other than properly encoding the ç, you can address that issue by replacing ç with chr(0xE7). It will still output no.
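The chr(0xE7) suggestion can be tried as a script, which sidesteps the console and source-encoding question entirely. A minimal sketch; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

my $c = chr(0xE7);   # one character, U+00E7, independent of how the source is encoded

my $verdict = looks_like_number($c) ? "yes" : "no";
print "$verdict\n";   # no

# For comparison, a string that does look like a number.
my $control = looks_like_number("42") ? "yes" : "no";
print "$control\n";   # yes
```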

      OMG so that's probably what Perl actually does.
      Converts in-place the internal representation of the string from UTF-X to the equivalent octet sequence in the native encoding (Latin-1 or EBCDIC).
      Yeah, binary print works exactly like utf8::downgrade.
        Except it doesn't convert in place...
        perl -MEncode=encode -wE 'use utf8; my $c = q(Français); say $c; say encode("utf-8", $c)'
        It encodes the string to Latin-1. Or EBCDIC. Case closed.
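The observable behavior discussed in this sub-thread can be checked with a short script. It is a sketch under two assumptions: an in-memory filehandle opened with :raw stands in for a binary STDOUT, and utf8::upgrade is used to force the internal UTF-8 representation. It shows which bytes come out and that the variable itself is left untouched; it does not settle what the operation should be called.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $c = "Fran\x{E7}ais";
utf8::upgrade($c);                     # force the internal UTF-8 representation

# Capture what print writes to a handle without an encoding layer.
my $out = '';
open(my $fh, '>:raw', \$out) or die "cannot open in-memory handle: $!";
print {$fh} $c;
close $fh;

my $still_upgraded = utf8::is_utf8($c);          # print did not convert $c in place
my $latin1         = encode("iso-8859-1", $c);   # explicit Latin-1 encoding

print $out eq $latin1 ? "same bytes as Latin-1\n" : "different bytes\n";
print $still_upgraded ? "variable unchanged\n" : "variable was downgraded\n";
```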