Default encoding rules leave me puzzled...

by kzwix (Sexton)
on Jun 20, 2014 at 08:55 UTC
kzwix has asked for the wisdom of the Perl Monks concerning the following question:

Hello, ô wise ones,

I'm using Perl under a UTF-8 Linux environment (LC_ALL='en_US.UTF-8'), with a Perl script stored in a UTF-8-encoded file.

This script states "use utf8;" on one of its first lines, and then I have some scalar definitions which use accented characters (I'm French, so there are some 'é', 'è', 'à', 'ô', 'ù', etc.).

However, when I print strings defined in the script, I see they are garbled (the accented characters come out as ugly white squares on the black background of my PuTTY terminal, which is ALSO configured to use UTF-8).

Now, if I write :
binmode STDOUT, ':encoding(UTF-8)';
and do the same for STDERR, I get no more garbled characters, they all come out cleanly.

My question is: why do I have to specify this encoding? I thought that Perl adapted to its environment, and the localization environment variables should all be readable, right?

Can someone explain the reason to me, or point me to relevant documentation?

Thanks in anticipation.

EDIT: OK, so, after reading through all these fine answers, I think I understand my problems, and the solutions, better:

  • I had probably tried a Perl version which DID take upon itself to automatically convert some streams, but not some others, and that had left me confused
  • I'm pretty sure, by now, that Perl "decodes" input from Latin-1, and "encodes" output to Latin-1, by default.
  • I've discovered a fine command-line switch, '-C' (you may read about it in the 'perlrun' documentation), which does exactly what I would have expected Perl to do, if called with the 'L' flag alongside a stream selector. That is, '-CSL' (the 'S' selects STDIN, STDOUT and STDERR, and the 'L' makes the switch conditional) should decode STDIN from UTF-8, and encode STDOUT / STDERR to UTF-8, if such an encoding is mentioned in the 'LC_ALL', 'LC_CTYPE', or 'LANG' environment variables.
  • Of course, using explicit encoding (or explicitly stating in the script that we want it autodetected) is best practice, even if a bit more cumbersome to write

Please feel free to correct me if you feel that my conclusions are false or misleading.
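
The locale check described for the 'L' flag can be approximated in a few lines of Perl. This is only a rough sketch: the helper name `locale_mentions_utf8` is hypothetical, and the pattern match is an assumption about what "mentions UTF-8" means, not a copy of what perl itself does internally. It takes the first non-empty variable in the precedence order perlrun gives (LC_ALL, then LC_CTYPE, then LANG).

```perl
use strict;
use warnings;

# Hypothetical helper: approximates the locale test described for -C's 'L' flag.
# Takes a hash reference shaped like %ENV and returns 1 or 0.
sub locale_mentions_utf8 {
    my ($env) = @_;
    for my $name (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env->{$name};
        next unless defined $value && length $value;
        # The first variable that is set decides; later ones are ignored.
        return $value =~ /utf-?8/i ? 1 : 0;
    }
    return 0;
}

print "UTF-8 locale detected\n" if locale_mentions_utf8(\%ENV);
```

Passing a hash reference instead of reading %ENV directly keeps the check easy to try with made-up environments.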

Re: Default encoding rules leave me puzzled...
by tobyink (Abbot) on Jun 20, 2014 at 10:02 UTC

    When you include the use utf8 pragma, all this says to Perl is that your script itself is written in UTF-8. It says nothing about what encoding is used by different filehandles.

    Regardless of your environment, handles in Perl default to operating as binary byte streams. (The exact details depend on whether you're on an MS-DOS-like machine or a Unix-like machine, but the results are much the same.) This is documented in PerlIO.

    Once upon a time (in Perl 5.8.0), Perl used to automatically pick up the locale from the environment. This was somewhat unpredictable and unexpected, and this feature was removed in the next Perl release (5.8.1).

    Now, if you want Perl to sniff your environment you need to explicitly ask for it. With binmode or the open pragma, use the :locale layer.

    You might also want to check out the -C command line option which provides some facilities for switching the standard input/output/error handles to UTF-8, either unconditionally or depending on your environment.
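
A minimal sketch of the distinction drawn above, assuming a script that prints to STDOUT: the encoding layer is something you add to the handle yourself, and it can be inspected with the built-in `PerlIO::get_layers`. The helper name `has_utf8_layer` is made up for this illustration, and the accented characters are written as escapes so the example does not depend on how the file is saved.

```perl
use strict;
use warnings;

# Hypothetical helper: counts the output layers on a handle that mention UTF-8.
sub has_utf8_layer {
    my ($fh) = @_;
    return scalar grep { /utf-?8/i } PerlIO::get_layers($fh, output => 1);
}

my $before = has_utf8_layer(\*STDOUT);

# Explicitly ask for UTF-8 encoding on output, as in the original question.
binmode STDOUT, ':encoding(UTF-8)' or die "binmode failed: $!";

my $after = has_utf8_layer(\*STDOUT);

print "UTF-8 layers before: $before, after: $after\n";
print "\x{E9}\x{E8}\x{E0}\x{F4}\x{F9}\n";    # é è à ô ù, encoded by the layer
```

To let the environment decide instead, the thread's other suggestion is `use open qw(:std :locale);` near the top of the script.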

      Thanks, I guess I had the misfortune of making my first steps with Encode under the 5.8.0 version, then. I'm glad there now is a "predictable" behavior instead.
Re: Default encoding rules leave me puzzled...
by zentara (Archbishop) on Jun 20, 2014 at 13:24 UTC
Re: Default encoding rules leave me puzzled... (use open qw/ :std :locale /;)
by Anonymous Monk on Jun 20, 2014 at 09:04 UTC

      Sorry, I realize I wasn't specific enough:
      I've read about Encode, and successfully used it in a previous project. I know about the need to decode and encode streams, too. However, it seemed to me that Perl did some of this job itself (as I had tried to explicitly decode data from the standard input, or from command-line arguments, and had experienced strange results)

      So, is there some place where it is explicitly stated what is converted by Perl, in a transparent manner, and what isn't?

      Furthermore, even though I didn't Encode or Decode the streams, shouldn't it "just work", if the scalar value is specified in UTF-8 (because the file is encoded as such), and Perl is AWARE that it is UTF-8 (because of 'use utf8;'), and Perl stores it internally in UTF-8, and the expected output format is UTF-8 too?

      I'm pretty sure there is a catch I haven't figured out, there, but pointing it out to me, even if obvious, could help. Thanks!

      EDIT: I've run a short test, using a Latin-1 terminal (this test script is fully encoded in UTF-8):

      #!/usr/bin/perl
      use utf8;
      use Encode;
      $\ = "\n";
      my $unicodeScalar = "Je suis une chaîne accentuée là où il faut.";
      print '['.Encode::is_utf8($unicodeScalar).'] '.$unicodeScalar;

      Using my Latin-1 terminal, I displayed the source file, and, sure enough, the contents were garbled (2 strange bytes for each accented character, which confirmed to me that the file was truly UTF-8). Then I ran the script, and I got a perfect display.

      So, does Perl assume by default, even in a UTF-8 environment, that it should output everything in Latin-1?

        So, does Perl assume by default, even in a UTF-8 environment, that it should output everything in Latin-1?

        Perl tries to not convert anything at all, automatically.

        And since Latin-1 (mostly?) maps the first 256 codepoints 1:1 to bytes, outputting something without any conversion is the same as outputting it as Latin-1.

        Note that this round-trips binary data, which means that if your scripts or input use UTF-8, and you don't use utf8;, the output will be UTF-8 again.

        But, Latin-1 is limited to codepoints up to 255, so if something higher than that shows up in your string, perl falls back to UTF-8 (and warns).

        (As always, I'm linking to Encodings and Unicode in Perl, in the hope that it's useful to you).
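
To make the byte counts concrete, here is a small sketch using the core Encode module. The sample string and variable names are only illustrative, and the accented characters are written as escapes so the result does not depend on the encoding of the file. It shows that the Latin-1 encoding of a string whose code points are all below 256 is byte-for-byte the string itself, while the UTF-8 encoding needs two bytes per accented character.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "chaîne accentuée": 16 characters, two of them accented.
my $text = "cha\x{EE}ne accentu\x{E9}e";

my $as_latin1 = encode('ISO-8859-1', $text);   # one byte per character
my $as_utf8   = encode('UTF-8', $text);        # accented characters take two bytes

printf "characters:    %d\n", length $text;
printf "Latin-1 bytes: %d\n", length $as_latin1;
printf "UTF-8 bytes:   %d\n", length $as_utf8;

# Latin-1 maps code points 0..255 straight to bytes, so nothing changed.
print "Latin-1 encoding equals the original string\n" if $as_latin1 eq $text;

# Decoding the UTF-8 bytes gives the original characters back.
print "UTF-8 round trip ok\n" if decode('UTF-8', $as_utf8) eq $text;
```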

Re: Default encoding rules leave me puzzled...
by Anonymous Monk on Jun 20, 2014 at 11:05 UTC

    It appears, when Perl prints to binary STDOUT, it tries to encode some strings as Latin-1

    perl -wE 'use utf8; say q(Français)' | perl -lnwE 'print join q(:), unpack q(C*), $_' -
    70:114:97:110:231:97:105:115
    Byte 231 on its own is not valid UTF-8. Interestingly enough...

    perl -wE 'use utf8; say q(Русский)' | perl -lnwE 'print join q(:), unpack q(C*), $_' -

    Wide character in say at -e line 1.
    208:160:209:131:209:129:209:129:208:186:208:184:208:185

    ... which is valid utf-8... and a warning.

    Well... what do you expect? Even this site mangles utf-8 characters in 'code' tags, if it cannot decode them as Latin-1.

    I think perlmonks is written in Perl ;) Yes, that's all pretty mysterious and confusing.

      It appears, when Perl prints to binary STDOUT, it tries to encode some strings as Latin-1

      No. When you don't specify an encoding, print expects bytes, and prints those bytes provided without encoding.

      $ perl -e'print pack "C*", 0..255;' | od -t x1
      0000000 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
      0000020 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f
      0000040 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f
      0000060 30 31 32 33 34 35 36 37 38 39 3a 3b 3c 3d 3e 3f
      0000100 40 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f
      0000120 50 51 52 53 54 55 56 57 58 59 5a 5b 5c 5d 5e 5f
      0000140 60 61 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f
      0000160 70 71 72 73 74 75 76 77 78 79 7a 7b 7c 7d 7e 7f
      0000200 80 81 82 83 84 85 86 87 88 89 8a 8b 8c 8d 8e 8f
      0000220 90 91 92 93 94 95 96 97 98 99 9a 9b 9c 9d 9e 9f
      0000240 a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 aa ab ac ad ae af
      0000260 b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf
      0000300 c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce cf
      0000320 d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 da db dc dd de df
      0000340 e0 e1 e2 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed ee ef
      0000360 f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 fa fb fc fd fe ff
      0000400

      That means,

      • If you provide Unicode code points, you will get Unicode code points.
      • If you provide latin-1, you will get latin-1.
      • If you provide latin-2, you will get latin-2.
      • If you provide gzipped data, you will get gzipped data.
      • etc

      In your example, 70:114:97:110:231:97:105:115 are the Unicode code points that formed "Français". It's just that the latin-1 encoding of the first 256 code points is itself.

      $ perl -MEncode=encode -E'
          $_ = pack "C*", 0..255;
          say $_ eq encode("iso-latin-1", $_) ? "same" : "diff";
      '
      same

      Exception: If any of the characters in the string are not bytes (larger than 255), print will assume you forgot to specify :utf8. It will warn ("wide character") and encode the characters accordingly.
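
Both the rule and the exception can be observed without a terminal by printing to in-memory filehandles. The following is a sketch only: the helper `bytes_printed` is a name invented for this illustration, and the strings are written with escapes so that no `use utf8` is needed. It prints a string through a handle opened with the given mode, then reports the bytes that arrived and how many warnings were raised.

```perl
use strict;
use warnings;

# Hypothetical helper: print $string to an in-memory handle opened with $mode,
# and return the resulting bytes (as colon-separated numbers) plus the number
# of warnings raised while printing.
sub bytes_printed {
    my ($string, $mode) = @_;
    my $buffer   = '';
    my $warnings = 0;
    local $SIG{__WARN__} = sub { $warnings++ };
    open my $fh, $mode, \$buffer or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh or die "close failed: $!";
    return (join(':', map { ord } split //, $buffer), $warnings);
}

my $french  = "Fran\x{E7}ais";                                       # Français
my $russian = "\x{420}\x{443}\x{441}\x{441}\x{43A}\x{438}\x{439}";   # Русский

for my $case ([$french, '>:raw'], [$russian, '>:raw'], [$french, '>:encoding(UTF-8)']) {
    my ($bytes, $warnings) = bytes_printed(@$case);
    print "$case->[1] warnings=$warnings bytes=$bytes\n";
}
```

With no encoding layer, the French string comes out as one byte per character and the Russian one as UTF-8 with a warning, which matches the one-liners earlier in the thread. With an explicit :encoding(UTF-8) layer, the "ç" becomes the two bytes 195:167.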

        That means, If you provide Unicode code points, you will get Unicode code points.
        How can I "get Unicode code points"? Code points are an abstraction, an internal Perl thing. Perl must produce a bunch of bytes. Yes, some code points can be packed into a single byte, and this is what Perl does. Call it what you will.
        perl -e'print pack "C*", 0..255;'
        Or even
        perl -E 'say "Français"'
        That prints the bytes as it received them from bash: 0x46.0x72.0x61.0x6e.0xC3.0xA7.0x61.0x69.0x73. On the other hand
        perl -E 'use utf8; say "Français"'
        That prints garbage instead of 'ç'. The bytes are 0x46.0x72.0x61.0x6e.0xE7.0x61.0x69.0x73 and my terminal cannot display 0xE7.
        It's just that the latin-1 encoding of the first 256 code points is itself.
        Yes, encoding.


      char() takes codepoints, not UTF-8.

        What do you mean? I'm pretty sure pack's C* takes C chars, that is, octets, bytes. At least that's what the documentation says. And 231:97 is not a valid UTF-8 sequence. So in binary mode Perl encodes code points < 256 as Latin-1, otherwise it just spits out UTF-8 as is. That's pretty crazy.
        perl -CO -wE 'use utf8; say q(Français)' | perl -lnwE 'print join q(:), unpack q(C*), $_'
        70:114:97:110:195:167:97:105:115

