http://www.perlmonks.org?node_id=731943

wanradt has asked for the wisdom of the Perl Monks concerning the following question:

Fellow devotees, on this path of mine I have two related questions which arise from time to time, and I have not found clear answers to them.

The first is a simple, practical one. I need every possible input and output to/from my script to be treated as UTF-8. So I made a test script which (through wild and hard ways) almost satisfies this criterion. Still, I can't get command-line arguments handled properly; I still had to use decode on @ARGV. So, the question: how should I get @ARGV treated properly, and is there a simpler way to handle input/output than what I did in the script below?

#!/usr/bin/perl
use strict;
use warnings;
use locale;
use utf8;
use Encode;

binmode STDIN,  ":utf8";
binmode STDOUT, ":utf8";

# test non-default output too
open(OUT, ">utf8", "sample_out.txt");

print "Command line argument: \n";
my $t = $ARGV[0];
&output_string($t, decode("utf-8", $t));

print "Enter some umlaut, please: ";
$t = <STDIN>;    # õäöüšž is good input to test
chomp($t);
&output_string($t);

print "Variable from source code: \n";
$t = "õäöüšž";
&output_string($t);

print "String from file: \n";
open(IN, "<utf8", "sample.txt");
$t = <IN>;
close(IN);
chomp($t);
&output_string($t);

close(OUT);

sub output_string {
    my ($str)  = shift;
    my ($dstr) = shift || '';
    my ($ustr) = uc($str);
    print length($str), " $str $ustr $dstr\n\n";
    print OUT "$str $ustr $dstr\n";
}
__END__
sample.txt contains the same string: õäöüšž

And the second one, assuming that my script is based on a correct understanding of the status quo in Perl: why is UTF-8 string handling so painful in Perl?

Let me explain how I see things.

In Perl we have a good thing: pragmas. So when I tell my script, "hey, I need everything to behave as is customary in my locale", I just say "use locale;". If I have a properly set up system locale, it should carry over to my program too. In reality I see no such thing: in the example script above there is no difference between using locale or not. Did I mention I have it set? With POSIX setlocale I checked that Perl sees my locale (et_EE.UTF-8), but it seems to have no influence on the input/output chain or on character handling. I hoped that maybe we had a bug in our system locale, but there was no change when I used other locales with UTF-8 support. So I found I cannot rely on "use locale", and that is sad.

Then (and this was back in the last century already) I found another pragma: utf8. It was a good day. But not for long, because it did not do what I hoped. The pod says:

The "use utf8" pragma tells the Perl parser to allow UTF-8 in the program text in the current lexical scope
So basically this does not change that much; it is good for beginners like me, since I am not forced to separate program logic from content strings. But it does not have the power to handle IO, so this pragma did not help me either.
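A quick way to see what "use utf8" does and does not do: it makes the parser decode string literals in the source text, while output still needs its own layer. A minimal sketch (assuming the source file itself is saved as UTF-8):

```perl
use strict;
use warnings;
use utf8;                            # source literals are decoded as UTF-8
binmode STDOUT, ':encoding(UTF-8)';  # IO still needs an explicit layer

my $s = "õäöüšž";
print length($s), "\n";              # prints 6 (characters), not 12 (bytes)
```

Without "use utf8", the same literal would be six two-byte UTF-8 sequences, and length() would report 12.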

On the way to getting things working with UTF-8 I learned some tricks and hacks, but I don't see a systematic solution. I'd like to see Some Pragma that simply makes every string in its lexical scope behave as Unicode and makes all the IO Unicode-proof too. As a concept it seems so easy :) In the manuals I read things like "if the parser sees a wide character, the utf8 flag is turned on". Why? What harm can it do when the user declares a scope to be fully Unicode and every piece is treated as Unicode? No fears, no doubts, no need to check strings against some tests. It seems so simple to me that doubts arise, and I must admit: it is almost certain I am missing some piece of the big picture.

So, after using Super Search here and after reading some pods, I'd like to ask: what makes it so hard to implement a real Unicode pragma?

Nõnda, WK

Re: Pragma to handle unicode characters
by ikegami (Patriarch) on Dec 22, 2008 at 01:32 UTC
    :utf8 on input is insecure.
    binmode STDIN,  ":utf8";
    binmode STDOUT, ":utf8";

    should be

    binmode STDIN,  ":encoding(UTF-8)";
    binmode STDOUT, ":encoding(UTF-8)";
    binmode STDERR, ":encoding(UTF-8)";

    and can be replaced with

    use open ':std', ':encoding(UTF-8)';

    or better yet

    use open ':std', ':locale';

    :utf8 on input is insecure.
    open(IN, "<utf8", "sample.txt");

    should be

    open(IN, "<:encoding(UTF-8)", "sample.txt");

    and can be replaced with

    use open IO => ':encoding(UTF-8)';

    or

    use open IO => ':locale';

    use utf8;
    indicates the source file is encoded using UTF-8. Without it, Perl assumes it's encoded using iso-latin-1.

    None of the above decodes the content of @ARGV, data read from open(FH, '-') or data read via <>. The last is being worked on.
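Until then, @ARGV can be decoded by hand near the top of the script. A minimal sketch, hard-coding UTF-8 as the assumed terminal encoding (if that assumption may not hold, query the locale first, e.g. via I18N::Langinfo):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Turn the raw bytes in @ARGV into Perl character strings,
# croaking on malformed UTF-8 instead of silently mangling it.
@ARGV = map { decode('UTF-8', $_, Encode::FB_CROAK) } @ARGV;
```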

Re: Pragma to handle unicode characters (roads)
by tye (Sage) on Dec 22, 2008 at 08:42 UTC

    Unicode was designed during the era where the concept of "byte stream" had become ubiquitous. I see but little evidence in the design of Unicode that its designers had much appreciation for the prior, messier era nor especially for the fact that what they were designing was going to destroy the then-current comfortable "everything is a byte stream" world.

    There were clearly plans for the eventual utopia of "all character strings/streams are in Unicode" but if there were plans for the uncomfortable transition period that we are currently moving into the heart of (things will still continue to get worse for a little while longer before they start getting better), I haven't seen much evidence of that.

    I'd expect to see leadership on this front from one or more sources of Unix operating systems (Linux, BSD, Sun, etc.). If it is there, I really haven't seen it. I still haven't seen evidence of a plan for this transition in Unix. I see incomplete pieces that try to cover the "before" case (everything is in Latin-1 or whatever) and try to cover the "after" case (everything is UTF-8), but little that deals with the mixed bag one currently usually finds oneself in, such as: I want to move forward with UTF-8 data in many of my files but tool Z can't handle that so I need to use Latin-1 data for Z but I want to keep using filenames and command lines written in my preferred Latin-2 for now.

    But dealing with this transition requires defining ways for applications to declare what type of data they are prepared to deal with (covering many different interfaces: command-line arguments, environment variables, file names, text streams, etc.).

    A relatively simple approach that I'd expect to see in Unix would be to enable a Unix to be built such that all text is stored in UTF-8 (file names, environment variable names and their values, text in configuration files, etc.). Then be able to declare that "application Z" only understands Latin-1 and so "application Z" gets passed an environment encoded in Latin-1 and has filename accesses translated between Latin-1 and UTF-8 for it, and text streams also get converted for it.

    Actually, Win32 is over a decade ahead of Unix on this front. WinNT did all system work in UTF-16 and let each program declare whether it wanted to use single-byte characters or "UNICODE" characters. Programs can even do a little bit of extra work and access both the single-byte-character APIs and the native UTF-16 APIs.

    That is why it was relatively easy for me to add Unicode support for file-system operations to Win32API::File (now if only I could finish testing and integrating those changes and get them uploaded to CPAN).

    But Perl has followed along with Unix and is still mostly unprepared for the ugly middle ground we often currently find ourselves in. But Perl is also unprepared for the eventual "all characters are UTF-8" utopia. But I think that part of the proper way to prepare for that future utopia is to define much better ways for declaring what encoding should be used on the different interfaces. Perl has finally gotten a good start on that when it comes to streams (if anything, there may be too many choices, but that is a good way to figure out what the best choices should be). And Perl has an acceptable start on dealing with the dual nature for its own character strings.

    But Perl has yet to define great ways of reconciling Unicode with filenames, environment variables, command lines, usernames, hostnames, etc. And a very simple "all interfaces want UTF-8" option seems like a wise goal to work toward.

    And I completely disagree that it is a good thing to force one to separately declare UTF-8ness on every interface. I think it is good to allow such, especially now, if convenient. But UTF-8 is not some complex structure like PDF, HTML, JSON, etc. UTF-8 is very much like the choice between Latin-1 and Latin-2 (a choice of locale). It would be best if Perl could just notice "Oh, look, I'm finally running in the 'all is UTF-8' utopia" and work accordingly. I doubt anybody will ever be in a situation where "all streams are HTML", Unix usernames are HTML, Perl strings know whether they are HTML or just the backward-compatible "plain text", etc.

    - tye        

Re: Pragma to handle unicode characters
by graff (Chancellor) on Dec 22, 2008 at 04:32 UTC
    what makes it so hard to implement a real Unicode pragma?

    Bear in mind that many people are not ready (or don't need) to pursue the use of unicode; and many of these people are dependent on Perl behaving a particular way with regard to handling i/o that is not strictly limited to ASCII characters.

    These people would be severely and negatively surprised if they discovered that by installing the next version of Perl, all of their existing scripts would need to be modified in order to preserve their original behaviors with regard to file i/o.

    (That actually happened once, with the introduction of Perl 5.8.0 on RedHat Linux: the particular RedHat release used utf-8 locale settings for the "default" shell environment, and that version of Perl used the locale settings in order to decide what the default i/o layer should be. Mayhem ensued because scripts that had worked previously were suddenly creating garbage. As a result, the Perl 5.8.1 release did not rely directly on locale settings for its default i/o layer selection.)

      These people would be severely and negatively surprised if they discovered that by installing the next version of Perl, all of their existing scripts would need to be modified ...

      The use of a dedicated new pragma (disabled by default) wouldn't bring about such backwards compatibility issues.

      It's kind of a pity, however, that the most intuitive name "use encoding ..." is already taken...

      Fellow Almut answered most things to you and the others practically the way I think, so thank you, Almut. On one aspect I still want to make a point:
      (That actually happened once, with the introduction of Perl 5.8.0 on RedHat Linux: the particular RedHat release used utf-8 locale settings for the "default" shell environment, and that version of Perl used the locale settings in order to decide what the default i/o layer should be.
      If those scripts broke after locale propagation became the default, those scripts were buggy, right? No one should use the locale pragma if they don't mean it, right?
        You misunderstood the situation. The victims of the problem never intended to use locale information in any way in their scripts, and the scripts were not written to use locale information. It just suddenly turned out (when the script was run on that particular RedHat release with that particular Perl version) that the use of locale information was imposed on them as "the new default" -- and many of them couldn't figure out why their scripts were suddenly failing until they turned to the community for help.

        "Oh, you need to change your shell environment so it doesn't use the new default utf8 locale, and/or you need to change your existing perl scripts..."

        As a rule, if you want to build some new functionality into a tool, and this is incompatible in some way with previous functionality that has an established base of users, it's better not to require that those established users change all their code for the sake of the new feature (which they might not have wanted in the first place).

Re: Pragma to handle unicode characters
by almut (Canon) on Dec 22, 2008 at 01:15 UTC
    (...) what makes it so hard to implement a real Unicode pragma?

    I don't have an answer to your question, but I'd just like to point out another issue that you haven't even touched on: the handling of file names (such as sämple.txt)...

      How simple do you want encoding/decoding? Would you like Perl to "automagically" encode/decode JSON? ASN.1? Why, specifically, do you demand it of UTF-8?

      The truth is that UTF-8 is a variable-length character encoding method. It's probably a good thing that you have to explicitly decode inputs and encode outputs.. it forces you to know what you are doing.

      Update: an example: e-mail - yes, you can send e-mails as UTF-8! But were you aware that MIME headers must be in a 7-bit encoding? In this case blindly opening a socket and telling it to encode all UTF-8 output will severely break your application. It is much better to know specifically when and where encoding is appropriate and permissible..
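For what it's worth, the Encode distribution already covers this particular case: its MIME-Header encoding wraps non-ASCII header text in 7-bit RFC 2047 encoded words rather than emitting raw UTF-8 bytes. A small sketch:

```perl
use strict;
use warnings;
use utf8;                 # the literal below is UTF-8 in the source
use Encode qw(encode);

# Encode a header value into 7-bit "=?UTF-8?B?...?=" encoded words
# instead of sending raw UTF-8 bytes in the header.
my $subject = encode('MIME-Header', "Tere, õäöüšž!");
print "Subject: $subject\n";   # ASCII-only output
```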

        Would you like Perl to "automagically" encode/decode JSON? ASN.1? Why, specifically, do you demand it of UTF-8?

        It's a matter of convenience, primarily — and in some cases, transparency (such as having a single point of configuration where the encoding can be switched, rather than requiring every piece of code to take care of it on its own).

        The comparison to JSON or ASN.1 seems somewhat far-fetched to me. Unicode is envisaged - and I think widely accepted - to eventually become the successor of legacy character encodings such as Latin-1, with their well known limits. And, among the Unicode encodings, UTF-8 would presumably be a good choice to be used as the default (because it was specifically designed with backwards compatibility in mind). In contrast, JSON / ASN.1 are rather special purpose (and typically not used as character encodings), so I don't currently see any need to have similar built-in support for them in Perl.

        The truth is that UTF-8 is a variable-length character encoding method. It's probably a good thing that you have to explicitly decode inputs and encode outputs.. it forces you to know what you are doing.

        Equally (with a hypothetical pure ASCII mind set in place) you could say: "The truth is that Latin-1 is a (specific) 8-bit character encoding method. It's probably a good thing that you have to explicitly decode inputs and encode outputs.. it forces you to know what you are doing." — Still, we do have Latin-1 semantics by default in Perl...

        Just because UTF-8 is variable length doesn't mean it wouldn't be a sensible choice in environments that otherwise make use of it, in particular when the programmer explicitly requests that very functionality using a pragma.

        (...) MIME headers must be in a 7-bit encoding

        The current 8-bit default for IO could cause just as much potential breakage as UTF-8 would in this case. I don't think that particular limits which apply to certain content (or parts thereof) is a good argument against generally providing a way to conveniently say "I want UTF-8 to be used as default for all strings/content" (which is what I think the OP had in mind).  Special cases can be dealt with in the application code. As things are now, UTF-8 (or, more generally, anything non-Latin-1) is still too often the "special case", rather than a (configurable!) global default.

Re: Pragma to handle unicode characters
by borisz (Canon) on Dec 22, 2008 at 00:14 UTC
    I guess
    use encoding 'utf8';
    does what you want. The pragma also changes the PerlIO layer for STDIN and STDOUT.
    Boris
      The encoding pragma has issues, so I avoid it. Use of the utf8 and open pragmas is more suitable.

      I tried it, but there are other worries: warnings like "Wide character in print at ...", @ARGV is still not treated as UTF-8 characters, and uc() does not recognize them as characters.

      And perldoc utf8 says

      In case you are wondering: yes, "use encoding 'utf8';" works much the same as "use utf8;".

      In perl58delta I also read:

      New Unicode Semantics (no more use utf8, almost)

      Previously in Perl 5.6 to use Unicode one would say "use utf8" and then the operations (like string concatenation) were Unicode-aware in that lexical scope.
      So I could not even use "use utf8", theoretically ;)

      Nõnda, WK
Re: Pragma to handle unicode characters
by jwkrahn (Abbot) on Dec 22, 2008 at 04:01 UTC
    I need every possible input and output to/from my script to be treated as UTF-8.

    Use the open pragma:

    use open qw/ :std :utf8 /;

      That still wouldn't handle "every possible input", such as @ARGV, file names, and - while we're at it - environment variables.

Re: Pragma to handle unicode characters
by Anonymous Monk on Dec 22, 2008 at 04:34 UTC
    Have you tried the -C option or, equivalently, the PERL_UNICODE environment variable?

    It seems to help, but I don't have enough experience with unicode strings to really test it properly.

    $ perl -CSDAL -e 'print "the utf8 flag is ", utf8::is_utf8(shift) ? "on" : "off", " for command-line arguments\n"' hi...
    the utf8 flag is on for command-line arguments
    $ perl -e 'print "the utf8 flag is ", utf8::is_utf8(shift) ? "on" : "off", " for command-line arguments\n"' hi...
    the utf8 flag is off for command-line arguments
      Have you tried the -C option or, equivalently, the PERL_UNICODE environment variable?
      No, I had not, but now I tried it and it helped me. After setting PERL_UNICODE=39 I still (just) need to set in my script
      use utf8;
      use open ':std', ':encoding(UTF-8)';
      Thank you, it answers my first question!

      It still does not affect %ENV variables, as fellow Almut pointed out.
      Nõnda, WK
        You might want to try PERL_UNICODE=63. I think that will let you drop the 'use open' line.
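For reference, PERL_UNICODE is the numeric sum of the -C flag bits documented in perlrun, which is where values like 39 and 63 come from. A sketch of the arithmetic (bit values as I understand them from perlrun; worth double-checking against your Perl's docs):

```perl
use strict;
use warnings;

# -C / PERL_UNICODE bits as documented in perlrun:
#   I=1 (STDIN), O=2 (STDOUT), E=4 (STDERR); S = I+O+E = 7
#   i=8, o=16 (default UTF-8 layer for other input/output); D = i+o = 24
#   A=32 (decode @ARGV as UTF-8); L=64 (apply only under a UTF-8 locale)
my ($S, $D, $A) = (1 + 2 + 4, 8 + 16, 32);

print $S + $A,      "\n";   # 39, i.e. -CSA:  std handles plus @ARGV
print $S + $D + $A, "\n";   # 63, i.e. -CSDA: also the default open layer
```

With the D bits included, regular open() calls get the UTF-8 layer by default, which is why 63 would let you drop the 'use open' line.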