http://www.perlmonks.org?node_id=893588

december has asked for the wisdom of the Perl Monks concerning the following question:

Hello Fellow Monks,

I have a load of Unicode/locale problems (more posts may be forthcoming) on a properly configured Linux system, and I'm hoping someone can point out why things are not turning out the way I want.

Observe:

# note: system locale works properly
$ LC_CTYPE=fi_FI.UTF-8; export LC_CTYPE

# WITH use locale:
$ perl -CSA -e 'use utf8; use open ":std", IO => ":utf8"; use locale;
    use POSIX qw(locale_h);
    print "LC_CTYPE: ", setlocale(LC_CTYPE), "\n";
    $str = "säv sov san";
    my @arr = $str =~ /\b(\w{3})\b/g;
    print "", join("|", @arr), "\n";'
LC_CTYPE: fi_FI.UTF-8
sov|san

# WITHOUT use locale:
$ perl -CSA -e 'use utf8; use open ":std", IO => ":utf8"; use POSIX qw(locale_h);
    print "LC_CTYPE: ", setlocale(LC_CTYPE), "\n";
    $str = "säv sov san";
    my @arr = $str =~ /\b(\w{3})\b/g;
    print "", join("|", @arr), "\n";'
LC_CTYPE: fi_FI.UTF-8
säv|sov|san

... use locale breaks '\w': it no longer considers "säv" to consist of "word" characters. Since this is a Nordic UTF-8 locale, '\w' should include "öäå". Strangely enough, it works without it.

BTW, even when I drop the -CSA Unicode argument to Perl, use open works as expected with both :utf8 and :locale, so Perl does read my locale settings.

What gives?



PS: Perl v5.10.1 on Debian "unstable".

Re: use locale broken?
by moritz (Cardinal) on Mar 16, 2011 at 19:20 UTC

    If you use properly decoded strings (which you do, since use utf8; is in effect) and no locales, \w, \d etc. follow Unicode semantics, which means they match more than the basic Latin characters.

    I'm not very familiar with locales, but I guess that it expects the strings to be non-decoded binary strings in the encoding specified in the locale (here: UTF-8), so it might work without the utf8 pragma.
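    For illustration, a minimal sketch of that distinction (decoded characters vs. raw bytes in the locale's encoding), using the core Encode module:

        use strict;
        use warnings;
        use utf8;                                  # string literals in this source are decoded
        use Encode qw(encode);

        my $chars = "säv";                         # decoded string: 3 characters
        my $bytes = encode('UTF-8', $chars);       # byte string: 4 bytes ("ä" is 2 bytes in UTF-8)

        printf "characters: %d\n", length $chars;  # prints 3
        printf "bytes:      %d\n", length $bytes;  # prints 4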

    In general I recommend avoiding locales if you can. In my experience they are always a source of trouble, and don't deliver the promised "do what I mean" effect.

      It seems that use locale just doesn't work well for Unicode character sets, because it doesn't consider these locale-specific characters valid word characters. I think it's a problem in Perl, because \w clearly should include the Scandinavian characters "öäå" when such a locale is in effect, Unicode or not.

      But well, I can avoid the buggy locale handling by explicitly converting all input and output to Unicode, regardless of the user's settings. I just wish it had worked...
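      A minimal sketch of that approach, assuming UTF-8 for all input and output (the open pragma and Encode are core modules):

          use strict;
          use warnings;
          use utf8;                                   # the source itself is UTF-8
          use open qw(:std :encoding(UTF-8));         # decode STDIN, encode STDOUT/STDERR
          use Encode qw(decode);

          @ARGV = map { decode('UTF-8', $_) } @ARGV;  # command-line arguments still arrive as bytes

          while (my $line = <STDIN>) {                # already decoded by the open pragma
              print "$1\n" while $line =~ /\b(\w{3})\b/g;
          }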

Re: use locale broken?
by andal (Hermit) on Mar 17, 2011 at 08:53 UTC

    Read the perldoc perlunicode. There you'll find

       Interaction with Locales
           Use of locales with Unicode data may lead to odd results.  Currently, Perl attempts to
           attach 8-bit locale info to characters in the range 0..255, but this technique is
           demonstrably incorrect for locales that use characters above that range when mapped into
           Unicode.  Perl's Unicode support will also tend to run slower.  Use of locales with
           Unicode is discouraged.
    
    In other words, since you are using a UTF-8 encoding for your locale, you don't need "use locale" in your program. Perl will use the appropriate Unicode definitions to handle your strings. When you request locale support, you confuse Perl and get unexpected results.

    Basically, with Perl's Unicode support you don't need to worry about the locale. The locale settings become important only when data leaves the Perl script. When that happens, the environment (for example, the shell) receives just a sequence of bytes, which has to be interpreted somehow; the locale defines how they will be interpreted. So your Perl code has to make sure that the data it outputs is suitable for that interpretation. Effectively, you just need to make sure that your filehandles output data in the correct encoding.
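    For example, one way to pick the I/O encoding from the user's locale is sketched below (POSIX and I18N::Langinfo are core modules); as far as I understand, the :locale layer of the open pragma does essentially the same thing for you:

        use strict;
        use warnings;
        use utf8;                                # literals below are decoded characters
        use POSIX qw(setlocale LC_CTYPE);
        use I18N::Langinfo qw(langinfo CODESET);

        setlocale(LC_CTYPE, "");                 # adopt the locale from the environment
        my $codeset = langinfo(CODESET);         # e.g. "UTF-8" or "ISO-8859-1"

        binmode STDIN,  ":encoding($codeset)";   # decode incoming bytes
        binmode STDOUT, ":encoding($codeset)";   # encode outgoing characters

        print "encoding output as $codeset: säv sov san\n";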

      I was hoping to have it work whether the user's (shell) encoding is ISO-8859-1 or UTF-8. Maybe I'm better off forcefully converting all input and output to UTF-8 and having the code itself deal with Unicode only.

      I still feel this is a bug in Perl, though.

      Is there a way (perhaps a debugging argument) to see what \w applies to?

        Maybe I'm better off forcefully converting all input and output to UTF-8

        Yes. For many reasons, it is best to decode all input and encode all output.

        I still feel this is a bug in Perl, though.

        I believe Perl doesn't support multi-byte locales (e.g. UTF-8).

        Effort is placed on Unicode instead of extending the locale system.

        Is there a way (perhaps a debugging argument) to see what \w applies to?

        perlre: Match a "word" character (alphanumeric plus "_").

        The following are equivalent:

        ( No, this is wrong )

        /\w/                        # When no locale, when not restricted to ASCII
        /\p{Word}/
        /[_\p{Alnum}]/
        /[_\p{Alphabetic}\p{Nd}]/

        Derived property "Alphabetic". (100,520 codepoints in Perl 5.12.2)
        Unicode character category "Nd". (411 codepoints in Perl 5.12.2)

        Actual lists vary by version of Unicode and thus by version of Perl.
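        If you want to see it empirically, here is a quick sketch that lists the code points below U+0300 where \w disagrees with and without use locale in scope (it assumes your UTF-8 locale is exported in the environment):

            #!/usr/bin/perl
            use strict;
            use warnings;
            use POSIX qw(setlocale LC_CTYPE);

            setlocale(LC_CTYPE, "");                 # pick up e.g. fi_FI.UTF-8 from the environment
            binmode STDOUT, ':encoding(UTF-8)';      # some of the characters printed are above U+00FF

            for my $ord (0x20 .. 0x2FF) {
                my $ch = chr $ord;
                utf8::upgrade($ch);                  # treat it as a decoded string, like a "use utf8" literal
                my $uni = $ch =~ /\w/          ? 1 : 0;
                my $loc = do { use locale; $ch =~ /\w/ ? 1 : 0 };
                printf "U+%04X %s  unicode=%d  locale=%d\n", $ord, $ch, $uni, $loc
                    if $uni != $loc;                 # show only the code points where the two disagree
            }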