Locale and Unicode, enemies in perl?

by andal (Friar)
on Mar 18, 2011 at 08:44 UTC ( #893944=perlmeditation )

After reading perldoc perlunicode, it seems that there's some conflict in perl between support for locales and Unicode. At least "use locale" breaks certain features of Unicode that work without it. This got me puzzled: from general considerations, there should be nothing like that. Of course, it might be that my "general considerations" are simply wrong, so I've decided to ask for the opinion of other perl developers.

As far as I understand, Unicode defines almost everything necessary for handling characters. At least perl's Unicode support provides lookups of various character properties ("\p{Uppercase}" etc.), and I believe this is mostly enough for text matching and case conversion. Unicode also provides collation charts, but I don't know if they are supported in perl. Anyway, the point is that perl is pretty smart about handling characters once those are identified.

Where does the conflict with locales come from? Again, as far as I understand, a locale defines a set of rules that are common for the environment. These rules include collation for sorting, character encoding, the language of messages, etc. All of this is advisory, so it shouldn't come into conflict with anything. Why does it conflict with perl's operation?

In general, I would expect locale settings to be the source of defaults for perl. For example, in the absence of "use utf8", perl should assume that the source file is encoded using the character set defined by the locale. Likewise, in the absence of an explicit "binmode" for file handles, perl should assume that the input is encoded using the character set defined by the locale. This would help perl convert from octets into Unicode characters. Once this conversion is done, the locale setting is not needed any more. This means that string matching should not care about the locale, unless it got octets in place of characters for matching.

In short, locale support should just be an extra level in providing defaults. If "use locale" is not present, then the default encoding for "octets" is Latin-1. In the presence of "use locale", the default encoding would be whatever the locale defines.
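For illustration, this is roughly what I mean, approximated by hand with what perl already provides (a minimal sketch only, assuming a POSIX-ish system where I18N::Langinfo can report the codeset):

    use POSIX qw(setlocale LC_CTYPE);
    use I18N::Langinfo qw(langinfo CODESET);

    setlocale(LC_CTYPE, "");                # adopt the environment's locale
    my $codeset = langinfo(CODESET());      # e.g. "UTF-8" or "ISO-8859-1"

    # decode input and encode output according to the locale's character set
    binmode STDIN,  ":encoding($codeset)";
    binmode STDOUT, ":encoding($codeset)";

The open pragma's ":locale" option does something similar for the default I/O layers; the point is that perl could apply such defaults itself.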

If it were done this way, then the code like

    use utf8;
    use locale;
    my $tst = "wär war";
    die "No match\n" unless $tst =~ /(\w+)/;
    print $1, "\n";
would produce the correct output "wär" and not just "w". More than that, the -C switch would not be required for running this code.

Do I miss something in my understanding?

Re: Locale and Unicode, enemies in perl?
by moritz (Cardinal) on Mar 18, 2011 at 09:41 UTC
    Unicode also provides collation charts, but I don't know if they are supported in perl.

    As far as I know not in core, but Unicode::Collate should make them accessible.
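    For instance, a minimal sketch of sorting by the Unicode Collation Algorithm with that module (assuming its default options):

        use utf8;
        use Unicode::Collate;
        binmode STDOUT, ":encoding(UTF-8)";

        my $collator = Unicode::Collate->new;
        # UCA order rather than raw code point order:
        print join(" ", $collator->sort(qw(Zebra Ärger apfel))), "\n";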

    Where does the conflict with locales come from?

    Probably a lack of care. Supporting Unicode correctly is not easy, and supporting locales correctly isn't easy either (and it is hard to test, because you can usually only test the locales provided by the system). Supporting both together probably requires going to great lengths that no core developer is willing to go to.

    So far I found Unicode to be sufficient for my text processing needs, and I guess that most core developers feel the same.

    If somebody stepped up and provided patches that made both of them work together, I'm sure they would be accepted.

Re: Locale and Unicode, enemies in perl?
by JavaFan (Canon) on Mar 18, 2011 at 13:29 UTC
    Where does the conflict with locales come from?
    There are many aspects to a locale, but one of them is a mapping of integers to characters. But that's what Unicode does as well: at its heart, it maps integers to characters.

    Perl strings are a little bit more sophisticated than C strings, but all this sophistication is only on a technical level. Perl strings are still pretty dumb: they're a sequence of integers.

    The problem comes when we consider a string that doesn't have the UTF-8 flag set (a flag which is not only used internally, to tell how the integers mentioned above are encoded in bytes, but also signals that Unicode semantics apply). Suppose one of the integers of the string is 0xDF. Which character is it? Is it a letter? A number? Something else?
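    A small sketch of that ambiguity (assuming a perl without the 5.12+ unicode_strings feature and no use locale in effect):

        my $byte = chr(0xDF);          # no UTF-8 flag: just the integer 0xDF
        print "bytes:   ", ($byte =~ /\w/ ? "word char" : "not a word char"), "\n";

        my $char = chr(0xDF);
        utf8::upgrade($char);          # UTF-8 flag set: Unicode semantics apply
        print "unicode: ", ($char =~ /\w/ ? "word char" : "not a word char"), "\n";
        print "unicode: uc() gives '", uc($char), "'\n";   # "SS" under Unicode rules

    The same integer is "not a word char" in the first case and the letter ß in the second.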

    Do I miss something in my understanding?
    Experts consider locale support in Perl to be broken. Some of them are itching to declare it deprecated. The bugs are considered too complicated to fix.

    If you think there's an easy fix to get things working better than they are now, I'd think p5p would be quite interested in your patches.

Re: Locale and Unicode, enemies in perl?
by Eliya (Vicar) on Mar 18, 2011 at 14:28 UTC
    in the absence of "use utf8", perl should assume that the source file is encoded using the character set defined by the locale.

    I don't think this would be a good idea.  The encoding of source files is something that's tied to the files themselves, not the environment they're run in.  In other words, when moving Perl code to a different locale, you'd risk breaking things (unnecessarily)...

    Other than that, I agree with the tenor of your post and also do think it would be nice to have locales work in combination with Unicode.  After all, locales comprise more than just the definition of valid characters.

    However, I'm not proficient enough with locales (nor with the Perl sources) to help out with patches — so I'm not complaining...  (Heck, I'm not even sure how things are supposed to work in some aspects.  Let's say, with a locale setting of LC_CTYPE=de_DE.UTF-8, should all characters defined in Unicode match \w, or just the ones actually being used in the respective language/region?  For example, both 'ä' and '䕧' (U+4567) are valid letters according to Unicode, but the latter is not a valid letter in German, so one might argue it shouldn't match \w when the de_DE locale is in effect.)
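    To make that concrete, here's a tiny sketch showing that pure Unicode property matching can't express "letters valid in German" by itself:

        use utf8;
        binmode STDOUT, ":encoding(UTF-8)";

        for my $ch ("ä", "\x{4567}") {
            printf "%s matches \\p{L}: %s\n", $ch, ($ch =~ /\p{L}/ ? "yes" : "no");
        }
        # Both say "yes": Unicode knows these are letters, but not which of
        # them are letters in German.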

      Heck, I'm not even sure how things are supposed to work in some aspects. Let's say, with a locale setting of LC_CTYPE=de_DE.UTF-8, should all characters defined in Unicode match \w, or just the ones actually being used in the respective language/region?

      IMHO there should be two primitives, for example \pL for matching Unicode letters, and [[:alpha:]] for locale-based letter matching.

      \w could then die with "please be more specific in your choice of character class" if locales are in effect.

Re: Locale and Unicode, enemies in perl?
by tchrist (Pilgrim) on Apr 10, 2011 at 01:09 UTC
    I’m going to answer the last part first:
    If it were done this way, then the code like
        use utf8;
        use locale;
        my $tst = "wär war";
        die "No match\n" unless $tst =~ /(\w+)/;
        print $1, "\n";
    would produce the correct output "wär" and not just "w". More than that, the -C switch would not be required for running this code. Do I miss something in my understanding?
    Perhaps.

    If you remove that use locale you have there, then you will get wär on every Perl version I’ve tested it with, all the way back to 5.8.0 and all the way up to 5.14 RC0.

    You mention not wanting to have to know to use -C, here specifically -CS. I run with the PERL_UNICODE envariable set appropriately, usually to SA but sometimes to SAD.

    I believe that you will find that 5.14’s implementation of the unicode_strings feature will make everything “just work”, if you will. It won’t know that your source is in UTF-8 — that’s what use utf8 is for — nor will it know what you want done with the encoding of your I/O handles, but I really think it will make a great deal of jaw-clenching go away. It essentially fixes what’s come to be called “the Unicode bug”.
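    Here’s a tiny sketch of the kind of thing that feature changes (assuming a 5.12+ perl):

        use feature "unicode_strings";     # or: use v5.14;

        my $sharp_s = chr(0xDF);           # native 8-bit string, no UTF-8 flag
        print "word char\n" if $sharp_s =~ /\w/;    # matches with the feature on;
                                                    # without it, the match fails
        print length(uc $sharp_s), "\n";   # 2, because uc("ß") is "SS" under Unicode rules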

    I have a very dim view of vendor locales. I work on too many systems where they do not work correctly. Also, Perl does not (currently) work with anything but legacy, 8-bit locales, even when you do everything else right. It specifically does not work with any UTF-8 locale. At least, not from a locale-aware point of view.

    Now, we have for quite a long time been able to sort and compare things using the Unicode::Collate module. This does a lot more than people think it does, including not just case-insensitive but also diacritic-insensitive and punctuation-insensitive matching and sorting.
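    For instance, a minimal sketch of matching that ignores case and diacritics:

        use utf8;
        use Unicode::Collate;

        # level 1 compares base letters only, so case and diacritics don’t count
        my $uca = Unicode::Collate->new(level => 1);
        my $pos = $uca->index("Ähnlichkeit", "ahnlich");
        print "found at $pos\n" if $pos >= 0;      # finds it at offset 0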

    The 5.14 release does include the Unicode::Collate::Locale module, which as of this writing includes support for 59 different named locales. You can install it on earlier versions of Perl if you pull it in from CPAN.
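    A quick sketch of what using one of those named locales looks like (assuming the de__phonebook tailoring is available):

        use utf8;
        use Unicode::Collate::Locale;

        # German phonebook rules treat "ä" like "ae" for sorting purposes
        my $de     = Unicode::Collate::Locale->new(locale => "de__phonebook");
        my @sorted = $de->sort(qw(Müller Muller Mueller));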

    The truth is I nearly never use them, because I find the default Unicode Collation Algorithm (UCA) good enough. I do have a special ucsort program to sort text using the UCA just like the regular sort program, but that’s mostly so you can pass it preprocessing options before it generates the sort key.

    Now back to the first part:

    As far as I understand, Unicode defines almost everything necessary for handling characters. At least perl's Unicode support provides lookups of various character properties ("\p{Uppercase}" etc.), and I believe this is mostly enough for text matching and case conversion. Unicode also provides collation charts, but I don't know if they are supported in perl. Anyway, the point is that perl is pretty smart about handling characters once those are identified.

    So the answer is that I believe that Perl already comes with everything you need. To start with, Perl understands all three Unicode cases: uppercase via the uc function, lowercase via the lc function, and titlecase via the ucfirst function.
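    A quick sketch of all three cases on a character that actually has a distinct titlecase form:

        use utf8;
        binmode STDOUT, ":encoding(UTF-8)";

        my $dz = "\x{01C6}";         # ǆ LATIN SMALL LETTER DZ WITH CARON
        print uc($dz), "\n";         # Ǆ (U+01C4), the uppercase form
        print ucfirst($dz), "\n";    # ǅ (U+01C5), the titlecase form
        print lc("\x{01C4}"), "\n";  # ǆ again, back to lowercase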

    When you’re matching, you can detect these with properties like \p{Upper} and \p{Lower}. Those are binary properties, which are not quite the same as the General Category properties \p{Lu} and \p{Ll}. Other interesting properties include:

    • CI, Case_Ignorable
    • Cased
    • LC, Cased_Letter (General_Category=Cased_Letter)
    • CWCF, Changes_When_Casefolded
    • CWCM, Changes_When_Casemapped
    • CWL, Changes_When_Lowercased
    • CWKCF, Changes_When_NFKC_Casefolded
    • CWT, Changes_When_Titlecased
    • CWU, Changes_When_Uppercased

    You can use those properties to discover that there are 828 lowercase code points as of Unicode 6.0.0 that do not change case when uppercased. Amazing but true.
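    You can verify that figure with a brute-force sketch like this (slow, and the exact count depends on the Unicode tables your perl ships with; 5.14 carries Unicode 6.0.0):

        no warnings "utf8";     # silence surrogate and non-character warnings
        my $count = 0;
        for my $cp (0 .. 0x10FFFF) {
            my $ch = chr $cp;
            $count++ if $ch =~ /\p{Lowercase}/
                    and $ch !~ /\p{Changes_When_Uppercased}/;
        }
        print "$count\n";       # 828 with Unicode 6.0.0 data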

    Another thing to know is that Perl’s case-insensitive pattern matching uses full casefolding, not simple casefolding. That means that a pattern like /s/i will match not just upper- and lowercase s, but also the old-style long s, ſ.
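    For example (a Unicode string is assumed, hence the use utf8 for the literal):

        use utf8;
        my $str = "Weſtminſter";      # spelled with U+017F LATIN SMALL LETTER LONG S
        print "matched\n" if $str =~ /westminster/i;   # case folding maps ſ to s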

    If you have diacritics, you will get used to decomposing your strings, so that you can match, say, an e or an ë or many other things. With some help from the Unicode::Collate module, you can get it to match even more than that, too.
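    Something like this, using the core Unicode::Normalize module:

        use utf8;
        use Unicode::Normalize qw(NFD);
        binmode STDOUT, ":encoding(UTF-8)";

        # After NFD, "ë" is "e" followed by U+0308 COMBINING DIAERESIS, so a
        # base "e" plus optional combining marks matches both spellings.
        for my $word ("Noel", "Noël") {
            print "$word has an e\n" if NFD($word) =~ /e\pM*/;
        }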

    Beyond that, you might also want things like Unicode::LineBreak and Unicode::GCString from CPAN.

      If you remove that use locale you have there, then you will get wär on every Perl version I’ve tested it with, all the way back to 5.8.0 and all the way up to 5.14 RC0.

      That's exactly why I wrote my post :) "use locale" breaks Unicode.

        That's exactly why I wrote my post :) use locale breaks Unicode.

        Oh, this is quite well-known: POSIX locales are for bad old legacy scripts that aren’t Unicode-aware and that rely on some 8-bit encoding for their binary bytes via the LC_CTYPE and/or LC_COLLATE settings, rather than setting the encoding properly and reading everything into Unicode characters instead of icky locale bytes.

        Unicode provides for much more robust character handling than do POSIX locales, whether this is for case mapping, collating, or really anything else having to do with characters.

        Here is a relevant excerpt from Perl 5.14’s perllocale manpage, with underlining mine:

        perl 5.14’s perllocale manpage says...

        Locales these days have mostly been supplanted by Unicode, but Perl continues to support them.

        The support of Unicode is new starting from Perl version 5.6, and more fully implemented in version 5.8, and later. See the perluniintro manpage. Perl tries to work with both Unicode and locales. But, of course, there are problems.

        Perl does not handle multi-byte locales, such as have been used for various Asian languages, such as Big5 or Shift JIS. However, the multi-byte, increasingly common, UTF-8 locales, if properly implemented, tend to work reasonably well in Perl, simply because both they and Perl store the characters that take up multiple bytes the same way.

        Perl generally takes the tack to use locale rules on code points that can fit in a single byte, and Unicode rules for those that can’t (though this wasn’t uniformly applied prior to Perl 5.14). This prevents many problems in locales that aren’t UTF-8. Suppose the locale is ISO8859-7, Greek. The character at 0xD7 there is a capital Chi. But in the ISO8859-1 locale, Latin1, it is a multiplication sign. The POSIX regular expression character class [[:alpha:]] will magically match 0xD7 in the Greek locale, but not in the Latin, even if the string is encoded in UTF-8, which normally would imply Unicode. (The “U” in UTF-8 stands for Unicode.)

        However, there are places where this breaks down. Certain constructs are for Unicode only, such as \p{Alpha}. They assume that 0xD7 always has the Unicode meaning (or its equivalent on EBCDIC platforms). Since Latin1 is a subset of Unicode, 0xD7 is the multiplication sign in Unicode, so \p{Alpha} will not match it, regardless of locale. A similar issue happens with \N{...}. Therefore, it is a bad idea to use \p{} or \N{} under locale unless you know that the locale is always going to be ISO8859-1 or a UTF-8 one. Use the POSIX character classes instead.

        The same problem ensues if you enable automatic UTF-8-ification of your standard file handles, default open() layer, and @ARGV on non-ISO8859-1, non-UTF-8 locales (by using either the -C command line switch or the PERL_UNICODE environment variable; see the perlrun manpage for the documentation of the -C switch). Things are read in as UTF-8 which would normally imply a Unicode interpretation, but the presence of locale causes them to be interpreted in that locale, so a 0xD7 code point in the input will have meant the multiplication sign, but won’t be interpreted by Perl that way in the Greek locale. Again, this is not a problem if you know that the locales are always going to be ISO8859-1 or UTF-8.

        Vendor locales are notoriously buggy, and it is difficult for Perl to test its locale handling code because it interacts with code that Perl has no control over; therefore the locale handling code in Perl may be buggy as well. But if you do have locales that work, it may be worthwhile using them, keeping in mind the gotchas already mentioned. Locale collation is faster than Unicode::Collate, for example, and you gain access to things such as the currency symbol and days of the week.

        BUGS

        Broken systems

        In certain systems, the operating system’s locale support is broken and cannot be fixed or used by Perl. Such deficiencies can and will result in mysterious hangs and/or Perl core dumps when use locale is in effect. When confronted with such a system, please report in excruciating detail to <perlbug@perl.org>, and complain to your vendor: bug fixes may exist for these problems in your operating system. Sometimes such bug fixes are called an operating system upgrade.

        My personal advice is to strongly avoid vendor locales. It’s not a legacy you want to see propagated.
