PerlMonks  

Re^2: Locale and Unicode, enemies in perl?

by andal (Friar)
on Apr 14, 2011 at 06:43 UTC (#899361)


in reply to Re: Locale and Unicode, enemies in perl?
in thread Locale and Unicode, enemies in perl?

If you remove that use locale you have there, then you will get wšr on every Perl version I've tested it with, all the way back to 5.8.0 and all the way up to 5.14 RC0.

That's exactly why I've written my post :) "use locale" breaks the Unicode.


Re^3: Locale and Unicode, enemies in perl?
by tchrist (Pilgrim) on Apr 14, 2011 at 14:57 UTC
    That's exactly why I've written my post :) use locale breaks the Unicode.

    Oh, this is quite well-known: POSIX locales are for bad old legacy scripts that aren't Unicode-aware and that rely on some 8-bit byte encoding and its LC_CTYPE and/or LC_COLLATE values, rather than setting the encoding properly and reading everything into Unicode characters instead of icky locale bytes.

    Unicode provides for much more robust character handling than do POSIX locales, whether this is for case mapping, collating, or really anything else having to do with characters.
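
    As a rough illustration of the collation point, here is a minimal sketch that sorts a made-up word list two ways: with Perl's plain sort (code point order) and with the core Unicode::Collate module using its default table. The word list is hypothetical and only there to show the difference; no use locale is involved.

    ```perl
    use strict;
    use warnings;
    use Unicode::Collate;

    # Hypothetical word list; "\x{C9}" is U+00C9 LATIN CAPITAL LETTER E WITH ACUTE.
    my @words = ("Zebra", "apple", "\x{C9}clair");

    # Plain sort compares code points, so capitals and accented letters
    # land in places a human reader would not expect.
    my @by_codepoint = sort @words;

    # Unicode::Collate applies the Unicode Collation Algorithm with its
    # default table; no locale is consulted.
    my @by_uca = Unicode::Collate->new->sort(@words);

    binmode(STDOUT, ':encoding(UTF-8)');
    print "code point order: @by_codepoint\n";
    print "UCA order:        @by_uca\n";
    ```

    The code point sort puts "Zebra" first and the accented word last, while the collator orders the words alphabetically regardless of case or accent.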

    Here is a relevant excerpt from Perl 5.14's perllocale manpage, with underlining mine:

    perl 5.14's perllocale manpage says...

    Locales these days have mostly been supplanted by Unicode, but Perl continues to support them.

    The support of Unicode is new starting from Perl version 5.6, and more fully implemented in version 5.8, and later. See the perluniintro manpage. Perl tries to work with both Unicode and locales. But, of course, there are problems.

    Perl does not handle multi-byte locales, such as have been used for various Asian languages, such as Big5 or Shift JIS. However, the multi-byte, increasingly common, UTF-8 locales, if properly implemented, tend to work reasonably well in Perl, simply because both they and Perl store the characters that take up multiple bytes the same way.

    Perl generally takes the tack to use locale rules on code points that can fit in a single byte, and Unicode rules for those that can't (though this wasn't uniformly applied prior to Perl 5.14). This prevents many problems in locales that aren't UTF-8. Suppose the locale is ISO8859-7, Greek. The character at 0xD7 there is a capital Chi. But in the ISO8859-1 locale, Latin1, it is a multiplication sign. The POSIX regular expression character class [[:alpha:]] will magically match 0xD7 in the Greek locale, but not in the Latin, even if the string is encoded in UTF-8, which normally would imply Unicode. (The "U" in UTF-8 stands for Unicode.)

    However, there are places where this breaks down. Certain constructs are for Unicode only, such as \p{Alpha}. They assume that 0xD7 always has the Unicode meaning (or its equivalent on EBCDIC platforms). Since Latin1 is a subset of Unicode, 0xD7 is the multiplication sign in Unicode, so \p{Alpha} will not match it, regardless of locale. A similar issue happens with \N{...}. Therefore, it is a bad idea to use \p{} or \N{} under locale unless you know that the locale is always going to be ISO8859-1 or a UTF-8 one. Use the POSIX character classes instead.
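
    The 0xD7 case can be sketched without any locale at all, which is perhaps the cleanest way to see it. The snippet below is an illustration of mine, not part of the manpage: it assumes the core Encode module and decodes the same byte under both encodings, then tests each resulting character against \p{Alpha}.

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode);

    my $octet = "\xD7";    # one byte whose meaning depends on the encoding

    # Decode the same byte under the two legacy encodings discussed above.
    my %char_for = (
        'ISO-8859-1' => decode('ISO-8859-1', $octet),   # U+00D7 MULTIPLICATION SIGN
        'ISO-8859-7' => decode('ISO-8859-7', $octet),   # U+03A7 GREEK CAPITAL LETTER CHI
    );

    # \p{Alpha} always uses the Unicode meaning of the code point.
    for my $encoding (sort keys %char_for) {
        my $char = $char_for{$encoding};
        printf "%-10s -> U+%04X, \\p{Alpha} %s\n",
            $encoding, ord($char),
            $char =~ /\p{Alpha}/ ? "matches" : "does not match";
    }
    ```

    Once the byte has been decoded with the right encoding, the Unicode property gives the expected answer for both the multiplication sign and the capital Chi.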

    The same problem ensues if you enable automatic UTF-8-ification of your standard file handles, default open() layer, and @ARGV on non-ISO8859-1, non-UTF-8 locales (by using either the -C command line switch or the PERL_UNICODE environment variable; see the perlrun manpage for the documentation of the -C switch). Things are read in as UTF-8 which would normally imply a Unicode interpretation, but the presence of locale causes them to be interpreted in that locale, so a 0xD7 code point in the input will have meant the multiplication sign, but won't be interpreted by Perl that way in the Greek locale. Again, this is not a problem if you know that the locales are always going to be ISO8859-1 or UTF-8.

    Vendor locales are notoriously buggy, and it is difficult for Perl to test its locale handling code because it interacts with code that Perl has no control over; therefore the locale handling code in Perl may be buggy as well. But if you do have locales that work, it may be worthwhile using them, keeping in mind the gotchas already mentioned. Locale collation is faster than Unicode::Collate, for example, and you gain access to things such as the currency symbol and days of the week.

    BUGS

    Broken systems

    In certain systems, the operating system's locale support is broken and cannot be fixed or used by Perl. Such deficiencies can and will result in mysterious hangs and/or Perl core dumps when the use locale is in effect. When confronted with such a system, please report in excruciating detail to <perlbug@perl.org>, and complain to your vendor: bug fixes may exist for these problems in your operating system. Sometimes such bug fixes are called an operating system upgrade.

    My personal advice is to strongly avoid vendor locales. It's not a legacy you want to see propagated.

      Note also that often, one doesn't have the choice between Unicode and locales. As a glue language, many Perl programs are written that just have to deal with data produced by other programs - and its format is given.

      Another reason why Perl should keep supporting locales.

        Note also that often, one doesn't have the choice between Unicode and locales. As a glue language, many Perl programs are written that just have to deal with data produced by other programs - and its format is given. Another reason why Perl should keep supporting locales.
        Could you please explain what you mean by that?

        There is a super-huge difference between supporting locales for I/O layers and expecting Perl to subvert its entire internal character representation scheme of Unicode. The former makes sound sense; the latter does not.

        It's one thing to be able to handle input and output in some particular locale-dependent encoding, say hr_HR.ISO8859-2, zh_HK.Big5HKSCS, ru_RU.koi8r, or sv_SE.ISO8859-15.

        However, it's quite another to demand that Perl support a completely different scheme for how it internally stores and handles its own characters. That really is not reasonable. Render unto Caesar the things that are Caesar's and all that: the outside world does not get to impose its own provincial ideas on how Perl stores and handles its own characters! Nobody should expect Perl to store the characters in its own memory using some ancient and antiquated Microsoft byte encoding, let alone follow its silly rules.
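
        A minimal sketch of that split, assuming the legacy data happens to be ISO-8859-2: the encoding lives only in the I/O layers, and everything in between works on Unicode characters. The sample bytes and the in-memory filehandles are stand-ins of mine for a real legacy file; nothing here uses use locale.

        ```perl
        use strict;
        use warnings;

        # Hypothetical legacy input: the Croatian word for "tea" as
        # ISO-8859-2 bytes (0xE8 is LATIN SMALL LETTER C WITH CARON).
        my $legacy_bytes = "\xE8aj\n";

        # Decode at the input boundary with an :encoding layer.
        open(my $in, '<:encoding(ISO-8859-2)', \$legacy_bytes)
            or die "cannot open in-memory input: $!";
        my $line = <$in>;
        close $in;
        chomp $line;

        # Inside the program the data is Unicode characters, so case
        # mapping follows Unicode rules rather than the locale's.
        my $upper = uc $line;

        # Encode at the output boundary, here as UTF-8.
        my $out_bytes = '';
        open(my $out, '>:encoding(UTF-8)', \$out_bytes)
            or die "cannot open in-memory output: $!";
        print {$out} $upper;
        close $out;

        printf "first character read: U+%04X\n", ord $line;
        printf "bytes written: %s\n",
            join ' ', map { sprintf '%02X', ord } split //, $out_bytes;
        ```

        Swapping the output layer back to ISO-8859-2 would hand the result to a legacy consumer in its own encoding, still without Perl storing anything internally in that encoding.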

        The sole reason the locale facility even exists in the first place is because it happened to enter Perl before Unicode did. It should be considered nothing more than a tiny corner in which a niche legacy continues to be supported, kinda-sorta and rather limply, for no other reason than so pre-existing Perl programs might continue to work in legacy mode without requiring any updates.

        System locales are a terrible pain, and a great way to write code that is at best guaranteed to be completely anti-portable. The sooner people upgrade from their legacy codesets, the better.

        Note that I am specifically referring here to such matters as LC_CTYPE and LC_COLLATE. I am not talking about things like money or dates.
