
I’m going to answer the last part first:

If it were done this way, then the code like
    use utf8;
    use locale;
    my $tst = "wär war";
    die "No match\n" unless $tst =~ /(\w+)/;
    print $1, "\n";
would produce correct output "wär" and not "w". More than that, the switch -C would not be required for running this code. Do I miss something in my understanding?

Perhaps.

If you remove that use locale you have there, then you will get wär on every Perl version I’ve tested it with, all the way back to 5.8.0 and all the way up to 5.14 RC0.

You mention not wanting to have to use -C, which here would be -CS. I run with the PERL_UNICODE environment variable set appropriately, usually to SA but sometimes to SAD.
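Here is a minimal sketch of that same test with use locale dropped. It assumes that neither -C nor PERL_UNICODE is already in effect, so it sets the output encoding itself with binmode. The variable $word is mine, not from the original code.

```perl
use strict;
use warnings;
use utf8;                              # the source code itself is UTF-8

# Encode output explicitly; this stands in for -CS or PERL_UNICODE=S.
binmode(STDOUT, ":encoding(UTF-8)");

my $tst = "wär war";
my ($word) = $tst =~ /(\w+)/
    or die "No match\n";
print $word, "\n";                     # the whole first word, not just "w"
```

If you already run with PERL_UNICODE set to include S, leave out the binmode line so the output is not encoded twice.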

I believe that you will find that 5.14’s implementation of the unicode_strings feature will make everything “just work”, if you will. It won’t know that your source is in UTF-8 (that’s what use utf8 is for), nor will it know what you want done with the encoding of your I/O handles, but I really think it will make a great deal of jaw-clenching go away. It essentially fixes what’s come to be called “the Unicode bug”.
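To make “the Unicode bug” concrete, here is a small sketch that runs the same operations on the same string with and without the feature. It assumes Perl 5.14 or later, since that is where unicode_strings also covers regex matching. The variable names are only illustrative.

```perl
use strict;
use warnings;

# "\xE4" is ä stored as one native 8-bit character (UTF-8 flag off).
my $a_umlaut = "\xE4";

my ($plain_uc, $plain_is_word);
{
    # Default semantics: in such a string, characters 128..255
    # have no case mapping and do not count as \w.
    $plain_uc      = uc $a_umlaut;
    $plain_is_word = $a_umlaut =~ /\w/ ? 1 : 0;
}

my ($fixed_uc, $fixed_is_word);
{
    # Same string, same operations, but with Unicode semantics throughout.
    use feature 'unicode_strings';
    $fixed_uc      = uc $a_umlaut;
    $fixed_is_word = $a_umlaut =~ /\w/ ? 1 : 0;
}

printf "without feature: uc gives U+%04X, \\w matches: %d\n",
    ord($plain_uc), $plain_is_word;
printf "with feature:    uc gives U+%04X, \\w matches: %d\n",
    ord($fixed_uc), $fixed_is_word;
```

The feature is lexically scoped, which is why each case sits in its own block.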

I have a very dim view of vendor locales. I work on too many systems where they do not work correctly. Also, Perl does not (currently) work with anything but legacy, 8-bit locales, even when you do everything else right. It specifically does not work with any UTF-8 locale. At least, not from a locale-aware point of view.

Now, we have for quite a long time been able to sort and compare things using the Unicode::Collate module. This does a lot more than people think it does, including not just case-insensitive but also diacritic-insensitive and punctuation-insensitive matching and sorting.
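As a rough illustration of both uses, the sketch below sorts a short list with the default collator and then compares two strings at level 1, which looks only at base letters. The word list and variable names are made up for the example.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode(STDOUT, ":encoding(UTF-8)");

my @words = ("zebra", "Ärger", "apple");

# Plain sort compares code points, so "Ärger" lands after "zebra".
my @by_codepoint = sort @words;

# The default UCA ordering treats Ä as a variant of A.
my $collator = Unicode::Collate->new;
my @by_uca   = $collator->sort(@words);

# level => 1 compares base letters only, ignoring accents and case.
my $loose = Unicode::Collate->new(level => 1);
my $same  = $loose->eq("résumé", "Resume") ? 1 : 0;

print "code point order: @by_codepoint\n";
print "UCA order:        @by_uca\n";
print "résumé eq Resume at level 1: $same\n";
```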

The 5.14 release does include the Unicode::Collate::Locale module, which as of this writing includes support for 59 different named locales. You can install it on earlier versions of Perl if you pull it in from CPAN.
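For a sense of what a named locale changes, here is a small sketch using the Swedish tailoring, chosen only because it sorts ä after z. It assumes the installed Unicode::Collate::Locale includes the sv locale.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;
use Unicode::Collate::Locale;

binmode(STDOUT, ":encoding(UTF-8)");

my @words = ("zebra", "ärlig", "apple");

# Default UCA: ä sorts alongside a.
my @default = Unicode::Collate->new->sort(@words);

# Swedish tailoring: ä is a separate letter that comes after z.
my $sv      = Unicode::Collate::Locale->new(locale => "sv");
my @swedish = $sv->sort(@words);

print "default UCA: @default\n";
print "locale ", $sv->getlocale, ": @swedish\n";
```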

The truth is I nearly never use them, because I find the default Unicode Collation Algorithm (UCA) good enough. I do have a special ucsort program to sort text using the UCA just like the regular sort program, but that’s mostly so you can pass it preprocessing options before it generates the sort key.

Now back to the first part:

As far as I understand, Unicode defines almost everything necessary for handling characters. At least Unicode support of perl provides lookup for various properties of characters ("\p{Uppercase}" etc.) I believe this is mostly enough for text matching and case conversion. Unicode also provides collation charts, but I don't know if they are supported in perl. Anyway. The point is, perl is pretty smart with handling characters once those are identified.

So the answer is that I believe that Perl already comes with everything you need. To start with, Perl understands all three Unicode cases: uppercase via the uc function, lowercase via the lc function, and titlecase via the ucfirst function.
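A quick way to see all three cases is a character whose titlecase differs from its uppercase. The sketch below uses U+01C6, my choice of example, which has a distinct form for each case.

```perl
use strict;
use warnings;

binmode(STDOUT, ":encoding(UTF-8)");

# U+01C6 LATIN SMALL LETTER DZ WITH CARON has three distinct case forms.
my $lower = "\x{1C6}";         # ǆ
my $upper = uc $lower;         # Ǆ, U+01C4
my $title = ucfirst $lower;    # ǅ, U+01C5

printf "lower U+%04X, upper U+%04X, title U+%04X\n",
    ord($lower), ord($upper), ord($title);
print "lc brings the uppercase form back\n" if lc($upper) eq $lower;
```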

When you’re matching, you can detect these with properties like \p{Upper} and \p{Lower}. Those are binary properties, which are not quite the same as the general categories \p{Lu} and \p{Ll}. Other interesting properties include:

CI Case_Ignorable
Cased
LC Cased_Letter General_Category=Cased_Letter
CWCF Changes_When_Casefolded
CWCM Changes_When_Casemapped
CWL Changes_When_Lowercased
CWKCF Changes_When_NFKC_Casefolded
CWT Changes_When_Titlecased
CWU Changes_When_Uppercased

You can use those properties to discover that there are 828 lowercase code points as of Unicode 6.0.0 that do not change case when uppercased. Amazing but true.
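A count like that can be reproduced with a brute-force loop over every code point, roughly as sketched below. The number you get depends on the Unicode version bundled with your perl, so it will not necessarily be the figure quoted above.

```perl
use strict;
use warnings;
no warnings 'utf8';    # stay quiet about noncharacter code points

my $count = 0;
for my $cp (0 .. 0x10FFFF) {
    next if $cp >= 0xD800 && $cp <= 0xDFFF;    # skip surrogates
    my $ch = chr $cp;
    # Lowercase by the binary property, yet unchanged by uppercasing.
    $count++ if $ch =~ /\p{Lower}/ && $ch !~ /\p{CWU}/;
}
print "Lowercase code points unchanged by uppercasing: $count\n";
```

It walks over a million code points, so expect it to take a moment.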

Another thing to know is that Perl’s case-insensitive pattern matching uses full casefolding, not simple. That means that a pattern like /s/i will match not just upper- and lowercase s, but also the old-style long s, ſ.
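Here is a small sketch of what that looks like in practice. The long s case is the one mentioned above; the Kelvin sign and the ß example are my additions, with ß being the one that needs a multi-character fold. It assumes a reasonably recent perl, 5.14 or later.

```perl
use strict;
use warnings;
use utf8;

# U+017F LATIN SMALL LETTER LONG S folds to "s".
my $long_s = "ſ" =~ /s/i ? 1 : 0;

# U+212A KELVIN SIGN folds to "k".
my $kelvin = "\x{212A}" =~ /k/i ? 1 : 0;

# ß folds to the two-character sequence "ss" under full casefolding.
my $sharp_s = "Straße" =~ /^strasse$/i ? 1 : 0;

print "long s matches /s/i:        $long_s\n";
print "Kelvin sign matches /k/i:   $kelvin\n";
print "Straße matches /strasse/i:  $sharp_s\n";
```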

If you have diacritics, you will get used to decomposing your strings, so that you can match, say, an e or an ë or an ễ or many other things. With some help from the Unicode::Collate module, you can get it to match even more than that, too.
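The decomposition trick can be sketched with the core Unicode::Normalize module: run the string through NFD, then match the base letter followed by any number of combining marks. The test strings are just examples.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD);

binmode(STDOUT, ":encoding(UTF-8)");

my @matched;
for my $str ("e", "ë", "ễ", "a") {
    # After NFD, an accented e is a plain e followed by combining marks (\pM).
    push @matched, $str if NFD($str) =~ /^e\pM*$/;
}
print "matched: @matched\n";
```

The same pattern without the NFD step would miss any input that arrives in precomposed form.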

Beyond that, you might also want things like Unicode::LineBreak and Unicode::GCString from CPAN.

In reply to Re: Locale and Unicode, enemies in perl? by tchrist
in thread Locale and Unicode, enemies in perl? by andal

	PerlMonks