Re^2: Regex Matching Unicode and Regex Classes

Hi Moritz,

but what, then, is the difference from the third case? Are the "default Unicode semantics" changed to something different when a locale is enabled?

Why is "U+00E4 LATIN SMALL LETTER A WITH DIAERESIS" under a locale anything other than a letter that is part of a word?

Best regards
Andreas

Replies are listed 'Best First'.
Re^3: Regex Matching Unicode and Regex Classes by moritz (Cardinal) on Nov 02, 2011 at 14:41 UTC
Short answer: because Unicode and locales don't mix.

Long answer: Perl's support for locales comes from a time before the whole encoding/decoding business and Unicode support. So if locales are active, the locale-sensitive parts expect to act on bytes, not on decoded strings.

Since the locale is not ISO-8859-1 but UTF-8, encoding to Latin-1 doesn't fix it for you. If anything, you'd need to encode to UTF-8 to see the \w matching ä, but even then I don't see it matching. So either my understanding of locales is very wrong, or perl is broken (or a mixture thereof).

Perl 6 - second systems done right
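To make the bytes-versus-decoded-strings distinction concrete, here is a minimal sketch with no locale in scope at all. The helper name is_single_word_char is made up for this illustration, and the byte string is written as hex escapes so the result does not depend on how the source file is encoded. It only shows the locale-free case; it does not reproduce the behaviour under "use locale" discussed above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the string is exactly one \w character.
# No "use locale" is in scope here, so a decoded string gets Unicode rules.
sub is_single_word_char {
    my ($string) = @_;
    return $string =~ /\A\w\z/ ? 1 : 0;
}

my $bytes = "\xC3\xA4";               # UTF-8 encoding of U+00E4: two bytes
my $chars = decode('UTF-8', $bytes);  # decoded string: one character

printf "bytes: length %d\n", length $bytes;
printf "chars: length %d, word character: %s\n",
    length $chars, is_single_word_char($chars) ? 'yes' : 'no';
```

The decoded string has length 1 and matches \w, while the undecoded input is still two separate bytes as far as Perl is concerned.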
Re^4: Regex Matching Unicode and Regex Classes by McA (Priest) on Nov 02, 2011 at 14:48 UTC
Hi Moritz,

that sounds plausible, but not satisfying. ;-) What is then the right approach to find word boundaries with regex while locale is enabled?

Best regards
Andreas
Re^5: Regex Matching Unicode and Regex Classes by moritz (Cardinal) on Nov 02, 2011 at 14:53 UTC
Don't enable a locale for that. When you need a locale, only enable it in a lexical scope that is as small as possible.
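A rough sketch of what "as small as possible" could look like: the locale pragma is confined to one sub, and the word matching lives outside it. Both helper names (compare_with_locale, count_words) are invented for this example, and the sample text is built from hex escapes and upgraded explicitly so that it is treated as a decoded character string. Take it as an illustration of lexical scoping, not as a tested recipe for every perl version.

```perl
use strict;
use warnings;

# Hypothetical helper: the only place where locale rules are switched on.
sub compare_with_locale {
    my ($left, $right) = @_;
    use locale;                 # lexically scoped: ends with this sub's block
    return $left cmp $right;    # collation follows the current LC_COLLATE
}

# Hypothetical helper: compiled outside the block above, so no locale applies
# and \w and \b follow Unicode rules for decoded strings.
sub count_words {
    my ($text) = @_;
    my @words = $text =~ /\b\w+\b/g;
    return scalar @words;
}

my $text = "M\x{E4}dchen l\x{E4}uft";   # two words, each containing U+00E4
utf8::upgrade($text);                   # make sure Perl treats it as characters

printf "words found: %d\n", count_words($text);
printf "cmp under locale: %d\n", compare_with_locale('abc', 'abc');
```

Because "use locale" is lexically scoped, the regex in count_words is unaffected by it even though both subs live in the same file.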
Re^5: Regex Matching Unicode and Regex Classes by choroba (Cardinal) on Nov 02, 2011 at 14:58 UTC
I gave you a solution in Re: Regex Matching Unicode and Regex Classes. But be careful and watch for bugs.