http://www.perlmonks.org?node_id=799249

kappa has asked for the wisdom of the Perl Monks concerning the following question:

Good day, fellow monks!

That's what we read in perl5110delta:

The key change here is that \d will no longer match every digit in the unicode standard (there are thousands) nor will \w match every word character in the standard, instead they will match precisely their POSIX or Perl definition.

AFAIU, that means \w will no longer match non-ASCII letters in Unicode strings. I have just built a fresh 5.11 and I don't see this change:

% perl -v # my system perl
This is perl, v5.10.0 built for i486-linux-gnu-thread-multi
% perl -E '"\x{432}" =~ /\w/ and say "matched"' # cyrillic letter v
matched

Now on to 5.11:

% ~/work/perl-5.11.0/perl -v
This is perl, v5.11.0 (*) built for i686-linux
...
% ~/work/perl-5.11.0/perl -I lib -CS -E '"\x{432}" =~ /\w/ and say "matched"'
matched

On one hand, this is quite a relief as it means I don't have a lot of very broken code on me. On the other hand, this contradicts my understanding of the declared change.

Can someone enlighten me?

--kap

Re: Regarding the new \w regexp escape in 5.11
by demerphq (Chancellor) on Oct 05, 2009 at 14:45 UTC

    I goofed. There are three codepaths for \s and \w and \d and I missed two(!!). Yes. I am embarrassed. Especially that it escaped into the wild without anyone noticing. It is a dev release, though.

    Also just because your code is totally b0rked with no easy work around right now in 5.11.0 doesn't mean that 5.12 will have the same problem. The sky is officially NOT falling.

    ---
    $world=~s/war/peace/g

      If we're going to make this change (which appears to be compatible with other Unicode-handling modern regexen such as Python and PCRE), we should at least provide a way out for the user who wants true Unicode support without having to jump through lots of hoops. Python, for example, does this with (?u). Since Perl 5 uses (?letter) to map to the modifier letters, it seems obvious to make this a modifier :u, which should probably be turned on by default with "use locale".

      Doing that gives the expected behavior for POSIX-friendly uses and yet avoids snubbing users of P5 regexes who routinely match text from other languages/regions.

      naïveté (n) - Assuming your experiences map cleanly to the set of all experiences....

        I'm wondering if you somehow dropped a "not" in your first parenthetical remark.

        The fundamental problem here is that \w behaves differently depending on whether the string is utf8 or not. We want to make it so \w does the same thing regardless. That means that we end up breaking someone's code. I really don't want to have to support three modes: one for the current broken behaviour, one for utf8 and one for ascii. I would much rather just support one mode, and have it be able to cover all the bases. Whether this is feasible or not going forward isn't clear.
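The inconsistency described above can be sketched as follows. This is only an illustration, assuming a perl running with its default regex rules (no `use feature 'unicode_strings'`, no `use locale`, no regex modifiers); the variable names are made up for the example. The same character is stored two ways, the strings compare equal, and \w is then asked about each. On perls with this inconsistency the native string is expected not to match while the upgraded one does.

```perl
use strict;
use warnings;

# Assumption: default regex semantics (no unicode_strings feature,
# no locale, no modifiers on the pattern).
my $native   = "\xE9";        # LATIN SMALL LETTER E WITH ACUTE, one byte internally
my $upgraded = $native;
utf8::upgrade($upgraded);     # same character, now stored internally as UTF-8

# The two strings are equal as far as Perl code can tell...
my $same = $native eq $upgraded;

# ...yet \w can give a different answer for each of them.
my $native_matches   = $native   =~ /\w/ ? 1 : 0;
my $upgraded_matches = $upgraded =~ /\w/ ? 1 : 0;

print  "strings equal:     ", ($same ? "yes" : "no"), "\n";
printf "native   =~ /\\w/:  %s\n", $native_matches   ? "matched" : "no match";
printf "upgraded =~ /\\w/:  %s\n", $upgraded_matches ? "matched" : "no match";
```

Nothing here depends on 5.11 specifically; it only shows why "the same thing regardless" cannot be achieved without changing the result for one of the two cases.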

        Feel free to provide more details on how these issues are tackled in other languages.

        ---
        $world=~s/war/peace/g

Re: Regarding the new \w regexp escape in 5.11
by ikegami (Patriarch) on Oct 05, 2009 at 14:41 UTC
    The change isn't fully implemented yet. The problem is known and will be fixed.

      Thanks!

      Reading demerphq's comments, I understand that the way it's going to be fixed is not decided yet. So I'll watch this space closely now :)

      --kap

        This seems definite:

                                  Any version  Before 5.12                Deterministic Behaviour Only in 5.12+
        To match [0-9]            /[0-9]/      /\d/ (sometimes)           /\d/ (always)
                                               /[[:digit:]]/ (sometimes)  /\p{PosixDigit}/
        To match a Unicode Digit  /\p{Digit}/  /\d/ (sometimes)
                                               /[[:digit:]]/ (sometimes)

        I don't know what's going to happen to /[[:digit:]]/.

        It's not clear (if it's been decided) whether the deterministic behaviour will be active by default in 5.12+, or whether a pragma (use 5.012;) will activate it.

        If the deterministic behaviour will be active by default in 5.12+, there may be a pragma to deactivate it.
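The "Any version" column of the table can be checked with a short script. This is a minimal sketch: the `classify` helper and the choice of sample characters are assumptions made for illustration, and it deliberately leaves out /\d/ and /[[:digit:]]/, whose behaviour is the open question in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: reports whether a one-character string matches
# the ASCII-only class and the Unicode digit property.
sub classify {
    my ($char) = @_;
    my $ascii   = $char =~ /[0-9]/     ? 1 : 0;   # only the ten ASCII digits
    my $unicode = $char =~ /\p{Digit}/ ? 1 : 0;   # any Unicode decimal digit
    return ($ascii, $unicode);
}

my @samples = (
    "7",          # DIGIT SEVEN
    "\x{0663}",   # ARABIC-INDIC DIGIT THREE
    "x",          # not a digit at all
);

for my $char (@samples) {
    my ($ascii, $unicode) = classify($char);
    printf "U+%04X  [0-9]=%d  \\p{Digit}=%d\n", ord($char), $ascii, $unicode;
}
```

Writing the class out explicitly, as in either pattern above, sidesteps the "sometimes" entries in the table regardless of which perl runs the code.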