
Matching behavior with (?^u)

by snoozeagain (Initiate)
on Sep 13, 2012 at 15:52 UTC ( #993522=perlquestion )
snoozeagain has asked for the wisdom of the Perl Monks concerning the following question:

Dear all,

can someone enlighten me as to why two of the matches below fail in 5.14 onward? The only difference I see seems to be the implicitly added u modifier, which according to perlre “is not likely to be of much use to me, and so I need not worry about it very much.”

use utf8;
use 5.14.0;
use Data::Dumper;
$Data::Dumper::Terse  = 1;
$Data::Dumper::Indent = 0;
use Encode;
binmode STDOUT, ":encoding(utf-8)";

# The same character written three ways, each with and without a
# bracketed class and a + quantifier.
my $s   = "ä";
my @res = (
    qr/\N{U+00e4}/,   qr/\N{U+00e4}+/,
    qr/\xe4/,         qr/\xe4+/,
    qr/[\N{U+00e4}]/, qr/[\N{U+00e4}]+/,
    qr/[\xe4]/,       qr/[\xe4]+/,
    qr/ä/,            qr/ä+/,
    qr/[ä]/,          qr/[ä]+/,
);

for my $re (@res) {
    my $m = ($s =~ $re) ? "true" : "false";
    printf "%s =~ %30s == %s\n", $s, decode("UTF-8",Dumper($re)), $m;
}

For me, using 5.14.2 or 5.16.0 this returns:

~/perl5/perlbrew/perls/perl-5.16.0/bin/perl
ä =~ qr/(?^u:\N{U+00e4})/ == true
ä =~ qr/(?^u:\N{U+00e4}+)/ == true
ä =~ qr/(?^u:\xe4)/ == true
ä =~ qr/(?^u:\xe4+)/ == true
ä =~ qr/(?^u:[\N{U+00e4}])/ == true
ä =~ qr/(?^u:[\N{U+00e4}]+)/ == false
ä =~ qr/(?^u:[\xe4])/ == true
ä =~ qr/(?^u:[\xe4]+)/ == true
ä =~ qr/(?^u:ä)/ == true
ä =~ qr/(?^u:ä+)/ == true
ä =~ qr/(?^u:[ä])/ == true
ä =~ qr/(?^u:[ä]+)/ == false

while on 5.12.4 everything matches.
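
For anyone who wants to check a particular perl quickly, the report boils down to something like the sketch below. It is not part of the original post. It assumes that upgrading the string with utf8::upgrade() puts it in the same internal state as the "ä" literal written under "use utf8", and the labels and the $failures counter are only illustrative names.

```perl
use strict;
use warnings;

# Assumption: upgrading to Perl's internal UTF-8 form mimics the state of
# the "ä" literal written under "use utf8" in the question.
my $s = "\x{e4}";
utf8::upgrade($s);

# A quantified single-character class is what the question reports as
# failing; the other two forms are listed for contrast.
my @checks = (
    [ '[\N{U+00e4}]',  qr/[\N{U+00e4}]/  ],
    [ '[\N{U+00e4}]+', qr/[\N{U+00e4}]+/ ],
    [ '[\xe4]+',       qr/[\xe4]+/       ],
);

my $failures = 0;
for my $check (@checks) {
    my ($label, $re) = @$check;
    my $matched = ($s =~ $re) ? 1 : 0;
    $failures++ unless $matched;
    printf "%-16s %s\n", $label, $matched ? "matches" : "does NOT match";
}

# Report the running perl so results from several builds can be compared.
printf "perl %vd: %d failing pattern(s)\n", $^V, $failures;
```

Running it under several perlbrew installations should make it easy to see which builds are affected.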


Re: Matching behavior with (?^u)
by dave_the_m (Prior) on Sep 14, 2012 at 06:52 UTC
    This was fixed in 5.17.3 by the following commit:
    commit 34b39fc9cd81fbff0d52451a5c4570293817ca32
    Author: Karl Williamson <>
    Date:   Thu Aug 9 14:38:03 2012 -0600

        regcomp.c: Set flags when optimizing a [char class]

        A bracketed character class containing a single Latin1-range character
        has long been optimized into an EXACT node.  Also, flags are set to
        include SIMPLE.  However, EXACT nodes containing code points that are
        different when encoded under UTF-8 versus not UTF-8 should not be
        marked simple.

        To fix this, the address of the flags parameter is now passed to
        regclass(), the function that parses bracketed character classes, which
        now sets it appropriately.  The unconditional setting of SIMPLE that
        was always done in the code after calling regclass() has been removed.
        In addition, the setting of the flags for EXACT nodes has been pushed
        into the common function that populates them.

        regclass() will also now increment the naughtiness count if optimized
        to a node that normally does that.  I do not understand this heuristic
        behavior very well, and could not come up with a test case for it;
        experimentation revealed that there are no test cases in our test suite
        for which naughtiness makes any difference at all.
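
    The phrase "code points that are different when encoded under UTF-8 versus not UTF-8" can be illustrated with the character from the question. The sketch below is not taken from the commit; it only shows, using the Encode module, that U+00E4 is one byte in Latin-1 but two bytes in UTF-8. The helper hex_bytes() is a hypothetical name introduced for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
}

my $char = "\x{e4}";    # U+00E4, LATIN SMALL LETTER A WITH DIAERESIS

my $latin1_hex = hex_bytes(encode('ISO-8859-1', $char));
my $utf8_hex   = hex_bytes(encode('UTF-8',      $char));

print "Latin-1 bytes: $latin1_hex\n";    # e4
print "UTF-8 bytes:   $utf8_hex\n";      # c3 a4
```

    ASCII characters, by contrast, are the same single byte in both encodings, which is presumably why only the upper Latin-1 range needs the special treatment the commit describes.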


      Thanks Dave.

      “there are no test cases in our test suite for which naughtiness makes any difference at all”—at least that made for an amusing start into the day…

Re: Matching behavior with (?^u)
by tobyink (Abbot) on Sep 13, 2012 at 16:02 UTC

    Hmm... odd.

    Interesting to note the difference between qr/[ä]+/, qr/[äX]+/ and qr/[Xä]+/.

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
