PerlMonks  

Re^2: Is this a bug in perl regex engine or in my brain?

by Crackers2 (Parson)
on Oct 06, 2015 at 17:58 UTC (#1143957=note)


in reply to Re: Is this a bug in perl regex engine or in my brain?
in thread Is this a bug in perl regex engine or in my brain?

Weird. I can reproduce the OP's issue:
$ cat /tmp/x
my $regex = '(2[0-4]|1?[0-9])?[0-9]|25[0-5]';
while (<>) {
    chomp;
    if ($_ =~ /^$regex$/) {
        print "$_ matched\n";
    }
    else {
        print "$_ did not match\n";
    }
}
$ perl /tmp/x
100
100 matched
200
200 matched
300
300 matched
^C
In fact adding any "|<something>" seems to trigger it, i.e.
my $regex = '(2[0-4]|1?[0-9])?[0-9]|a';
gives the exact same result, and additionally matches anything starting with "a".

Aha. Looks like switching from

my $regex = '(2[0-4]|1?[0-9])?[0-9]|25[0-5]';
to
my $regex = qr/(2[0-4]|1?[0-9])?[0-9]|25[0-5]/;
seems to fix it. I don't immediately see why though.

Re^3: Is this a bug in perl regex engine or in my brain?
by AnomalousMonk (Bishop) on Oct 06, 2015 at 20:48 UTC
    Looks like switching from
    my $regex = '(2[0-4]|1?[0-9])?[0-9]|25[0-5]';
    to
    my $regex = qr/(2[0-4]|1?[0-9])?[0-9]|25[0-5]/;
    seems to fix it. I don't immediately see why though.

    It's a regex metacharacter/operator precedence issue.

    The regex  | (alternation) operator has a low (the lowest?) precedence among regex operators. When a raw string like
        my $regex = '(2[0-4]|1?[0-9])?[0-9]|25[0-5]';
    is interpolated into
        /^$regex$/
    the final regex becomes
        /^(2[0-4]|1?[0-9])?[0-9]|25[0-5]$/

    The  ^ start-of-string assertion is effectively grouped and evaluated with the  (2[0-4]|1?[0-9])?[0-9] expression and disconnected by the alternation from the  25[0-5]$ expression. IOW, the regex will match any string with a  [0-9] at the minimum (everything else is optional) at the start or with a  25[0-5] at the end, and nothing else in the string matters!

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $regex = '(2[0-4]|1?[0-9])?[0-9]|25[0-5]';
     while (<>) {
         chomp;
         if ($_ =~ /^$regex$/) { print qq{'$_' matched}; }
         else { print qq{'$_' did not match}; }
     }
    "
    100
    '100' matched
    z100
    'z100' did not match
    z255
    'z255' matched
    z250
    'z250' matched
    100z
    '100z' matched
    99
    '99' matched
    9999999
    '9999999' matched
    99Yikes!99
    '99Yikes!99' matched
    1
    '1' matched
    11
    '11' matched
    111
    '111' matched
    22
    '22' matched
    222
    '222' matched
    33
    '33' matched
    333
    '333' matched
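If the pattern has to stay a raw string for some reason, wrapping the interpolation in an explicit non-capturing group should restore the intended anchoring. The following is only a minimal sketch of that idea; the helper names matches_ungrouped and matches_grouped are hypothetical and exist purely for this illustration.

```perl
use strict;
use warnings;

# Raw string holding the pattern, exactly as in the original post.
my $raw = '(2[0-4]|1?[0-9])?[0-9]|25[0-5]';

# Hypothetical helper: interpolates the raw string as-is, so ^ binds only
# to the first alternative and $ only to the last one.
sub matches_ungrouped { my ($s) = @_; return $s =~ /^$raw$/ ? 1 : 0 }

# Hypothetical helper: (?:...) keeps both anchors outside the alternation.
sub matches_grouped { my ($s) = @_; return $s =~ /^(?:$raw)$/ ? 1 : 0 }

# Compare both variants on a few of the inputs used in this thread.
for my $s (qw(100 300 z255 99Yikes!99 255 256)) {
    printf "%-12s ungrouped=%d grouped=%d\n",
        "'$s'", matches_ungrouped($s), matches_grouped($s);
}
```

The grouped form is essentially what qr// arranges automatically, which is why the explicit group is only needed for raw strings.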

    In contrast, choroba used a  qr// operator to define the  $regex object (in fact, a Regexp object). (Update: See  qr// in Regexp Quote-Like Operators in perlop.) This is not the same as a raw string! Among other things, the  qr// operator adds a non-capturing  (?:pat) group around the whole expression that, in this application, effectively preserves the desired association between start- and end-of-string assertions after interpolation:
        my $regex = qr/(2[0-4]|1?[0-9])?[0-9]|25[0-5]/;
    becomes
        (?:(2[0-4]|1?[0-9])?[0-9]|25[0-5])
    and is interpolated into
        /^$regex$/
    as
        /^(?:(2[0-4]|1?[0-9])?[0-9]|25[0-5])$/
    which can be read as "start-of-string, then one of a set of alternations in the range 0-255, then end-of-string" and which gives the desired number range discrimination.

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $regex = qr/(2[0-4]|1?[0-9])?[0-9]|25[0-5]/;
     while (<>) {
         chomp;
         if ($_ =~ /^$regex$/) { print qq{'$_' matched}; }
         else { print qq{'$_' did not match}; }
     }
    "
    0
    '0' matched
    1
    '1' matched
    100
    '100' matched
    1000
    '1000' did not match
    25
    '25' matched
    255
    '255' matched
    256
    '256' did not match
    a1
    'a1' did not match
    1a
    '1a' did not match
    11
    '11' matched
    111
    '111' matched
    222
    '222' matched
    333
    '333' did not match
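One way to see the added group for yourself is to print the qr// object, since it stringifies to its wrapped pattern. This is a small sketch, not part of the original test; the exact flag text inside the group depends on the Perl version (newer perls print something like (?^:...), older ones (?-xism:...)), so the output is deliberately not reproduced here. The helper name is_wrapped is hypothetical.

```perl
use strict;
use warnings;

my $re = qr/(2[0-4]|1?[0-9])?[0-9]|25[0-5]/;

# Hypothetical helper: true if the stringified pattern begins with "(?"
# and ends with ")", i.e. it carries its own enclosing group.
sub is_wrapped { my ($pat) = @_; return "$pat" =~ /\A\(\?.*\)\z/s ? 1 : 0 }

# Stringifying a qr// object shows the group that Perl put around it.
print "qr// object : $re\n";
print "ref type    : ", ref($re), "\n";
print "wrapped     : ", is_wrapped($re), "\n";

# Interpolating it between anchors leaves that group in place.
my $anchored = qr/^$re$/;
print "anchored    : $anchored\n";
```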

    Bottom line: Wherever possible, prefer  qr// to raw strings for regex expressions.

    Please see perlre, perlretut, and perlrequick.

    Update: Incidentally, the regex  qr/(2[0-4]|1?[0-9])?[0-9]|25[0-5]/ does not match the strings  000 001 012 etc. (Update: The regex does match  00 01 02 etc.) If this is an issue, I suggest
        qr{ [01]? \d? \d | 2 [0-4] \d | 25 [0-5] }xms
    instead, but whatever you use, verify it with something like Test::More as choroba did!
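A verification along those lines might look like the sketch below. It is not choroba's test, just a minimal Test::More example under a few assumptions: the helper name is_octet is hypothetical, and \A and \z are used in place of ^ and $ so that a trailing newline is not silently accepted.

```perl
use strict;
use warnings;
use Test::More;

# The alternative pattern suggested above; /x permits the whitespace.
my $octet = qr{ [01]? \d? \d | 2 [0-4] \d | 25 [0-5] }xms;

# Hypothetical helper: true if the whole string is a decimal value 0-255,
# with leading zeros allowed up to three digits in total.
sub is_octet { my ($s) = @_; return $s =~ /\A$octet\z/ ? 1 : 0 }

# Every value in range, plus the zero-padded forms, should match.
ok(  is_octet($_), "'$_' matches" )        for 0 .. 255, qw(000 001 012);

# Out-of-range numbers, stray characters and the empty string should not.
ok( !is_octet($_), "'$_' does not match" ) for 256, 300, 1000, qw(a1 1a 0000), '';

done_testing();
```

Running the script directly or through prove should report each case individually, which makes it easy to spot exactly which input a changed pattern starts to accept or reject.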


    Give a man a fish:  <%-{-{-{-<

      Thanks for all the replies, helped a lot.

      @choroba - the test code as given didn't help me much... guessing I need to read up on Test::More. Using qr// - I have now picked that up, cheers.

      @AnomalousMonk - had to read it twice but now I know the result was what it was not because 'the computer doesn't like me' ;) Thanks.

      @Athanasius - I added the first ? quantifier so the expression matches 10-99, and indeed didn't realise it introduces an unexpected match as well ...

      @Crackers - spotted and fixed the above, cheers :)

      @graff - all the spotting and fixing of errors above does indeed make your case... my take on it would be that regexes are good as long as you don't get in deeper than you can easily troubleshoot yourself out of, and that (getting in too deep) is very easily done.

      @Discipulus - yes, I am reinventing IPv4 address matching, partially as an exercise. Thanks for the links, heading over there now.
