PerlMonks  

look-ahead greed changed between perl releases?

by raygun (Scribe)
on May 06, 2014 at 16:14 UTC ( [id://1085201]=perlquestion )

raygun has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Monks. I'm trying to understand how greed plays out in a look-ahead assertion, and why it seems to have changed between perl 5.12.4 and 5.16.3.

This works as I expect:

$s = 'alter string-N-U-M-B-E-R-O-N-E and string-N-U-M-B-E-R-T-W-O.';
$s =~ s/(\S+([[:upper:]]-){2,}[[:upper:]]\b)/%$1/g;
print "$s\n";
and prints "alter %string-N-U-M-B-E-R-O-N-E and %string-N-U-M-B-E-R-T-W-O." Because the + is greedy, it picks up all nonspace characters before the capital-letters-separated-by-hyphens portion, and because the {2,} is greedy, it picks up the maximum amount of hyphenated capitals.

But these greediness properties do not work when the entire expression is cast as a look-ahead assertion (which allows one to remove the reference in the replacement pattern). Replace the substitution line with:

$s =~ s/(?=\S+([[:upper:]]-){2,}[[:upper:]]\b)/%/g;
to see what I mean. The output is now "alter %s%t%r%i%n%g%-%N%-%U%-%M%-%B%-%E%-%R%-O-N-E and %s%t%r%i%n%g%-%N%-%U%-%M%-%B%-%E%-%R%-T-W-O." This seems to indicate that neither the + nor the {2,} are being greedy anymore.

This worked as I expected in perl 5.12.4, but no longer works in perl 5.16.3. Is there something about the way greed was interpreted in look-ahead assertions that changed between these releases? Thanks for your insight.

Replies are listed 'Best First'.
Re: look-ahead greed changed between perl releases?
by kcott (Archbishop) on May 06, 2014 at 17:59 UTC

    G'day raygun,

    That was a bug which was fixed in 5.14.0.

    From perl5140delta: Regular Expression Bug Fixes:

    • A pattern containing a + inside a lookahead would sometimes cause an incorrect match failure in a global match (for example, /(?=(\S+))/g) [perl #68564].
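The pattern quoted in that delta entry can be tried directly. This is a minimal sketch, not part of the bug report: the input string 'ab cd' and the variable name are assumptions chosen for illustration. On a perl that includes the fix, the look-ahead should succeed at every position where a run of non-space characters begins, so each word contributes one capture per character.

```perl
use strict;
use warnings;

# Hypothetical input; any string with a couple of short words will do.
my $input = 'ab cd';

# List-context global match: collect $1 from every successful look-ahead.
my @captures = $input =~ /(?=(\S+))/g;

# Print the interpreter version next to the captures so that runs on
# different perls can be compared side by side.
print "$]: @captures\n";
```

On 5.14.0 or later this should print the version followed by "ab b cd d". What an affected release prints is not shown here, since it depends on the bug.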

    -- Ken

      Aha! I had overlooked that. Thanks, Ken!
Re: look-ahead greed changed between perl releases? (before and after the fix)
by kcott (Archbishop) on May 06, 2014 at 18:55 UTC

    As a follow-up to my post above where I pointed to the documented bug fix, here's a short script (with simplified data and regex) which demonstrates what's happening before and after the 5.14.0 fix.

    #!/usr/bin/env perl
    use 5.010;
    use strict;
    use warnings;

    say "Perl Version: $^V";

    my $test_string = 'www xxx-A-B-C-D yyy zzz-E-F-G-H';
    my $re = qr< \S+ (?: [A-Z]- ){2,} [A-Z] \b >x;

    say '=' x 60, "\n*** NO look-ahead ***\n", '=' x 60;
    my $no_look_ahead_test_string = $test_string;

    while ($no_look_ahead_test_string =~ /($re)/gp) {
        say "Prematch: '${^PREMATCH}'";
        say "Match: '${^MATCH}'";
        say "Postmatch: '${^POSTMATCH}'";
        say '-' x 60;
    }

    say '=' x 60, "\n*** USING look-ahead ***\n", '=' x 60;
    my $look_ahead_test_string = $test_string;

    while ($look_ahead_test_string =~ /(?=$re)/gp) {
        say "Prematch: '${^PREMATCH}'";
        say "Match: '${^MATCH}'";
        say "Postmatch: '${^POSTMATCH}'";
        say '-' x 60;
    }

    The nearest versions I have to 5.14.0 are 5.12.3 and 5.14.2.

    Here's the 5.12.3 output:

    Perl Version: v5.12.3
    ============================================================
    *** NO look-ahead ***
    ============================================================
    Prematch: 'www '
    Match: 'xxx-A-B-C-D'
    Postmatch: ' yyy zzz-E-F-G-H'
    ------------------------------------------------------------
    Prematch: 'www xxx-A-B-C-D yyy '
    Match: 'zzz-E-F-G-H'
    Postmatch: ''
    ------------------------------------------------------------
    ============================================================
    *** USING look-ahead ***
    ============================================================
    Prematch: 'www '
    Match: ''
    Postmatch: 'xxx-A-B-C-D yyy zzz-E-F-G-H'
    ------------------------------------------------------------
    Prematch: 'www xxx-A-B-C-D yyy '
    Match: ''
    Postmatch: 'zzz-E-F-G-H'
    ------------------------------------------------------------

    Here's the 5.14.2 output:

    Perl Version: v5.14.2
    ============================================================
    *** NO look-ahead ***
    ============================================================
    Prematch: 'www '
    Match: 'xxx-A-B-C-D'
    Postmatch: ' yyy zzz-E-F-G-H'
    ------------------------------------------------------------
    Prematch: 'www xxx-A-B-C-D yyy '
    Match: 'zzz-E-F-G-H'
    Postmatch: ''
    ------------------------------------------------------------
    ============================================================
    *** USING look-ahead ***
    ============================================================
    Prematch: 'www '
    Match: ''
    Postmatch: 'xxx-A-B-C-D yyy zzz-E-F-G-H'
    ------------------------------------------------------------
    Prematch: 'www x'
    Match: ''
    Postmatch: 'xx-A-B-C-D yyy zzz-E-F-G-H'
    ------------------------------------------------------------
    Prematch: 'www xx'
    Match: ''
    Postmatch: 'x-A-B-C-D yyy zzz-E-F-G-H'
    ------------------------------------------------------------
    Prematch: 'www xxx'
    Match: ''
    Postmatch: '-A-B-C-D yyy zzz-E-F-G-H'
    ------------------------------------------------------------
    Prematch: 'www xxx-'
    Match: ''
    Postmatch: 'A-B-C-D yyy zzz-E-F-G-H'
    ------------------------------------------------------------
    Prematch: 'www xxx-A'
    Match: ''
    Postmatch: '-B-C-D yyy zzz-E-F-G-H'
    ------------------------------------------------------------
    Prematch: 'www xxx-A-B-C-D yyy '
    Match: ''
    Postmatch: 'zzz-E-F-G-H'
    ------------------------------------------------------------
    Prematch: 'www xxx-A-B-C-D yyy z'
    Match: ''
    Postmatch: 'zz-E-F-G-H'
    ------------------------------------------------------------
    Prematch: 'www xxx-A-B-C-D yyy zz'
    Match: ''
    Postmatch: 'z-E-F-G-H'
    ------------------------------------------------------------
    Prematch: 'www xxx-A-B-C-D yyy zzz'
    Match: ''
    Postmatch: '-E-F-G-H'
    ------------------------------------------------------------
    Prematch: 'www xxx-A-B-C-D yyy zzz-'
    Match: ''
    Postmatch: 'E-F-G-H'
    ------------------------------------------------------------
    Prematch: 'www xxx-A-B-C-D yyy zzz-E'
    Match: ''
    Postmatch: '-F-G-H'
    ------------------------------------------------------------

    [In case you were wondering, I'm using perlbrew which made it easy to switch versions between runs. Editing the shebang line or calling the script with different paths to perl will work just as well.]

    -- Ken

Re: look-ahead greed changed between perl releases?
by amon (Scribe) on May 06, 2014 at 18:05 UTC

    You are mistaken in your assumption that the observed lookahead-behavior would be incorrect. Both the “+” and the “{2,}” are as greedy as they are outside of lookaheads, but the lookahead matches repeatedly because it doesn't consume any characters by itself.

    Let's play the manual regex engine game:

    ""."string-N-U-M-B-E-R-O-N-E" MATCH($1="string-N-U-M-B-E-R-O-N-")
    "%s"."tring-N-U-M-B-E-R-O-N-E" MATCH($1="tring-N-U-M-B-E-R-O-N-")
    "%s%t"."ring-N-U-M-B-E-R-O-N-E" MATCH($1="ring-N-U-M-B-E-R-O-N-")
    "%s%t%r"."ing-N-U-M-B-E-R-O-N-E" MATCH($1="ing-N-U-M-B-E-R-O-N-")
    "%s%t%r%i"."ng-N-U-M-B-E-R-O-N-E" MATCH($1="ng-N-U-M-B-E-R-O-N-")
    "%s%t%r%i%n"."g-N-U-M-B-E-R-O-N-E" MATCH($1="g-N-U-M-B-E-R-O-N-")
    "%s%t%r%i%n%g"."-N-U-M-B-E-R-O-N-E" MATCH($1="-N-U-M-B-E-R-O-N-")
    "%s%t%r%i%n%g%-"."N-U-M-B-E-R-O-N-E" MATCH($1="N-U-M-B-E-R-O-N-")
    "%s%t%r%i%n%g%-%N"."-U-M-B-E-R-O-N-E" MATCH($1="-U-M-B-E-R-O-N-")
    ...
    "%s%t%r%i%n%g%-%N%-%U%-%M%-%B%-%E%-%R"."-O-N-E" MATCH("-O-N-E")
    "%s%t%r%i%n%g%-%N%-%U%-%M%-%B%-%E%-%R%-"."O-N-E" FAIL
    "%s%t%r%i%n%g%-%N%-%U%-%M%-%B%-%E%-%R%-O-N-E"."" FAIL
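The same walk can be checked mechanically. The sketch below is an illustration rather than amon's code: it uses a shortened, hypothetical input ('ab-X-Y-Z') and wraps the whole look-ahead body in a capture group so that the text spanned at each starting position becomes visible. If the quantifiers were no longer greedy, the spans would be shorter than the rest of the word.

```perl
use strict;
use warnings;

# Hypothetical, shortened stand-in for the thread's test data.
my $s = 'ab-X-Y-Z';

# Same shape as the pattern in the question, except that the inner group is
# non-capturing and the entire look-ahead body is captured instead.
my @spans = $s =~ /(?=(\S+(?:[[:upper:]]-){2,}[[:upper:]]\b))/g;

# One line per position at which the look-ahead succeeded.
print "'$_'\n" for @spans;
```

On a perl with the 5.14.0 fix this should list 'ab-X-Y-Z', 'b-X-Y-Z' and '-X-Y-Z': at every position where the look-ahead succeeds it still reaches to the end of the word, and it stops succeeding once fewer than two hyphenated capitals remain after the mandatory \S+.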

      Another variation:

      c:\@Work\Perl>perl -wMstrict -le
      "print $];
       ;;
       my $s = 'ABCDEF';
       ;;
       my @captures = $s =~ m{ (?= (.{3,})) }xmsg;
       printf qq{'$_' } for @captures;
      "
      5.014004
      'ABCDEF' 'BCDEF' 'CDEF' 'DEF'

      Interestingly, this particular example behaves identically on all the Perl versions I have in captivity (all Win32): ActiveState 5.8.9 and Strawberries 5.10.1.5, 5.12.3.0 and 5.14.4.1. That holds for each of the patterns (.{3,}), (.+) and (.{1,}).
      (Of course, output differs between (.{3,}) and (.{1,}), but the latter and (.+) give the same output.)

      Thanks, amon, for the detailed illustration.
Re: look-ahead greed changed between perl releases?
by sn1987a (Deacon) on May 06, 2014 at 16:33 UTC

    Your reported behaviour in 5.16.3 is what I expect. I do not have a 5.12.4 handy to test.

    The look ahead is zero-width, which means it does not consume any of the input. The regexp matches starting with the s, so a % is placed before it. The regexp engine then resumes searching at the next character and matches starting at the t. This process repeats until there are not enough hyphenated upper case letters left to match.
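A much smaller case shows the same stepping. This is a minimal sketch with a hypothetical input ('abc') and a deliberately trivial look-ahead, chosen only to isolate the zero-width behaviour from the rest of the original pattern.

```perl
use strict;
use warnings;

# Hypothetical input: three word characters, nothing else.
my $s = 'abc';

# The look-ahead succeeds before each word character but consumes nothing,
# so s///g inserts a '%' and then retries one position further along.
(my $marked = $s) =~ s/(?=\w)/%/g;

print "$marked\n";
```

This prints "%a%b%c": one substitution per position at which the look-ahead succeeds, which is the pattern seen in the 5.16.3 output from the question.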

      So that overrides the greediness precept? The perlre document isn't clear on this, so I wasn't sure which behavior was considered correct.

      I no longer have a 5.12.4 handy either: I discovered this issue after upgrading perl and finding my code was no longer working as it had before. Was this a bug in 5.12.4? I can't find anything in the perl*delta docs that leads me to believe this is an intentional behavior change.

        No, it does not override the greediness. The regexp still matches the longest string. But this is just a look ahead, so none of that string is consumed. Once the match/substitution is made, the regexp engine starts searching again one character further on.
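If the goal is a single % per word without a backreference, one possible approach is to pin the look-ahead to the start of a word so that the later positions cannot match. This is a sketch under that assumption and was not proposed in the thread; the `(?<!\S)` look-behind ("not preceded by a non-space character") is the only addition to the pattern from the question.

```perl
use strict;
use warnings;

my $s = 'alter string-N-U-M-B-E-R-O-N-E and string-N-U-M-B-E-R-T-W-O.';

# (?<!\S) only allows a match at the start of the string or right after
# whitespace, so the zero-width look-ahead can fire at most once per word.
(my $marked = $s) =~ s/(?<!\S)(?=\S+([[:upper:]]-){2,}[[:upper:]]\b)/%/g;

print "$marked\n";
```

With the fixed engine this should print "alter %string-N-U-M-B-E-R-O-N-E and %string-N-U-M-B-E-R-T-W-O.", the output the original capturing substitution produced.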

        amon below has a great graphical view of what is happening (Re: look-ahead greed changed between perl releases?).

Re: look-ahead greed changed between perl releases?
by AnomalousMonk (Archbishop) on May 07, 2014 at 04:16 UTC

    Just to add some more examples, here are 5.8 (which looks like what I would expect) thru 5.14. All are Strawberries except ActiveState 5.8.9. So it looks, as mentioned above, like something got b0rked in 5.10 and fixed in 5.14 (correct again).

    c:\@Work\Perl>perl -wMstrict -le
    "print $];
     my $s = 'alter string-N-U-M-B-E-R-O-N-E and string-N-U-M-B-E-R-T-W-O.';
     $s =~ s/(?=\S+([[:upper:]]-){2,}[[:upper:]]\b)/%/g;
     print qq{'$s'};
    "
    5.008009
    'alter %s%t%r%i%n%g%-%N%-%U%-%M%-%B%-%E%-%R%-O-N-E and %s%t%r%i%n%g%-%N%-%U%-%M%-%B%-%E%-%R%-T-W-O.'

    5.010001
    'alter %string-N-U-M-B-E-R-O-N-E and %string-N-U-M-B-E-R-T-W-O.'

    5.012003
    'alter %string-N-U-M-B-E-R-O-N-E and %string-N-U-M-B-E-R-T-W-O.'

    5.014004
    'alter %s%t%r%i%n%g%-%N%-%U%-%M%-%B%-%E%-%R%-O-N-E and %s%t%r%i%n%g%-%N%-%U%-%M%-%B%-%E%-%R%-T-W-O.'
