Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Greediness of * vs. greediness of +

by rovf (Priest)
on Sep 08, 2010 at 09:17 UTC ( #859294=perlquestion: print w/ replies, xml ) Need Help??
rovf has asked for the wisdom of the Perl Monks concerning the following question:

$ perl -lwe '"abbbbc"=~/(b+)/ && print "Found: $1"' Found: bbbb $ perl -lwe '"abbbbc"=~/(b*)/ && print "Found: $1"' Found:
In both cases, the regexp succeed. In the first case, b+ picks up all four b's. In the second case, b* picks up zero b's. Why is + greedy in this example, but * is not? I try to find enlightment at perlre, but with no success.

-- 
Ronald Fischer <ynnor@mm.st>

Comment on Greediness of * vs. greediness of +
Download Code
Re: Greediness of * vs. greediness of +
by moritz (Cardinal) on Sep 08, 2010 at 09:29 UTC
    The overriding principle of the regex engine is to try the left-most possible match first.

    In the case of b*, it matches zero times at the start of the string (ie before the first character) - which is as many as it can, because there are not more than 0 leading b's in your string.

    In the case of b+, the empty match is not successful, so the regex engine bumps along, and tries to match as many b's as possible, starting from the second position (ie string index 1).

    So both are greedy, but neither do a search for longest match of consecutive b's. try

    $ perl -lwe '"abbabbbbc"=~/(b+)/ && print "Found: $1"' Found: bb

    It used the first position where b+ matched, and there took as many b's as possible.

    Perl 6 - links to (nearly) everything that is Perl 6.
Re: Greediness of * vs. greediness of +
by Marshall (Prior) on Sep 08, 2010 at 09:40 UTC
    Below are a couple of more patterns. In the first case, the expression just isn't getting to the b's. When it sees the a, that's zero or more b's (i.e. zero).

    In the second case, the "." says any first character then zero or more b's, since we got to the b's, we get them all. Same thing in the 3rd example.

    "abbbbc"=~/(b*)/ && print "Found: $1\n"; #(no b's) "abbbbc"=~/.(b*)/ && print "Found: $1\n"; #bbbb "bbbbbc"=~/(b*)/ && print "Found: $1\n"; #bbbbb
Re: Greediness of * vs. greediness of +
by rovf (Priest) on Sep 08, 2010 at 10:21 UTC
    In the case of b*, it matches zero times at the start of the string
    Ah, of course!!! Now that you say it, it becomes obvious!!!

    Thanks a lot.

    -- 
    Ronald Fischer <ynnor@mm.st>
Re: Greediness of * vs. greediness of +
by ww (Bishop) on Sep 08, 2010 at 13:00 UTC

    This is a "further" question re the replies from moritz and Marshall.

    And in following their answers, I found it helpful to remember that "greedy" is not the same as "global." While (b*) is greedy, it is not global, /(b*)/g. In other words (I thought) /(b*)/ stops after the first (moritz caught this)failure success , at the start of the string, whereas adding the /g would tell the regex engine to keep on trying until it reaches the end of the string.

    Well, that was my second thought.

    But, OOOPS,

    "abbbbc"=~/(b*)/g && print "Found: $1"; # Found:

    Huh?

    Well, is this a case where the rules are different in substitution?

    my $string1 = "abbbbc"; my $found1 = $string1 =~ s/(b*)/^/g; print "\$string1: $string1, \$found1: $found1"; $found1 is 4 =head after s/// $string: ^a^^c^ At begin of string ...no "b" found | # satisfies "0 or more 'b's" "a" (duh!) | Two "b"(s) found | # likewise; the first and second 'b's ar +e conflated? and again | "c" | no "b" after "c" | =cut

    That this code produces two replacements for the string of four 'b's remains a puzzle. Why does this appear (this may be my error) that the regex conflates two 'b's rather than all four?

    Enlightenment?
    or Coffee?

    Update: Wonderful answer below. s/failure/success/ per moritz; italics closed per ssandv.

      In other words (I thought) /(b*)/ stops after the first failure, at the start of the string,

      s/failure/success/

      whereas adding the /g would tell the regex engine to keep on trying until it reaches the end of the string.

      What /g does on a regex depends on context. In boolean scalar context it matches once, and stores the position in pos. If you execute it a second time, it starts off from where it left.

      The background is that you can write

      while (/(b*)/g) { ... }

      and get a new match for each iteration:

      $perl -e '$_="abbbabc"; while (/(b*)/g) { print "($1)\n" }' () (bbb) () (b) () ()

      Update: Answer to the second question

      That this code produces two replacements for the string of four 'b's remains a puzzle. Why does this appear (this may be my error) that the regex conflates two 'b's rather than all four?

      A naiive substitution implementation would loop on s/b*/^/, because it would continue to replace the empty string with ^ forever on.

      Perl is a bit more sophisticated: It detects a zero-width match, and before doing a second substition of a zero-width match at the same position it bumps along, and tries in the next position.

      So applying s/b*/^/ on abba make these steps:

      abba | match zero b's before a ^abba | match zero b's again. Don't substitute here, bump along ^abba | match 'bb' ^a^a | match zero b's ^a^^a | match zero b's, don't substitute but bump along ^a^^a^ | match zero b's, don't substitute but bump along

      You can watch it work; I didn't find a way to get the modified string, but at least you can monitor the match positions:

      $ perl -le '$_ = "abba"; s/b*/print pos; "^"/eg; print' 0 1 3 4 ^a^^a^
      Perl 6 - links to (nearly) everything that is Perl 6.
        If you execute it a second time, it starts off from where it left.
        If that where true, while (/(b*)/g) would never finish. It starts where it finished the previous time, unless it matched an empty string. In the latter case, pos() will be advanced by one. (Details are even more complicated. But documented. See the section "Repeated Patterns Matching a Zero-length Substring" in the perlre manual page.)

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://859294]
Approved by moritz
Front-paged by moritz
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (12)
As of 2014-10-21 07:29 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    For retirement, I am banking on:










    Results (98 votes), past polls