Greedy modifier found to be working non-greedy in a named group

by rkabhi (Acolyte)
on Nov 29, 2019 at 10:09 UTC

rkabhi has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I am new to named groups in Perl regexes. I am using perl 5.16.0.

The code below works as expected:

$_ = "This is a teeeext for testting";
/(?<char>.*)/ and print "'$+{char}' is matched pattern\n";
# This prints 'This is a teeeext for testting' is matched pattern

Replacing '.' with 'e' in the regular expression above leaves the matched pattern empty, found at the beginning of the string:

$_ = "This is a teeeext for testting";
/(?<char>e*)/ and print "'$+{char}' is matched pattern\n";
# This prints '' is matched pattern

I was expecting the output of the above code to be:

'eeee' is matched pattern

Could you please explain why the output is different? Is the * modifier becoming non-greedy, or is there another way to understand this?

Replies are listed 'Best First'.
Re: Greedy modifier found to be working non-greedy in a named group
by tybalt89 (Monsignor) on Nov 29, 2019 at 10:18 UTC
    $_ = "This is a teeeext for testting";
    /(?<char>e*)/ and print "'$+{char}' is matched pattern at $-[0]\n";

    Outputs:

    '' is matched pattern at 0

    e* matches at the beginning of the string, try e+

      @tybalt89 I have already tried e+ and I know that it works. But e* should also have worked, because '*' is a greedy modifier unless it is followed by a '?'. So, in my example, both e* and e+ should have given the same output. Why does the output differ?

        The greedy operators still implement a leftmost-longest strategy. That means that the leftmost match will win even if you find a longer match later.

        e* can match zero e. The leftmost place where you can match zero e is at the start of the string.
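To make the "leftmost wins" point concrete, here is a minimal sketch that reports where each pattern matches. The helper name first_match is made up for this illustration and is not from the thread; it only relies on the built-in @- and @+ arrays.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the matched text
# plus its start and end offsets, or an empty list if nothing matches.
sub first_match {
    my ($re, $str) = @_;
    return unless $str =~ $re;
    my ($start, $end) = ($-[0], $+[0]);
    return (substr($str, $start, $end - $start), $start, $end);
}

my $text = "This is a teeeext for testting";

for my $pat ('e*', 'e+') {
    my ($m, $start, $end) = first_match(qr/$pat/, $text);
    print "$pat matched '$m' from $start to $end\n";
}
# prints:
# e* matched '' from 0 to 0
# e+ matched 'eeee' from 11 to 15
```

e* succeeds immediately at offset 0 with zero characters, so the engine never gets as far as offset 11.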

        A greedy match is the longest representation of the first match. In the case of /e*/ the first match is the start of the string, and is zero characters long because the first character is not an 'e'. In the case of /e+/ the first match starts at the first 'e' and continues until it finds a non-'e' character.
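One way to see that '*' has not turned non-greedy is to pin the starting point and compare it with the lazy form. This is only a sketch: the leading 't' in the patterns and the helper capture1 are assumptions added for the illustration, not something from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text of capture group 1, or undef
# if the pattern does not match.
sub capture1 {
    my ($re, $str) = @_;
    return $str =~ $re ? $1 : undef;
}

my $text = "This is a teeeext for testting";

# The leading 't' forces both patterns to start at the same place,
# so only the greediness of the quantifier differs.
printf "greedy t(e*)  captured '%s'\n", capture1(qr/t(e*)/,  $text);
printf "lazy   t(e*?) captured '%s'\n", capture1(qr/t(e*?)/, $text);
# prints:
# greedy t(e*)  captured 'eeee'
# lazy   t(e*?) captured ''
```

At the same starting position the greedy form takes all four 'e' characters while the lazy form takes none, so greediness decides the length of the match, not where it starts.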

        > both e* and e+ should have given same output. Why is the difference in output?

        Good question; I think many people are confused by this.

        It's important to understand that empty matches exist and that e* means e+|e{0}, that is, one or more e or, failing that, zero e ( or something like e{0,32766} ° ).

        So you are actually matching e{0} at the start of the string, before anything else!
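The empty matches become visible when the pattern is applied repeatedly with /g. The following is a rough sketch; all_matches is a hypothetical helper written for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: collect [start offset, matched text] for every
# match the pattern finds while scanning the string with /g.
sub all_matches {
    my ($re, $str) = @_;
    my @found;
    while ($str =~ /$re/g) {
        push @found, [ $-[0], substr($str, $-[0], $+[0] - $-[0]) ];
    }
    return @found;
}

my @found = all_matches(qr/e*/, "This is a teeeext for testting");
my ($first_nonempty) = grep { length $_->[1] } @found;

print "first match: '$found[0][1]' at $found[0][0]\n";
print "first non-empty match: '$first_nonempty->[1]' at $first_nonempty->[0]\n";
# prints:
# first match: '' at 0
# first non-empty match: 'eeee' at 11
```

Every offset before the first 'e' yields an empty match of its own, and a plain match without /g stops at the very first of them.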

        Printing out the match position via @+ helps to demonstrate it:

        DB<1> $_ = "This is a teeeext for testting";
        DB<2> ;/(?<char>e*)/ and print "'$+{char}' is matched pattern at pos $+[0]\n";
        '' is matched pattern at pos 0
        DB<3> ;/(?<char>e{0})/ and print "'$+{char}' is matched pattern at pos $+[0]\n";
        '' is matched pattern at pos 0
        DB<4> ;/(?<char>e+)/ and print "'$+{char}' is matched pattern at pos $+[0]\n";
        'eeee' is matched pattern at pos 15
        DB<5> ;/(?<char>e{1,32766})/ and print "'$+{char}' is matched pattern at pos $+[0]\n";
        'eeee' is matched pattern at pos 15
        DB<6>

        update
        • Actually, $+[0] gives you the end of the whole match; $-[0] gives you its start.
        • e* is not limited in my Perl version, but the upper bound in e{,} is.
        • I couldn't find a way to get the positions of named groups.
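As a possible workaround for the last point: a named group is also a numbered group, so its offsets should be available through @- and @+ under its group number. A minimal sketch, assuming the named group is the first group in the pattern; the helper spans and the leading 't' in the pattern are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (match start, match end, group 1 start,
# group 1 end), or an empty list if the pattern does not match.
sub spans {
    my ($re, $str) = @_;
    return unless $str =~ $re;
    return ($-[0], $+[0], $-[1], $+[1]);
}

my $text = "This is a teeeext for testting";

# (?<char>...) is the first group, so its offsets sit at index 1.
my ($m_start, $m_end, $g_start, $g_end) = spans(qr/t(?<char>e+)/, $text);
print "whole match spans $m_start..$m_end\n";
print "group 'char' spans $g_start..$g_end\n";
# prints:
# whole match spans 10..15
# group 'char' spans 11..15
```

This only works when you know which number the named group has in the pattern.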

        Cheers Rolf
        (addicted to the Perl Programming Language :)

        °) Yes, there is a maximum upper bound in range quantifiers, which I expected to be around 2**16, so it's an incomplete analogy.

        Thanks, it's clear now.
Re: Greedy modifier found to be working non-greedy in a named group
by rkabhi (Acolyte) on Nov 29, 2019 at 12:52 UTC

    To conclude, the explanation of leftmost first and then longest is valid even if I don't use named groups.

    So, for example,

    $_ = "This is a teeeext for testting";
    /(e*)/ and print "'$1' is matched pattern\n";
    # gives '' is matched pattern

    and

    $_ = "This is a teeeext for testting";
    /(e+)/ and print "'$1' is matched pattern\n";
    # gives 'eeee' is matched pattern

    So, this behavior is generic.

      the explanation of leftmost first and then longest will be valid even if I didn't use named groups. So, this behavior is generic.

      Yes, perlre says:

      In Perl the groups are numbered sequentially regardless of being named or not. Thus in the pattern
      /(x)(?<foo>y)(z)/
      $+{foo} will be the same as $2, and $3 will contain 'z'

      And the description of (?|pattern) says:

      Named captures are implemented as being aliases to numbered groups holding the captures
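The aliasing described in those two perlre passages can be checked directly. This is just a small sketch using the pattern from the quote; the helper captures and the input string "xyz" are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns ($1, $2, $3, $+{foo}) for the pattern
# quoted from perlre, or an empty list if it does not match.
sub captures {
    my ($str) = @_;
    return unless $str =~ /(x)(?<foo>y)(z)/;
    return ($1, $2, $3, $+{foo});
}

my ($one, $two, $three, $foo) = captures("xyz");
print "\$2 = '$two', \$+{foo} = '$foo', \$3 = '$three'\n";
# prints: $2 = 'y', $+{foo} = 'y', $3 = 'z'
```

Since the name is only another way to reach group 2, the matching behavior is the same with or without it.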

      Also, I remember finding the description of the left-to-right operation of the regex engine in the Camel quite enlightening.
