Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

How Greedy is Greedy: A Regex Question

by dsb (Chaplain)
on Jul 03, 2001 at 22:26 UTC ( [id://93618]=perlquestion: print w/replies, xml ) Need Help??

dsb has asked for the wisdom of the Perl Monks concerning the following question:

I've read in several places that the '.*' construct is not the most efficient way to grab multiple characters because of its greediness, or its first directive to match as much as possible, only matching less when there is a sub-expression further in the regex that is required to match and can't.

That I understand. My question is this:
If there are two of a similar pattern in a string will the '.*' match both of them if the terminating sequence is the same?

For example:
Say the string to matched against is:
<a href="http://www.perlmonks.org">Perlmonks</a> is a great site, so is <a href="http://www.devshed.com">DevShed</a>

And say(for arguments sake, I would not use this regular expression if not for the sake of this question) that the regex used was:

$string =~ m%<a href="([^"]+)">(.*)</a>%;
So I wondered, would the '.*' construct in this case match only the text from the first link tag, or would it match everything from the end of the first URL up to the closing tag of the second link? So that $2 would contain:
Perlmonks</a> is a great site, so is <a href="http://www.devshed.com">DevShed

The answer is yes(I tried it). But when the '.*?' construct is used, only 'Perlmonks' is matched. So the question mark apparently makes the '.*' construct non-greedy.

So is it the optional nature of '?' that makes the '.*' construct non-greedy, or is it something else entirely?

Amel - f.k.a. - kel

Replies are listed 'Best First'.
Re: How Greedy is Greedy: A Regex Question
by tachyon (Chancellor) on Jul 03, 2001 at 22:33 UTC

    You are correct the x* modifier means as much of x as possible to make a match. x*? means as little of x as possible including no x at all. It is similar for x+ and x+? except x+? must have at least one x. See perlman:perlre for more details. Here is a relevant chunk:

    The following standard quantifiers are recognized: 
    
    
        *      Match 0 or more times
        +      Match 1 or more times
        ?      Match 1 or 0 times
        {n}    Match exactly n times
        {n,}   Match at least n times
        {n,m}  Match at least n but not more than m times
    
    (If a curly bracket occurs in any other context, it is
    treated as a regular character.) The ``*'' modifier is
    equivalent to {0,}, the ``+'' modifier to {1,}, and the
     ``?'' modifier to {0,1}. n and m are limited to integral
    values less than 65536. 
    
    By default, a quantified subpattern is ``greedy'', that is,
    it will match as many times as possible (given a particular
    starting location) while still allowing the rest of the
    pattern to match. If you want it to match the minimum
    number of times possible, follow the quantifier with a
    ``?''. Note that the meanings don't change, just the
    ``greediness'': 
    
    
        *?     Match 0 or more times
        +?     Match 1 or more times
        ??     Match 0 or 1 time
        {n}?   Match exactly n times
        {n,}?  Match at least n times
        {n,m}? Match at least n but not more than m times
    
    
    

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n\w+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: How Greedy is Greedy: A Regex Question
by japhy (Canon) on Jul 04, 2001 at 01:42 UTC
    Non-greediness is something you really have to understand. It doesn't mean "shortest", just like "greedy" doesn't mean "longest". It just means the engine would rather not match characters, given the choice.

    Read chapter 6 of my book for some good examples. (Let me know if they're good examples.)

    japhy -- Perl and Regex Hacker
Re: How Greedy is Greedy: A Regex Question
by particle (Vicar) on Jul 03, 2001 at 22:39 UTC
    '?' makes things non-greedy. check perlman:perlre for a more complete explaination, and more interesting modifiers....

    ~Particle

Re: How Greedy is Greedy: A Regex Question
by clemburg (Curate) on Jul 04, 2001 at 12:23 UTC
Re: How Greedy is Greedy: A Regex Question
by Spenser (Friar) on Mar 18, 2007 at 23:22 UTC
    The practical part of this posting doesn't seem to be addressed. In the example data, the second hyperlink (i.e., the <a href...>...</a>) is lost by the expression used. Here they are again:
    <a href="http://www.perlmonks.org">Perlmonks</a> is a great site, so i +s <a href="http://www.devshed.com">DevShed</a> $string =~ m%<a href="([^"]+)">(.*)</a>%;
    The original poster complains that when he uses (.*?) for the second parentheses, it doesn't work properly. What should be used is (.?) without the asterisk.

    -Spenser

    That's Spenser, with an "s" like the detective.

      Actually, i wasn't complaining at all. Rather inquiring as to what it is about the ? that makes the expression non-greedy back when I was still first tackling regexes. That said, I fail to see what .? would accomplish. It would match 0 or 1 non-newline characters, which has nothing to do with the OP. Perhaps you can expand on you thinking?


      dsb
      This @ISA my( $cool ) %SIG

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://93618]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others imbibing at the Monastery: (3)
As of 2024-04-20 02:50 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found