Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

How will my regular expression match?

by hv (Prior)
on Jul 04, 2004 at 16:57 UTC ( [id://371722]=perltutorial: print w/replies, xml ) Need Help??

A common source of confusion is the issue of which possible match the regular expression will find in any given instance, and this gives rise to many queries on SOPW and many different attempts to explain what happens.

A common theme in the explanations is to talk about preferences - the regular expression engine "prefers" longer or shorter matches, or leftmost matches.

The reality is however very simple: the regular expression engine will always return the first match it finds, and if you understand the order in which the engine attempts matches, you need never be confused about which match will be found.

A few simple rules (in no particular order) characterise the order in which attempts are made.

1: leftmost character first

The engine will start trying to match the pattern against the string, starting with the first character of the string. Only if it cannot find any match against the beginning of the string will it start looking for matches starting with the second character of the string, and so on until the string is exhausted.

Example:

($match) = "The longest word" =~ /(\w+)/; print $match;
.. will print "The", since it matches at the beginning of the string. The fact that there is a longer match to find is irrelevant - a match against the beginning of the string succeeded, so this is the match that is returned.

2: leftmost alternation first

Given a choice of alternates, the leftmost alternate is tried first. If a match is found against the leftmost alternate, the engine will attempt to match the rest of the pattern against the rest of the string; only if no match can be found will the next alternate be tried.

Example:

($match) = "Mrs Smith" =~ /(Mr|Mrs)/; print $match;
.. will print "Mr": the first alternate matched, and thus satisfied the entire pattern, so this is the match returned. The second alternate was never looked at.

3. Maximal (greedy)

With any normal quantifier (ie any of ? * + {1,3}), the engine will first try to match the greatest number of iterations permitted (or the most available if that is fewer than permitted), and try to match the rest of the pattern against the rest of the string. Only if the rest of the pattern fails to match will it try matching one fewer iteration, and repeat until the minimum permitted number is reached.

Example:

($match) = "/foo/bar" =~ m[.*/(.*)]; print $match;
.. prints "bar": the engine tries to match the initial .* against "/foo/bar", then "/foo/ba", then "/foo/b", then "/foo/", then "/foo". At this point the remaining pattern matches, so this is the match returned.

4. Minimal (non-greedy)

Adding a '?' after a quantifier (ie ?? *? +? {1,3}?) requests a minimal match, so the order is reversed: the engine first attempts to match the fewest number of iterations permitted and if it can, attempts to match the rest of the pattern against the rest of the string. Only if that fails does it try matching one more iteration, and continues increasing the number of iterations matched until the upper limit is reached.

Example:

($match) = "foo/bar/baz" =~ m[.*?/(.*)]; print $match;
.. prints "bar/baz": the engine tries to match the initial .* against "", then "f", then "fo", then "foo". At this point the rest of the pattern matches, so this is the match that is returned.

5. Pattern components: left to right

Implicit in some of the above examples is that the components of the pattern are matched from left to right. So in the 'maximal' example above, the engine decides first (based on the rules above) how it is going to try to match the initial .*, and only after that is fixed does it start worrying about the remaining components of the pattern.

6. Nesting

This is really just more of the "left to right" rule, best explained by example:

($outer, $inner) = "/foo/bar" =~ m[(/(.+))*]; print $inner;
.. prints "foo/bar". The engine reaches the outer quantifier, knows it should match it as many times as possible, and attempts to match its innards. The first iteration of the innards reaches the nested quantifier, and tries to match it as many times as possible, giving "foo/bar". At this point, the outer quantifier cannot be matched a second time but that's fine - the pattern has been fully satisfied, and so this match is returned.

That covers most of the standard cases. There are a few other possibilities to consider (eg anchors, pos(), and the various zero-width assertions). Also, the optimiser may sometimes cause the engine to skip some steps, but only if it is sure that the actual match found will be the same.

I hope this will help you work out which match your regular expression will find.

Hugo

Replies are listed 'Best First'.
Re: How will my regular expression match?
by dsb (Chaplain) on Jul 06, 2004 at 12:15 UTC
    Hey, hv....

    I think this a great short explanation of regex engine expectations. The only part I might tweak is the part that addresses the mostly greedy nature of the Perl engine.

    Your explanation is right on with regards to the way the engine behaves. You may, however, want to change the term iterations to something else. Iterations usually refers to the repetition of a series of operations. So, while the process of attempting to match against each character could be called an iteration of the regex engine, the characters could really not.

    I'm not trying to be overly critical or anything. This tutorial would be great for new (Perl|regex) users to read and i guess I just think the use of iterations might be a bit confusing to them.


    dsb
    This is my cool %SIG
      Iterations usually refers to the repetition of a series of operations.

      Hmm, I was thinking of the process of matching "the thing quantified" once as the series of operations being repeated. But I don't have a good name for "the thing quantified". I agree that 'iterations' isn't ideal, but I'm not sure what would be better.

      I'm not trying to be overly critical or anything.

      Please, do be. :)

      Hugo

        I don't have a good name for "the thing quantified".
        I suggest "repetitions".

        We're not really tightening our belts, it just feels that way because we're getting fatter.
Re: How will my regular expression match?
by dpavlin (Friar) on Jul 11, 2004 at 00:09 UTC
    Allthough, this is repetitive, using one-liners like:
    perl -e "use re 'debug'; print 'this is a demo' =~ /\w+/;"
    helped me more than once (it's even more useful with complex regexes :-). It'a a shame that you have to dig into perldebug to find it. I think it should be mentioned in perlre also.

    2share!2flame...

      I use this a lot, usually with the shorter command-line invocation ('-Dr', available if perl was built with debugging enabled). However the output was designed primarily for internals debugging, and is likely to be more scary than helpful to the casual user.

      Various people have taken advantage of the information it provides to offer a friendlier interface to regexp debugging. Though I haven't tried any of them, I'm aware of ReBug (I believe this is open source, but I can't get to the website: the author gave a talk on it at the 2001 OSCON), Komodo (commercial, with "try before you buy"), and The Regex Coach ("donationware", donate if you like it) which was recently mentioned on PM.

      Hugo

      There's also the shorter, perl -Mre=debug -e ' ... ' for those people without a debugging perl.
Re: How will my regular expression match?
by vasundhar (Acolyte) on Feb 11, 2010 at 19:32 UTC
    Hugo, Thank you very much for sharing. I guess this is the best explanation. Short, Straight to the point.
Re: How will my regular expression match?
by Anonymous Monk on Jan 09, 2007 at 02:42 UTC
    Hi Though not directly related to the explanation here, I think this is the most relevant place. There are 6 Patterm Matching Rules explained for Perl Engine (Programming Perl by Larry, Tom and Randal). But in Rule 3 what is the meaning of the Left to Right Pecking order as I understand that each and every item (Assertion or Atom) of the current RegEx Alternative is searched sequentially in the String from the current position onwards. So in case the current item does not match at current position in the string, what does it mean to try the next higher pecking order items. Thanx Manish Sood manishsood14@yahoo.com

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perltutorial [id://371722]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others romping around the Monastery: (4)
As of 2024-03-19 03:18 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found