|Perl: the Markov chain saw|
Death to Dot Star!by Ovid (Cardinal)
|on Jul 27, 2000 at 07:27 UTC||Need Help??|
For those who have read some of my past posts, I freely confess to hastily posting bad regexes. I thought I was decent with them, but then I started rereading Mastering Regular Expressions. Yikes! I still have a lot to learn.
This "Discussion" is primarily aimed at those new to regexes, so my apologies if I seem pedantic.
I've seen a number of regexes posted to Perlmonks that use a dot star (.*) combination to slurp up characters. While it is sometimes not possible to avoid this, Perlmonks should try to avoid it like the Inquisition.
Dot star is often used as a "catch all" in regular expressions. Often you'll see a regex like the following:
The intent of this is to capture whatever is inside of parentheses to $1 (this is called backreferencing). However, this fails if $myvar is something like (yes, I know it's a ridiculous example):
We might expect the final line to print hi. It's me, but it won't. This is because the star quantifier is greedy. It will attempt the larget match that can possibly satisfy the regular expression. What would be printed is
To solve this, many programmers will add a question mark after the quantifier (.*?) which makes the quantifier match as little as possible. This is called lazy or non-greedy matching. Merely adding that little question mark will cause the code to print the hi. It's me, that we were looking for:
So we're all set, right? Wrong. This certainly works better than the greedy dot-star combination. It keeps trying to match until it finds the first quote mark and then tries to get the smallest match that satisfies the regex. Sounds fine, right? Well, no.
There are two problems with both the greedy and lazy version of the dot star: imprecision and tracking.
The imprecision is obvious. The example above shows the imprecision with the greedy version of .*, but the lazy version is more subtle. What happens if you were trying to extract questions in quotes without the trailing question mark? You might think that something like /"(.*?)\?"/ would do the trick. Unfortunately, we get the same result as above because the lazy matching doesn't guarantee the smallest match. It guarantees the smallest match from the first place that a match could possibly begin.
Tracking is another problem with both of them. With the greedy version of dot star, the dot star gobbles up the entire string. The regex engine is then forced to backtrack from the end of the string to find the longest possible match that will satisfy the regex. With the lazy version, the the regex engine is forced to stop after every match to see if the rest of the regex will match (kind of like a lookahead, but with subtle differences. See Mastering Regular Expressions, Second Edition, page 226 for the mechanics of this). With a one-shot regex, these issues are not usually much of a performance hit (though it can be nasty on a complicated regex with no possible match), but if you are forced to iterate over the regex, these "tracking" issues can have quite a performance hit on your program.
The solution is to use a negated character class:
What's going on there? When a caret "^" is the first character in a character class, it's telling the regex to match anything except what is in the character class. In this case, it is telling it to keep matching anything (including newline) that is not a quote. If there is a possible match, there is no tracking, there's no ambiguity, there's just a straight match. The above regex will match the first quote, capture everything that's not a quote to $1, and then match the end quote. It's fast and precise.
Unfortunately, it's a little more complicated with the "questions in quotes" example. Here's one solution:
Which breaks out to:
Yes, it's more work. Yes, it's harder to read. But it works and doesn't have the problems of the dot star.
A potential pitfall: Unless you are using the /s modifier, the .* does not match a newline (\n). A negated character class will happily match a newline if you don't include it. If your target text has them, be sure to include them in the negated character class, if appropriate.
If Monks would like to see more stuff like this, let me know.