|Perl: the Markov chain saw|
Benchmarks aren't everythingby tilly (Archbishop)
|on Oct 24, 2005 at 08:01 UTC||Need Help??|
While it is nice to win benchmarks, sometimes other things are more important. Such as not halting people's programs.
I was reminded of this recently when I ran across a benchmark showing that Perl's regex engine was a lot slower than Java's. The benchmark was at http://www.tbray.org/ongoing/When/200x/2004/08/22/PJre. I sent Tim Bray (the author) a note explaining why Perl's engine was slower and why this is a Good Thing™, and he said that if I wrote it up somewhere, then he would point to it.
This is that write up.
The short version of the story is that Perl has accepted a significant performance penalty to reduce the odds that programs will accidentally get "stuck" and never finish a match. Whenever people write benchmarks, they only test the common case and so see the significant performance penalty. They don't see the bug reports that the change avoids.
The slightly more detailed version is that starting with Perl 5.6, Perl's RE engine avoids many of the exponential slowdowns that Mastering Regular Expressions talks about, but at the cost of tracking a lot of extra data. This fixes a lot of hard to find bugs at the cost of losing performance benchmarks.
The long version follows.
A lot of what follows will be very familiar for people who understand Mastering Regular Expressions. After all one of the biggest reasons for writing that book was to help more people understand the problem that I'm about to describe, and understand how to avoid it.
Suppose we have a pattern, like /^(\s*foo\s*)*$/ and we want to try to match it to a string like "foo foo fo". How do we do that?
Well there turn out to be a couple of strategies, but the one which virtually everyone uses (read the book for what the other one is, and why people use this) can be described as follows: do a recursive search for ways to try to match the pattern to the string. That is, the sets of combinations of choices that one might make in trying to match the pattern to the string (should this .* include the next character?) form a tree that you can do a recursive search through, looking for a match.
An example will help show what I mean, so let's try to match the above RE to a string that it succeeds with, "foo foo". I'll underline the matched part at each step, and put parens to indicate what choices it has made:
# It is unambiguous how to match to here: (foo foo # Let's try putting the space in the first group: (foo )foo # Proceeding on we get to this match: (foo )(foo)</code>There is another possible match, but the above is the one we will find, as may be verified by looking in $1.
Now let's try an example that fails, namely "foo foo fo":
# It is unambiguous how to match to here: (foo foo fo # Let's try putting the space in the first group: (foo )foo fo # It is unambiguous how to match to here: (foo )(foo fo # Let's try putting the space in the second group: (foo )(foo )fo # FAIL (foo )(foo )(fo # Perhaps don't put the space in the second group? (foo )(foo) fo # FAIL (foo )(foo )(fo # Perhaps don't put the space in the first group? (foo) foo fo # It is unambiguous how to match to here: (foo)( foo fo # Let's try putting the space in the second group: (foo)( foo )fo # FAIL (foo)( foo )(fo # Perhaps don't put the space in the second group? (foo)( foo) fo # FAIL (foo)( foo)( fo # Oops, all options failed, it doesn't match.A careful observer may note a couple of things. First of all this clearly works, the algorithm is not very hard to follow and clearly must find an answer. Secondly it seems to take a lot less work to find a match than it does to prove that there is no match.
A very clever observer might notice that the first observation is wrong, and the second is worse than it seems in this example.
How many times do you reach the end and FAIL if you try to match "foo foo foo fo"? Well the brute force solution will try every combination of ways to put the first space in the first group or not, the second in the second group or not, and the third in the third group or not. So we go from 4 to 8.
If you try to match "foo foo foo foo fo" then you add one more choice, so you go from 8 to 16.
If you try to match the string ("foo " x 100) . "fo" then it will FAIL 2**100 times, which is longer than the expected lifetime of our planet and definitely longer than the expected lifetime of your computer, let alone your program.
Which is what the really clever observer noticed. That failing takes more work, sometimes so much so that it can't actually finish!
Now in some sense this is unavoidable. As http://perl.plover.com/NPC/ notes, figuring out when a Perl regular expression matches a string is an NP-hard problem. Therefore there must be exponential failure modes. Life sucks. Sorry.
But what Ilya Zakharevich realized is that while it is true that there must be exponential slowdowns, it is not true that we must make them be so easy to hit. Very few actual bug reports use backreferences, or any other feature that requires matches to be slow. So why can't they be sped up?
Now to be fair, he was not the first with this revelation. Most decent RE engines have logic in place to avoid some common instances of this problem. But his solution was novel.
What is the problem we are trying to solve? The problem is that we wind up revisiting sets of choices which can't possibly make a difference about where we match. For instance in the example above, who cares which grouping each space goes in? It obviously won't affect whether you match. What we would like is some way to say, "I've tried this reasoning before, and it didn't work that time so don't try it again."
Ilya's answer for how to do this was quite simple (though as that link proves, the implementation was not quite so simple nor was the initial try bug free). His idea is that you could track what combinations (position in REx, position in string) you had visited. If you encounter a combination that you have been to before, then you'll get the answer that you got before, which must have been failure because you're here again! OK, you can't always do this, there are constructs (eg backreferences) that ruin your reasoning (because what you put into a grouping might affect whether you match), but this certainly covers most common REs.
Let's try this optimized version on the failure case above. Here is what happens:
# It is unambiguous how to match to here: (foo foo fo # Let's try putting the space in the first group: (foo )foo fo # It is unambiguous how to match to here: (foo )(foo fo # Let's try putting the space in the second group: (foo )(foo )fo # FAIL (foo )(foo )(fo # Perhaps don't put the space in the second group? (foo )(foo) fo # FAIL (been here before) (foo )(foo )(fo # Perhaps don't put the space in the first group? (foo) foo fo # FAIL (been here before) (foo)( foo fo # Oops, all options failed, it doesn't match.The amount of reasoning just went down, and we only have to FAIL 3 times, not 4. For the string "foo foo foo fo" we FAIL 4 times instead of 8. And "foo " x 100) . "fo" will fail 101 instead of 2**100 times, so will actually finish!
The cost, unfortunately, is that we need to have a vector of bits in which we track what combinations we've been to. Just allocating that space and flipping bits is significant overhead on the success case. So we get rid of bug reports but lose benchmarks.
UPDATE: blakem is correct, I was missing a * in my regex. That's what I get for posting after midnight after a long day. Fixed.