|XP is just a number|
Re: Why this simple regex freeze my computer? (backtracking)by lodin (Hermit)
|on Dec 15, 2007 at 12:06 UTC||Need Help??|
shmem has already pointed out that there are other probably better ways to solve this.
It's still interesting, though, to ponder why a simple change from * to + can cause the pattern to run for a very long time.
The problem is the last line:
It makes the pattern fail. But the regex engine doesn't give up so easily, and so it tries to backtrack each quantifier in the pattern before that. Your biggest problem here is probably that .+? can match too much. It can pass a <td></td> tag, and then try to match again. For instance, instead of matching the second last, it can skip that and try to match the very last, but it will definately fail that too. Without an ^ anchor the engine can also try to start matching at the second <td></td> tag, and so on. It sums up to an awful lot of backtracking.
The easiest way to solve this is to use the (?>) assertion. It's the "anti-backtrack" assertion. Once the pattern inside the assertion has matched, it won't backtrack. The match is freezed. (The engine may of course backtrack behind the assertion, and then it'll try to match again at the new place.) So a line in your pattern may look like
depending on what you really need. I hope I managed to be somewhat clear and understandable.
If an HTML parser isn't doing the trick for you, you might want to consider using split or a m//g to extract relevant parts, and then handle those separately. That way you easily avoid a lot of backtracking, if the patterns become more sophisticated.