Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number

Re: Why this simple regex freeze my computer? (backtracking)

by lodin (Hermit)
on Dec 15, 2007 at 12:06 UTC ( #657187=note: print w/replies, xml ) Need Help??

in reply to Why this simple regex freeze my computer?

shmem has already pointed out that there are other probably better ways to solve this.

It's still interesting, though, to ponder why a simple change from * to + can cause the pattern to run for a very long time.

The problem is the last line:

<td class='list_cell' valign='middle' style="font-size:120%;"></td>
It makes the pattern fail. But the regex engine doesn't give up so easily, and so it tries to backtrack each quantifier in the pattern before that. Your biggest problem here is probably that .+? can match too much. It can pass a <td></td> tag, and then try to match again. For instance, instead of matching the second last, it can skip that and try to match the very last, but it will definately fail that too. Without an ^ anchor the engine can also try to start matching at the second <td></td> tag, and so on. It sums up to an awful lot of backtracking.

The easiest way to solve this is to use the (?>) assertion. It's the "anti-backtrack" assertion. Once the pattern inside the assertion has matched, it won't backtrack. The match is freezed. (The engine may of course backtrack behind the assertion, and then it'll try to match again at the new place.) So a line in your pattern may look like

depending on what you really need. I hope I managed to be somewhat clear and understandable.

If an HTML parser isn't doing the trick for you, you might want to consider using split or a m//g to extract relevant parts, and then handle those separately. That way you easily avoid a lot of backtracking, if the patterns become more sophisticated.


Replies are listed 'Best First'.
Re^2: Why this simple regex freeze my computer? (backtracking)
by shmem (Chancellor) on Dec 15, 2007 at 12:25 UTC
    ++Nice explanation, and another reason why you shouldn't use regexes to parse HTML. One single error (or one single change in the html) can freeze your box and spoil your carefully crafted regex. As matija very succinctly said,
    By the time you've resolved all those problems, you've written the better part of a HTML parser.

    Of course you learn a lot trying ;-)


    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://657187]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others examining the Monastery: (11)
As of 2018-07-23 14:11 GMT
Find Nodes?
    Voting Booth?
    It has been suggested to rename Perl 6 in order to boost its marketing potential. Which name would you prefer?

    Results (469 votes). Check out past polls.