Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Re: Why this simple regex freeze my computer? (backtracking)

by lodin (Hermit)
on Dec 15, 2007 at 12:06 UTC ( #657187=note: print w/replies, xml ) Need Help??


in reply to Why this simple regex freeze my computer?

shmem has already pointed out that there are other probably better ways to solve this.

It's still interesting, though, to ponder why a simple change from * to + can cause the pattern to run for a very long time.

The problem is the last line:

<td class='list_cell' valign='middle' style="font-size:120%;"></td>
It makes the pattern fail. But the regex engine doesn't give up so easily, and so it tries to backtrack each quantifier in the pattern before that. Your biggest problem here is probably that .+? can match too much. It can pass a <td></td> tag, and then try to match again. For instance, instead of matching the second last, it can skip that and try to match the very last, but it will definately fail that too. Without an ^ anchor the engine can also try to start matching at the second <td></td> tag, and so on. It sums up to an awful lot of backtracking.

The easiest way to solve this is to use the (?>) assertion. It's the "anti-backtrack" assertion. Once the pattern inside the assertion has matched, it won't backtrack. The match is freezed. (The engine may of course backtrack behind the assertion, and then it'll try to match again at the new place.) So a line in your pattern may look like

(?><td[^>]*>\s*(.+?)\s*</td>\s*)
depending on what you really need. I hope I managed to be somewhat clear and understandable.

If an HTML parser isn't doing the trick for you, you might want to consider using split or a m//g to extract relevant parts, and then handle those separately. That way you easily avoid a lot of backtracking, if the patterns become more sophisticated.

lodin

Replies are listed 'Best First'.
Re^2: Why this simple regex freeze my computer? (backtracking)
by shmem (Chancellor) on Dec 15, 2007 at 12:25 UTC
    ++Nice explanation, and another reason why you shouldn't use regexes to parse HTML. One single error (or one single change in the html) can freeze your box and spoil your carefully crafted regex. As matija very succinctly said,
    By the time you've resolved all those problems, you've written the better part of a HTML parser.

    Of course you learn a lot trying ;-)

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://657187]
help
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (2)
As of 2018-01-21 11:13 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    How did you see in the new year?










    Results (227 votes). Check out past polls.

    Notices?