Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot

slow regular expressions

by jesuashok (Curate)
on Aug 08, 2006 at 04:34 UTC ( #566070=perlquestion: print w/replies, xml ) Need Help??
jesuashok has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

Today morning I was a reading Re: Timeout alarm for regex, by wfsp. I am very curious to know why the below regex is slow.

$_ = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"; /a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*[b]/; print "Hello\n";

Could anyone explain me that regex in all perspective ?

Update:Code is added.

"Keep pouring your ideas"

Replies are listed 'Best First'.
Re: slow regular expressions
by ikegami (Pope) on Aug 08, 2006 at 05:18 UTC
    It attemps to match and backtracks when it can't find a match. Here's a sample of the matches (one per line) the regexp engine attempts.
    1) a*->"a"x61, a*->"", a*->"", a*->"", ... a*->"", [b]->fail 2) a*->"a"x60, a*->"a", a*->"", a*->"", ... a*->"", [b]->fail 3) a*->"a"x60, a*->"", a*->"a", a*->"", ... a*->"", [b]->fail 4) a*->"a"x60, a*->"", a*->"", a*->"a", ... a*->"", [b]->fail . . . 32) a*->"a"x60, a*->"", a*->"", a*->"", ... a*->"a", [b]->fail 33) a*->"a"x60, a*->"", a*->"", a*->"", ... a*->"", [b]->fail 34) a*->"a"x59, a*->"aa", a*->"", a*->"", ... a*->"", [b]->fail 35) a*->"a"x59, a*->"a", a*->"a", a*->"", ... a*->"", [b]->fail 36) a*->"a"x59, a*->"a", a*->"", a*->"a", ... a*->"", [b]->fail . . . 64) a*->"a"x59, a*->"a", a*->"", a*->"", ... a*->"a", [b]->fail 65) a*->"a"x59, a*->"a", a*->"", a*->"", ... a*->"", [b]->fail 66) a*->"a"x59, a*->"", a*->"aa", a*->"", ... a*->"", [b]->fail 67) a*->"a"x59, a*->"", a*->"a", a*->"a", ... a*->"", [b]->fail . . . 95) a*->"a"x59, a*->"", a*->"a", a*->"", ... a*->"a", [b]->fail . . .

    The character class [b] is necessary because a plain b would trigger optimizations in the regexp engine. The optimization would cause it to scan for the location of b chars in the string to match. Finding none, it would exit immediately.

Re: slow regular expressions
by blokhead (Monsignor) on Aug 08, 2006 at 05:37 UTC
    As GrandFather said, your regex is equivalent to /a*b/, so it's easy to see that the regex can never match. The slowdown is because of the backtracking done by Perl's regex engine.

    Let's look at a smaller example:

    "aaaa" =~ /a*a*b/
    Except I'll make the following changes:
    • Instead of /b/ at the end, I'll use /(?!)/, which does the same thing here (it is a regex that always fails, just like the /b/ will always fail to match anywhere in this particular string).1
    • I'll capture both /a*/'s so I can look at what the regex engine puts in them
    • Before it gets to the end where it always fails (the /(?!)/ part), I'll have it print what parts of the string it tried matching to the /a*/'s.
    The result is this:
    "aaaa" =~ m/ (a*) (a*) (?{ print "Tried $`($1)($2)\n"; }) (?!) /x;
    The output:
    Tried (aaaa)() Tried (aaa)(a) Tried (aaa)() ... ... Tried aaa()(a) Tried aaa()() Tried aaaa()()
    So you can see that it tries a way to fill $1 and $2 (and $`), fails to match /(?!)/, backtracks, and tries again repeatedly. It tries every possible way before the regex match operation finally returns with a failure. In this small example it tries 35 combinations before eventually failing.

    Now if you make the string a lot longer, and give it more ways for the prefixes to match (more /a*/ regexes to fill), it will take a whole lot longer. By my calculations, the regex you gave above will try

    24670925422945900903156716 (= 2e25)
    combinations before eventually failing. Even at a billion combinations per second, that's still 782 million years ;)

    1: If I left /b/ as the last part of the regex instead of /(?!)/, then the regex engine seemed to optimized away my print statement.

    Update: revised my big number calculation, not that it makes a huge difference. It's 61+33 choose 33 if you're curious. That's the number of ways to put 61 balls (the a's) into 34 ordered bins (32 /a*/ regexes, plus $` and $').


Re: slow regular expressions
by GrandFather (Sage) on Aug 08, 2006 at 05:13 UTC

    That is a veru inefficient version of /a*b/ which is likely to do a vast amount of backtracking!

    Update: interesting that if you remove the character class it goes very much faster. It forces backtracking where the non-character class variant manages to fail immediately.

    DWIM is Perl's answer to Gödel
Re: slow regular expressions
by gellyfish (Monsignor) on Aug 08, 2006 at 07:54 UTC

    To get an idea of what and how much work is being done to evaluate the regular expression you can stick use re 'debug'; at the beginning of your code. See the re manpage for more information.


Re: slow regular expressions
by Nevtlathiel (Friar) on Aug 08, 2006 at 09:17 UTC
    tilly wrote a very good article about how you can have this type of regex terminate before the heat death of the universe by using backtracking and how it results in a relative slow down of other regular expressions: Benchmarks aren't everything

    "Write a wise saying and your name will live forever." - Anonymous

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://566070]
Approved by McDarren
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others pondering the Monastery: (7)
As of 2017-02-20 07:14 GMT
Find Nodes?
    Voting Booth?
    Before electricity was invented, what was the Electric Eel called?

    Results (294 votes). Check out past polls.