http://www.perlmonks.org?node_id=149357

Aristotle has asked for the wisdom of the Perl Monks concerning the following question:

I just read the section on regexes in Camel 3 a few days ago and now I understand why Ovid posted Death to Dot Star! many moons ago. Now, I was trying to come up with a universal way to express a backtracking free vesion of a regex. Personally, I almost never use .* alone - it's .*? in 99.9% of my regexes. What I'm looking for is a way to convert a .*?xyz into a backtracking free construct where xyz may be an arbitrarily long string.

So let's say I have /(.*?)\send;/. The "simple" way I came up with is /((?!\send;).)*\send;/. But, that's still a dot metacharacter and besides, to me it doesn't seem to me to really do any less work than the .*? (it just reverses the order of work by testing for \send; first before eating the next character, where the engine would eat the character and then test). Obviously that's not what I'm after. My next idea was /(\S|(?!\send;)\s)*\send;/. Is that as much as I can hope for?

This generalizes to testing for whether we have either the negated first character out of the constant string or not the constant string in a lookahead in front of the first character out of the string. The savings is when we can eat a character due to the first case (not equal to the first character in the string) because we don't need any string comparison and can look at just the character in question. A subtle difference as noted by Ovid in his node is that a negated character class will eat newlines, which the dot would not usually do.

Assuming I haven't made any grave mistakes or flawed assumptions, then forgive my hubris but shouldn't the regex engine be capable of optimizing this case without further teeth clenching on the programmer's side since every .*?xyz generalizes to (^x|(?!xyz)x)*xyz? (In fact, I don't see why it shouldn't work for arbitrary regexes in place of just a constant xyz string.)

Can someone enlighten me?

Makeshifts last the longest.

Replies are listed 'Best First'.
Re (tilly) 1: Avoiding regex backtracking
by tilly (Archbishop) on Mar 05, 2002 at 14:00 UTC
    In this case there should be no win to trying to optimize. The RE engine is smart enough to find a much better optimization behind your back. (Find the fixed string through Boyer-Moore, match the RE around that optimization.)

    In general while backtracking is good to know about, most of the time it is not a problem. The exceptions are cases like this:

    foreach my $i (1..40) { slow_match($i); } sub slow_match { my $count = shift || 20; my $str = ("yada " x $count) . "yad"; print "Trying to match $count iterations.."; die "Huh?" if $str =~ /^(\s*yada\s*)*$/; print "Done\n"; }
    As for the optimization you mention, for ones which do not use "special features" (eg backreferences within the match, lookaheads, etc) it is possible to execute any match in guaranteed time. Perl does not, however, do this...
      This snippet shows very different behavior in 5.6.1 and 5.00503. In the earlier version I see the exponential execution time, but in 5.6.1 it seems to do *much* better. Do you know of any improvements in perl that might account for this?

      -Blake

        Yes.

        In 5.6 the engine keeps track of how long "complex RE's" are taking to match. If it is too long, then it redoes it while keeping a history of visited positions. This will reduce the worst case scenario from exponential to polynomial (unless, of course, you use backreferences).

        There is, however, an optimization that I know of which would turn this into straight linear behaviour. I gave a basic sketch of the idea at Re (tilly) 1: Research ideas, but to the best of my knowledge nobody has ever implemented it in any real RE engine...

•Re: Avoiding regex backtracking
by merlyn (Sage) on Mar 05, 2002 at 13:56 UTC
    From the Fine Manual (perlre):
    "(?>pattern)" WARNING: This extended regular expression fea- ture is considered highly experimental, and may be changed or deleted without notice. An "independent" subexpression, one which matches the substring that a standalone "pat- tern" would match if anchored at the given posi- tion, and it matches nothing other than this substring. This construct is useful for opti- mizations of what would otherwise be "eternal" matches, because it will not backtrack (see the section on "Backtracking"). It may also be use- ful in places where the "grab all you can, and do not give anything back" semantic is desirable. For example: "^(?>a*)ab" will never match, since "(?>a*)" (anchored at the beginning of string, as above) will match all characters "a" at the beginning of string, leaving no "a" for "ab" to match. In contrast, "a*ab" will match the same as "a+b", since the match of the subgroup "a*" is influenced by the following group "ab" (see the section on "Backtracking"). In particular, "a*" inside "a*ab" will match fewer characters than a standalone "a*", since this makes the tail match. An effect similar to "(?>pattern)" may be achieved by writing "(?=(pattern))\1". This matches the same substring as a standalone "a+", and the following "\1" eats the matched string; it therefore makes a zero-length assertion into an analogue of "(?>...)". (The difference between these two constructs is that the second one uses a capturing group, thus shifting ordi- nals of backreferences in the rest of a regular expression.) ...

    -- Randal L. Schwartz, Perl hacker

Re: Avoiding regex backtracking
by japhy (Canon) on Mar 05, 2002 at 17:07 UTC
    You are correct that the pattern .*?foo can be written as (?:[^f]+|f(?!oo))*foo, or the variant you showed. You are correct in assuming that this general principle can be used for practically any regex.

    However, you should "unroll the loop" of that regex. Look at these results:

    use Benchmark 'cmpthese'; my $str = "this is the best end"; cmpthese(-5, { extra => sub { $str =~ /(?:[^e]+|e(?!nd))*end/ }, plain => sub { $str =~ /.*?end/ }, }); __END__ extra: 12019.37/s (n=60818) plain: 49615.59/s (n=283305) Rate extra plain extra 12019/s -- -76% plain 49616/s 313% --
    If we unroll the loop, we get better results.
    use Benchmark 'cmpthese'; my $str = "this is the best end"; cmpthese(-5, { extra => sub { $str =~ /(?:[^e]+|e(?!nd))*end/ }, plain => sub { $str =~ /.*?end/ }, uroll => sub { $str =~ /[^e]*(?:e(?!nd)[^e]*)*end/ }, }); __END__ extra: 11711.62/s (n=61486) plain: 49329.33/s (n=250593) uroll: 17985.61/s (n=94964) Rate extra uroll plain extra 11712/s -- -35% -76% uroll 17986/s 54% -- -64% plain 49329/s 321% 174% --
    But the point is, the internal optimization uses less regex op-codes, and so it is faster -- it has the engine do the work itself, it doesn't make the regex do the work.

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a (from-home) job
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

      I did in fact expect the dot-star-free version to eat up a lot of its "savings" due to the much higher complexity. I was also pretty sure that the regex engine would be using some special opimization for this case as it's so easy to spot. I just really should get into the habit of Benchmarking..

      Cheers everyone for very helpful replies.

      But basically that leaves me at square one: so when should I avoid dot star at least as a rule of thumb? What kind of construct should make me wary at first sight, what could be a red flag (as in the Perl.com program repair shop series) to look for? I am beginning to see a pattern (pun intended) I think - all examples I can recall off the top of my head use of several dot stars, causing excessive backtracking. Does that sound right?

      Makeshifts last the longest.

Re: Avoiding regex backtracking
by erikharrison (Deacon) on Mar 05, 2002 at 15:32 UTC

    I've never actually used it, so I may be shooting myself in the back here, but at http://perl.plover.com/Regex/ is regex.pm, a teaching module which uses a DFA algorithm which (or course) doesn't backtrack - ever. DFA's are faster than NFA's but we don't get a lot of neat stuff. I avoid using the package because I trust Perl to perform losts of complex optimizations behind my back, using math I don't understand, most of the time. For this reason it has been suggested that Perl's regexes be renamed IRregular expressions, because Perl uses a mish-mash of algorithms to optimize for the common case.

    If you use Regex.pm /tell me about it. I'd love to know how it went.

    Cheers,
    Erik