Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change

Regex optimizations

by Ovid (Cardinal)
on Oct 24, 2000 at 23:05 UTC ( #38175=perlquestion: print w/replies, xml ) Need Help??
Ovid has asked for the wisdom of the Perl Monks concerning the following question:

I'm desperately waiting for the second edition of Mastering Regular Expressions to come out. Until that time, there's a minor regex question that puzzles me.

In MRE, Friedl wrote that \d\d is more efficient than \d{2} because the second forces the re engine to count the occurences (thus incurring extra overhead) and the first simply allows it to match each in turn. However, it seems that this is a simple optimization. Does anyone know if this has happened in the newer versions of the re engine? I feel that \d{2} is more clear and generally use it, unless I am forced to optimize.


Join the Perlmonks Setiathome Group or just go the the link and check out our stats.

Replies are listed 'Best First'.
Re: Regex optimizations
by tedv (Pilgrim) on Oct 25, 2000 at 00:14 UTC
    Using 5.005_03, I got almost identical times with the following code: (41 seconds for \d{2} and 40 seconds for \d\d. But this could be a second granularity problem.)

    use strict; speed_test('\d{2}', sub { shift() =~ /\d{2}/ }); speed_test('\d\d', sub { shift() =~ /\d\d/ }); sub speed_test { my ($name, $func) = @_; my $start = time; my $index = 10000000; while ($index-- > 0) { &$func($index); } my $end = time - $start; print "Total time for $name: $end\n"; }

    It just occured to me that the subroutine indirection might be the bottleneck, but when I wrote a new version, the times were also identical. Additionally, I tested other numbers such as \d{5} (and corresponding \d\d\d\d\d) but this had no effect on the timing either.


      Some of you may have wondered why I didn't just benchmark it myself. Because I was being stupid. I looked at tedv's benchmark and couldn't believe I hadn't bothered to check. Blame it on over two-weeks straight of working without a day off :(

      A critical portion of regex optmization is what occurs on failure. I created a string which had at least one failure for the regex and then benchmarked the comparisons:

      #!/usr/bin/perl -w use strict; use Benchmark; my ( $re, $test ); timethese(-30, { regex1 => '$re = "a" . \'\d\'x500 . "a"; $test = ("a"x2000 . "1"x499)x2 . "a" . "1"x500 . ("a"x +2000 . "1"x499)x2 ; $test =~ /$re/o;', regex2 => '$re = "a" . \'\d{500}\' . "a"; $test = ("a"x2000 . "1"x499)x2 . "a" . "1"x500 . ("a"x +2000 . "1"x499)x2 ; $test =~ /$re/o;', });
      Benchmark: running regex1, regex2, each for at least 30 CPU seconds... regex1: 31 wallclock secs (31.31 usr + 0.00 sys = 31.31 CPU) @ 65 +5.10/s (n=20508) regex2: 31 wallclock secs (30.84 usr + 0.00 sys = 30.84 CPU) @ 53 +0.35/s (n=16358)
      After running this a couple of times, I see that \d{2} is less efficient than \d\d, though not by much. If there is terribly convoluted data I am iterating over, though, it could be an issue.

      I still don't see why I didn't benchmark it before asking. Thanks for slapping sense into me, tedv :)


      Join the Perlmonks Setiathome Group or just go the the link and check out our stats.

RE: Regex optimizations
by dchetlin (Friar) on Oct 25, 2000 at 01:13 UTC
    This is a good question. The answer is that the REx engine does not optimize \d{2} into \d\d.

    Here's how to check:

    [~] $ perl -Mre=debug -e'qr/\d{2}/' Freeing REx: `,' Compiling REx `\d{2}' size 4 first at 3 1: CURLY {2,2}(4) 3: DIGIT(0) 4: END(0) stclass `DIGIT' minlen 2 Freeing REx: `\d{2}' [~] $ perl -Mre=debug -e'qr/\d\d/' Freeing REx: `,' Compiling REx `\d\d' size 3 first at 1 1: DIGIT(2) 2: DIGIT(3) 3: END(0) stclass `DIGIT' minlen 2 Freeing REx: `\d\d'

    Why doesn't it? I don't know. But as the Benchmarks above have noted, I can't imagine that difference being notable enough to really worry about.


Re: Regex optimizations
by japhy (Canon) on Oct 26, 2000 at 16:10 UTC
    Agh! I had a response to this! Where'd it go?!

    Sigh, it's too early for this. Geez, and I deleted the benchmark I had on my computer. Well, you'll see it later today, then. I basically has to do with how \d{x,y} is better than \d\d\d\d... when you're using a range like that.


Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://38175]
Approved by root
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (7)
As of 2018-01-16 17:30 GMT
Find Nodes?
    Voting Booth?
    How did you see in the new year?

    Results (186 votes). Check out past polls.