Regex optimizations

Ovid has asked for the wisdom of the Perl Monks concerning the following question:

I'm desperately waiting for the second edition of Mastering Regular Expressions to come out. Until that time, there's a minor regex question that puzzles me.

In MRE, Friedl wrote that \d\d is more efficient than \d{2} because the second forces the re engine to count the occurences (thus incurring extra overhead) and the first simply allows it to match each in turn. However, it seems that this is a simple optimization. Does anyone know if this has happened in the newer versions of the re engine? I feel that \d{2} is more clear and generally use it, unless I am forced to optimize.

Cheers,
Ovid

Join the Perlmonks Setiathome Group or just go the the link and check out our stats.

Comment on Regex optimizations

Replies are listed 'Best First'.
Re: Regex optimizations by tedv (Pilgrim) on Oct 25, 2000 at 00:14 UTC
Using 5.005_03, I got almost identical times with the following code: (41 seconds for `\d{2}` and 40 seconds for `\d\d`. But this could be a second granularity problem.) `use strict; speed_test('\d{2}', sub { shift() =~ /\d{2}/ }); speed_test('\d\d', sub { shift() =~ /\d\d/ }); sub speed_test { my ($name, $func) = @_; my $start = time; my $index = 10000000; while ($index-- > 0) { &$func($index); } my $end = time - $start; print "Total time for $name: $end\n"; }` [download] It just occured to me that the subroutine indirection might be the bottleneck, but when I wrote a new version, the times were also identical. Additionally, I tested other numbers such as `\d{5}` (and corresponding `\d\d\d\d\d`) but this had no effect on the timing either. YMMV. -Ted	[reply] [d/l] [select]
(Ovid) RE(2): Regex optimizations by Ovid (Cardinal) on Oct 25, 2000 at 01:16 UTC
Some of you may have wondered why I didn't just benchmark it myself. Because I was being stupid. I looked at tedv's benchmark and couldn't believe I hadn't bothered to check. Blame it on over two-weeks straight of working without a day off :( A critical portion of regex optmization is what occurs on failure. I created a string which had at least one failure for the regex and then benchmarked the comparisons: `#!/usr/bin/perl -w use strict; use Benchmark; my ( $re, $test ); timethese(-30, { regex1 => '$re = "a" . \'\d\'x500 . "a"; $test = ("a"x2000 . "1"x499)x2 . "a" . "1"x500 . ("a"x +2000 . "1"x499)x2 ; $test =~ /$re/o;', regex2 => '$re = "a" . \'\d{500}\' . "a"; $test = ("a"x2000 . "1"x499)x2 . "a" . "1"x500 . ("a"x +2000 . "1"x499)x2 ; $test =~ /$re/o;', });` [download] Results: `Benchmark: running regex1, regex2, each for at least 30 CPU seconds... regex1: 31 wallclock secs (31.31 usr + 0.00 sys = 31.31 CPU) @ 65 +5.10/s (n=20508) regex2: 31 wallclock secs (30.84 usr + 0.00 sys = 30.84 CPU) @ 53 +0.35/s (n=16358)` [download] After running this a couple of times, I see that `\d{2}` is less efficient than `\d\d`, though not by much. If there is terribly convoluted data I am iterating over, though, it could be an issue. I still don't see why I didn't benchmark it before asking. Thanks for slapping sense into me, tedv :) Cheers, Ovid Join the Perlmonks Setiathome Group or just go the the link and check out our stats.	[reply] [d/l] [select]
RE: Regex optimizations by dchetlin (Friar) on Oct 25, 2000 at 01:13 UTC
This is a good question. The answer is that the REx engine does not optimize `\d{2}` into `\d\d`. Here's how to check: [~] $ perl -Mre=debug -e'qr/\d{2}/' Freeing REx: `,' Compiling REx `\d{2}' size 4 first at 3 1: CURLY {2,2}(4) 3: DIGIT(0) 4: END(0) stclass `DIGIT' minlen 2 Freeing REx: `\d{2}' [~] $ perl -Mre=debug -e'qr/\d\d/' Freeing REx: `,' Compiling REx `\d\d' size 3 first at 1 1: DIGIT(2) 2: DIGIT(3) 3: END(0) stclass `DIGIT' minlen 2 Freeing REx: `\d\d' [download] Why doesn't it? I don't know. But as the Benchmarks above have noted, I can't imagine that difference being notable enough to really worry about. -dlc	[reply] [d/l] [select]
Re: Regex optimizations by japhy (Canon) on Oct 26, 2000 at 16:10 UTC
Agh! I had a response to this! Where'd it go?! Sigh, it's too early for this. Geez, and I deleted the benchmark I had on my computer. Well, you'll see it later today, then. I basically has to do with how `\d{x,y}` is better than `\d\d\d\d...` when you're using a range like that. `$_="goto+F.print+chop;\n=yhpaj";F1:eval`	[reply]

Back to Seekers of Perl Wisdom