vitoco has asked for the wisdom of the Perl Monks concerning the following question:
I want to know which of the following regexps to remove spaces around commas is more efficient:
$text =~ s/\s+,\s+|\s+,|,\s+/,/g;
$text =~ s/\s*,\s*/,/g;
Common sense (not Common::Sense) tells me that the first one should be faster on large texts with more commas than spaces, because less replaces would be done, but I don't know how to measure that.
Hints?
Re: Comparing Regular Expressions
by toolic (Bishop) on Aug 31, 2009 at 21:40 UTC
|
| [reply] |
Re: Comparing Regular Expressions
by JavaFan (Canon) on Sep 01, 2009 at 00:39 UTC
|
When writing efficient regular expressions, one of the most important rules is: reduce the number of times the regexp engine starts a match, and then has to backtrack. Another important rule is: let the optimizer do the work. These rules may conflict. ;-)
\s*,\s* offers a lot of opportunity for backtracking. In principle, the engine can start matching at any place - although the optimizer will prevent that from happening. But still, it will match at places where there's a comma without a surrounding space. The longer regexp will only start matching at spaces and commas. However, the alteration may make it harder for the optimizer.
So, the only way to be sure it to benchmark. Using data (and perl version) that you'll be working on for real. As the data may actually decide which one is better. | [reply] [d/l] |
Re: Comparing Regular Expressions
by ig (Vicar) on Aug 31, 2009 at 21:54 UTC
|
$text =~ s/\s+,\s*|,\s+/,/g;
The regular expression engine is evolving. You might get quite different relative performance, depending on the version of perl you are using. | [reply] [d/l] |
Re: Comparing Regular Expressions
by hexcoder (Curate) on Aug 31, 2009 at 21:46 UTC
|
| [reply] |
Re: Comparing Regular Expressions
by Anonymous Monk on Aug 31, 2009 at 21:59 UTC
|
#!/usr/bin/perl --
use strict;
use warnings;
use Benchmark 'cmpthese';
{
my $small = join'', map { ' ' x (int rand 2) . ','. (int rand 2)
+.$_ } 0 .. 1_000;
cmpthese(-3,{
'small1' => sub { my $f = $small; $f =~ s/\s+,\s+|\s+,|,\s+/,/
+g; },
'small2' => sub { my $f = $small; $f =~ s/\s*,\s*/,/g; },
});
undef $small;
}
print "\n";
{
# 1_000_000 was too much for my machine
my $bigg = join'', map { ' ' x (int rand 2) . ','. (int rand 2).
+$_ } 0 .. 200_000;
cmpthese(-3,{
'bigg1' => sub { my $f = $bigg; $f =~ s/\s+,\s+|\s+,|,\s+/,/g;
+ },
'bigg2' => sub { my $f = $bigg; $f =~ s/\s*,\s*/,/g; },
});
undef $bigg;
}
__END__
perl 5.10.1
Rate small1 small2
small1 365/s -- -69%
small2 1171/s 221% --
Rate bigg1 bigg2
bigg1 1.27/s -- -77%
bigg2 5.41/s 327% --
perl 5.8.8
Rate small1 small2
small1 785/s -- -51%
small2 1587/s 102% --
Rate bigg1 bigg2
bigg1 2.59/s -- -62%
bigg2 6.86/s 165% --
| [reply] [d/l] |
|
Question: Are both versions of perlrunning in the same box? If so, 5.8.8 seems to be faster that 5.10.1 for both regexps!
From these results, s/\s*,\s*/,/ is faster than the longer one. Backtracking is better than alternations? Replacing always is better than selective ones?
| [reply] [d/l] [select] |
Re: Comparing Regular Expressions
by vitoco (Hermit) on Sep 01, 2009 at 15:44 UTC
|
Thanks to everyone.
I did my own benchmark based on your suggestions, and found the following:
#!perl -w
use strict;
use warnings;
use Benchmark 'cmpthese';
my $test = join '', map { ' ' x (0 == int rand 9) . ',' . ' ' x ((0 =
+= int rand 3) * int rand 3) . $_ } 0 .. 1_000;
cmpthese(-10,{
'test1' => sub { my $f = $test; $f =~ s/\s+,\s+|\s+,|,\s+/,/g; },
'testA' => sub { my $f = $test; $f =~ s/\s+,\s+/,/g; $f =~ s/\s+,/,/
+g; $f =~ s/,\s+/,/g; },
'test2' => sub { my $f = $test; $f =~ s/\s+,\s*|,\s+/,/g; },
'testB' => sub { my $f = $test; $f =~ s/\s+,\s*/,/g; $f =~ s/,\s+/,/
+g; },
'test3' => sub { my $f = $test; $f =~ s/\s*,\s*/,/g; },
});
Results using ActivePerl 5.10.0 on WinXP-SP3:
1_000:
Rate test1 test2 test3 testA testB
test1 974/s -- -22% -56% -66% -71%
test2 1256/s 29% -- -43% -56% -63%
test3 2221/s 128% 77% -- -22% -35%
testA 2852/s 193% 127% 28% -- -16%
testB 3412/s 250% 172% 54% 20% --
10_000:
Rate test1 test2 test3 testA testB
test1 79/s -- -24% -62% -72% -77%
test2 103/s 31% -- -50% -63% -70%
test3 205/s 161% 99% -- -26% -39%
testA 277/s 252% 169% 35% -- -18%
testB 339/s 331% 228% 65% 22% --
100_000:
Rate test1 test2 test3 testA testB
test1 6.5/s -- -26% -67% -76% -81%
test2 8.9/s 36% -- -55% -68% -74%
test3 19.6/s 200% 122% -- -28% -42%
testA 27.3/s 319% 209% 39% -- -19%
testB 33.6/s 414% 279% 71% 23% --
Results are listed from slower to faster regexp(s).
Tests 1 and 3 are regexps from my original question, and the intermediate 2's regexp was supplied by ig. A and B are 1 and 2 respectively, but splitted on alternations (replace in phases).
Possible quick conclusions:
- It's better to use backtracking than alternations.
- It's better to split a regexp with an alternation whenever is possible.
- Itīs better to replace a pattern by the same thing instead of doing some effort to avoid that.
Comments?
| [reply] [d/l] [select] |
|
I wonder why testA consists of three substitutions. Don't the latter two cover all cases already?
| [reply] |
|
'testC' => sub { my $f = $test; $f =~ s/\s+,/,/g; $f =~ s/,\s+/,/g;
+},
100_000:
Rate test1 test2 test3 testA testC testB
test1 6.5/s -- -28% -66% -76% -80% -81%
test2 9.0/s 39% -- -53% -67% -73% -73%
test3 19.1/s 196% 113% -- -29% -42% -43%
testA 27.0/s 318% 201% 41% -- -18% -19%
testC 33.1/s 412% 269% 73% 23% -- -1%
testB 33.3/s 416% 272% 74% 23% 1% --
C is 23% faster than A, even running twice over some of the commas in the test string. B doesn't touch a comma twice, but is only 1% faster than C.
| [reply] [d/l] |
Re: Comparing Regular Expressions
by Sewi (Friar) on Aug 31, 2009 at 21:51 UTC
|
There are some good benchmarking modules, but there may be easier ways: perl -le '$text = <STDIN>; $T=time; $TestText=$text; for (1..$ARGV[0])
+ { $text =~ s/\s+,\s+|\s+,|,\s+/,/g; } print time - $T; $T=time; $Tes
+tText=$text; for (1..$ARGV[0]) { $text =~ s/\s*,\s*/,/g; } print time
+ - $T;' 1000000 </tmp/text.file
This oneliner just runs your two regexps on the same text for a specific number of loop runs. Use a number which takes at least 10, better more, seconds for each loop.
This could also be improved by using Time::HiRes. | [reply] [d/l] |
|
Your benchmark is bogus.
Realize that s/// actually modifies the string it's operating on. After the first iteration, there will be no spaces around commas left in $text.
Furthermore, you assign to $TestText, but you never actually use its value.
| [reply] [d/l] [select] |
|
You're completly right. Thank you for the hints.
perl -le '$text = <STDIN>; $T=time; for (1..$ARGV[0]) { my $TestText =
+ $text; $TestText =~ s/\s+,\s+|\s+,|,\s+/,/g; } print time - $T; $T=t
+ime; $TestText=$text; for (1..$ARGV[0]) { my $TestText = $text; $Test
+Text =~ s/\s*,\s*/,/g; } print time - $T;' 1000000 </tmp/text.file
This should correct all bugs. | [reply] [d/l] |
|
|