Comparing Regular Expressions

vitoco has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Comparing Regular Expressions by toolic (Bishop) on Aug 31, 2009 at 21:40 UTC
Benchmarking Your Code	[reply]
Re: Comparing Regular Expressions by JavaFan (Canon) on Sep 01, 2009 at 00:39 UTC
When writing efficient regular expressions, one of the most important rules is: reduce the number of times the regexp engine starts a match, and then has to backtrack. Another important rule is: let the optimizer do the work. These rules may conflict. ;-) `\s,\s` offers a lot of opportunity for backtracking. In principle, the engine can start matching at any place - although the optimizer will prevent that from happening. But still, it will match at places where there's a comma without a surrounding space. The longer regexp will only start matching at spaces and commas. However, the alteration may make it harder for the optimizer. So, the only way to be sure it to benchmark. Using data (and perl version) that you'll be working on for real. As the data may actually decide which one is better.	[reply] [d/l]
Re: Comparing Regular Expressions by ig (Vicar) on Aug 31, 2009 at 21:54 UTC
You might also consider `$text =~ s/\s+,\s*\|,\s+/,/g;` [download] The regular expression engine is evolving. You might get quite different relative performance, depending on the version of perl you are using.	[reply] [d/l]
Re: Comparing Regular Expressions by hexcoder (Curate) on Aug 31, 2009 at 21:46 UTC
Of course you can (and should) Benchmark both solutions to be sure (as Perl's regex optimizations can be non-obvious). I would expect the first regex to be more expensive because of its alternations.	[reply]
Re: Comparing Regular Expressions by Anonymous Monk on Aug 31, 2009 at 21:59 UTC
You could try this #!/usr/bin/perl -- use strict; use warnings; use Benchmark 'cmpthese'; { my $small = join'', map { ' ' x (int rand 2) . ','. (int rand 2) +.$_ } 0 .. 1_000; cmpthese(-3,{ 'small1' => sub { my $f = $small; $f =~ s/\s+,\s+\|\s+,\|,\s+/,/ +g; }, 'small2' => sub { my $f = $small; $f =~ s/\s,\s/,/g; }, }); undef $small; } print "\n"; { # 1_000_000 was too much for my machine my $bigg = join'', map { ' ' x (int rand 2) . ','. (int rand 2). +$_ } 0 .. 200_000; cmpthese(-3,{ 'bigg1' => sub { my $f = $bigg; $f =~ s/\s+,\s+\|\s+,\|,\s+/,/g; + }, 'bigg2' => sub { my $f = $bigg; $f =~ s/\s,\s/,/g; }, }); undef $bigg; } __END__ perl 5.10.1 Rate small1 small2 small1 365/s -- -69% small2 1171/s 221% -- Rate bigg1 bigg2 bigg1 1.27/s -- -77% bigg2 5.41/s 327% -- perl 5.8.8 Rate small1 small2 small1 785/s -- -51% small2 1587/s 102% -- Rate bigg1 bigg2 bigg1 2.59/s -- -62% bigg2 6.86/s 165% -- [download]	[reply] [d/l]
Re^2: Comparing Regular Expressions by vitoco (Hermit) on Sep 01, 2009 at 14:50 UTC
Question: Are both versions of `perl`running in the same box? If so, 5.8.8 seems to be faster that 5.10.1 for both regexps! From these results, `s/\s,\s/,/` is faster than the longer one. Backtracking is better than alternations? Replacing always is better than selective ones?	[reply] [d/l] [select]
Re: Comparing Regular Expressions by vitoco (Hermit) on Sep 01, 2009 at 15:44 UTC
Thanks to everyone. I did my own benchmark based on your suggestions, and found the following: #!perl -w use strict; use warnings; use Benchmark 'cmpthese'; my $test = join '', map { ' ' x (0 == int rand 9) . ',' . ' ' x ((0 = += int rand 3) * int rand 3) . $_ } 0 .. 1_000; cmpthese(-10,{ 'test1' => sub { my $f = $test; $f =~ s/\s+,\s+\|\s+,\|,\s+/,/g; }, 'testA' => sub { my $f = $test; $f =~ s/\s+,\s+/,/g; $f =~ s/\s+,/,/ +g; $f =~ s/,\s+/,/g; }, 'test2' => sub { my $f = $test; $f =~ s/\s+,\s\|,\s+/,/g; }, 'testB' => sub { my $f = $test; $f =~ s/\s+,\s/,/g; $f =~ s/,\s+/,/ +g; }, 'test3' => sub { my $f = $test; $f =~ s/\s,\s/,/g; }, }); [download] Results using ActivePerl 5.10.0 on WinXP-SP3: 1_000: Rate test1 test2 test3 testA testB test1 974/s -- -22% -56% -66% -71% test2 1256/s 29% -- -43% -56% -63% test3 2221/s 128% 77% -- -22% -35% testA 2852/s 193% 127% 28% -- -16% testB 3412/s 250% 172% 54% 20% -- 10_000: Rate test1 test2 test3 testA testB test1 79/s -- -24% -62% -72% -77% test2 103/s 31% -- -50% -63% -70% test3 205/s 161% 99% -- -26% -39% testA 277/s 252% 169% 35% -- -18% testB 339/s 331% 228% 65% 22% -- 100_000: Rate test1 test2 test3 testA testB test1 6.5/s -- -26% -67% -76% -81% test2 8.9/s 36% -- -55% -68% -74% test3 19.6/s 200% 122% -- -28% -42% testA 27.3/s 319% 209% 39% -- -19% testB 33.6/s 414% 279% 71% 23% -- [download] Results are listed from slower to faster regexp(s). Tests 1 and 3 are regexps from my original question, and the intermediate 2's regexp was supplied by ig. A and B are 1 and 2 respectively, but splitted on alternations (replace in phases). Possible quick conclusions: It's better to use backtracking than alternations. It's better to split a regexp with an alternation whenever is possible. It�s better to replace a pattern by the same thing instead of doing some effort to avoid that. Comments?	[reply] [d/l] [select]
Re^2: Comparing Regular Expressions by JavaFan (Canon) on Sep 01, 2009 at 15:55 UTC
I wonder why testA consists of three substitutions. Don't the latter two cover all cases already?	[reply]
Re^3: Comparing Regular Expressions by vitoco (Hermit) on Sep 01, 2009 at 16:37 UTC
Good point! I just wanted to compare equivalent substitutions. Adding testC: `'testC' => sub { my $f = $test; $f =~ s/\s+,/,/g; $f =~ s/,\s+/,/g; +}, 100_000: Rate test1 test2 test3 testA testC testB test1 6.5/s -- -28% -66% -76% -80% -81% test2 9.0/s 39% -- -53% -67% -73% -73% test3 19.1/s 196% 113% -- -29% -42% -43% testA 27.0/s 318% 201% 41% -- -18% -19% testC 33.1/s 412% 269% 73% 23% -- -1% testB 33.3/s 416% 272% 74% 23% 1% --` [download] C is 23% faster than A, even running twice over some of the commas in the test string. B doesn't touch a comma twice, but is only 1% faster than C.	[reply] [d/l]
Re: Comparing Regular Expressions by Sewi (Friar) on Aug 31, 2009 at 21:51 UTC
There are some good benchmarking modules, but there may be easier ways: `perl -le '$text = <STDIN>; $T=time; $TestText=$text; for (1..$ARGV[0]) + { $text =~ s/\s+,\s+\|\s+,\|,\s+/,/g; } print time - $T; $T=time; $Tes +tText=$text; for (1..$ARGV[0]) { $text =~ s/\s,\s/,/g; } print time + - $T;' 1000000 </tmp/text.file` [download] This oneliner just runs your two regexps on the same text for a specific number of loop runs. Use a number which takes at least 10, better more, seconds for each loop. This could also be improved by using Time::HiRes.	[reply] [d/l]
Re^2: Comparing Regular Expressions by JavaFan (Canon) on Sep 01, 2009 at 00:31 UTC
Your benchmark is bogus. Realize that `s///` actually modifies the string it's operating on. After the first iteration, there will be no spaces around commas left in `$text`. Furthermore, you assign to `$TestText`, but you never actually use its value.	[reply] [d/l] [select]
Re^3: Comparing Regular Expressions by Sewi (Friar) on Sep 01, 2009 at 07:13 UTC
You're completly right. Thank you for the hints. `perl -le '$text = <STDIN>; $T=time; for (1..$ARGV[0]) { my $TestText = + $text; $TestText =~ s/\s+,\s+\|\s+,\|,\s+/,/g; } print time - $T; $T=t +ime; $TestText=$text; for (1..$ARGV[0]) { my $TestText = $text; $Test +Text =~ s/\s,\s/,/g; } print time - $T;' 1000000 </tmp/text.file` [download] This should correct all bugs.	[reply] [d/l]


Do you know where your variables are?
	PerlMonks