Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Comparing Regular Expressions

by vitoco (Pilgrim)
on Aug 31, 2009 at 21:24 UTC ( #792470=perlquestion: print w/ replies, xml ) Need Help??
vitoco has asked for the wisdom of the Perl Monks concerning the following question:

I want to know which of the following regexps to remove spaces around commas is more efficient:

$text =~ s/\s+,\s+|\s+,|,\s+/,/g; $text =~ s/\s*,\s*/,/g;

Common sense (not Common::Sense) tells me that the first one should be faster on large texts with more commas than spaces, because less replaces would be done, but I don't know how to measure that.

Hints?

Comment on Comparing Regular Expressions
Download Code
Re: Comparing Regular Expressions
by toolic (Chancellor) on Aug 31, 2009 at 21:40 UTC
Re: Comparing Regular Expressions
by hexcoder (Hermit) on Aug 31, 2009 at 21:46 UTC
    Of course you can (and should) Benchmark both solutions to be sure (as Perl's regex optimizations can be non-obvious).

    I would expect the first regex to be more expensive because of its alternations.

Re: Comparing Regular Expressions
by Sewi (Friar) on Aug 31, 2009 at 21:51 UTC
    There are some good benchmarking modules, but there may be easier ways:
    perl -le '$text = <STDIN>; $T=time; $TestText=$text; for (1..$ARGV[0]) + { $text =~ s/\s+,\s+|\s+,|,\s+/,/g; } print time - $T; $T=time; $Tes +tText=$text; for (1..$ARGV[0]) { $text =~ s/\s*,\s*/,/g; } print time + - $T;' 1000000 </tmp/text.file
    This oneliner just runs your two regexps on the same text for a specific number of loop runs. Use a number which takes at least 10, better more, seconds for each loop.
    This could also be improved by using Time::HiRes.
      Your benchmark is bogus.

      Realize that s/// actually modifies the string it's operating on. After the first iteration, there will be no spaces around commas left in $text.

      Furthermore, you assign to $TestText, but you never actually use its value.

        You're completly right. Thank you for the hints.

        perl -le '$text = <STDIN>; $T=time; for (1..$ARGV[0]) { my $TestText = + $text; $TestText =~ s/\s+,\s+|\s+,|,\s+/,/g; } print time - $T; $T=t +ime; $TestText=$text; for (1..$ARGV[0]) { my $TestText = $text; $Test +Text =~ s/\s*,\s*/,/g; } print time - $T;' 1000000 </tmp/text.file
        This should correct all bugs.
Re: Comparing Regular Expressions
by ig (Vicar) on Aug 31, 2009 at 21:54 UTC

    You might also consider

    $text =~ s/\s+,\s*|,\s+/,/g;

    The regular expression engine is evolving. You might get quite different relative performance, depending on the version of perl you are using.

Re: Comparing Regular Expressions
by Anonymous Monk on Aug 31, 2009 at 21:59 UTC
    You could try this
    #!/usr/bin/perl -- use strict; use warnings; use Benchmark 'cmpthese'; { my $small = join'', map { ' ' x (int rand 2) . ','. (int rand 2) +.$_ } 0 .. 1_000; cmpthese(-3,{ 'small1' => sub { my $f = $small; $f =~ s/\s+,\s+|\s+,|,\s+/,/ +g; }, 'small2' => sub { my $f = $small; $f =~ s/\s*,\s*/,/g; }, }); undef $small; } print "\n"; { # 1_000_000 was too much for my machine my $bigg = join'', map { ' ' x (int rand 2) . ','. (int rand 2). +$_ } 0 .. 200_000; cmpthese(-3,{ 'bigg1' => sub { my $f = $bigg; $f =~ s/\s+,\s+|\s+,|,\s+/,/g; + }, 'bigg2' => sub { my $f = $bigg; $f =~ s/\s*,\s*/,/g; }, }); undef $bigg; } __END__ perl 5.10.1 Rate small1 small2 small1 365/s -- -69% small2 1171/s 221% -- Rate bigg1 bigg2 bigg1 1.27/s -- -77% bigg2 5.41/s 327% -- perl 5.8.8 Rate small1 small2 small1 785/s -- -51% small2 1587/s 102% -- Rate bigg1 bigg2 bigg1 2.59/s -- -62% bigg2 6.86/s 165% --

      Question: Are both versions of perlrunning in the same box? If so, 5.8.8 seems to be faster that 5.10.1 for both regexps!

      From these results, s/\s*,\s*/,/ is faster than the longer one. Backtracking is better than alternations? Replacing always is better than selective ones?

Re: Comparing Regular Expressions
by JavaFan (Canon) on Sep 01, 2009 at 00:39 UTC
    When writing efficient regular expressions, one of the most important rules is: reduce the number of times the regexp engine starts a match, and then has to backtrack. Another important rule is: let the optimizer do the work. These rules may conflict. ;-)

    \s*,\s* offers a lot of opportunity for backtracking. In principle, the engine can start matching at any place - although the optimizer will prevent that from happening. But still, it will match at places where there's a comma without a surrounding space. The longer regexp will only start matching at spaces and commas. However, the alteration may make it harder for the optimizer.

    So, the only way to be sure it to benchmark. Using data (and perl version) that you'll be working on for real. As the data may actually decide which one is better.

Re: Comparing Regular Expressions
by vitoco (Pilgrim) on Sep 01, 2009 at 15:44 UTC

    Thanks to everyone.

    I did my own benchmark based on your suggestions, and found the following:

    #!perl -w use strict; use warnings; use Benchmark 'cmpthese'; my $test = join '', map { ' ' x (0 == int rand 9) . ',' . ' ' x ((0 = += int rand 3) * int rand 3) . $_ } 0 .. 1_000; cmpthese(-10,{ 'test1' => sub { my $f = $test; $f =~ s/\s+,\s+|\s+,|,\s+/,/g; }, 'testA' => sub { my $f = $test; $f =~ s/\s+,\s+/,/g; $f =~ s/\s+,/,/ +g; $f =~ s/,\s+/,/g; }, 'test2' => sub { my $f = $test; $f =~ s/\s+,\s*|,\s+/,/g; }, 'testB' => sub { my $f = $test; $f =~ s/\s+,\s*/,/g; $f =~ s/,\s+/,/ +g; }, 'test3' => sub { my $f = $test; $f =~ s/\s*,\s*/,/g; }, });

    Results using ActivePerl 5.10.0 on WinXP-SP3:

    1_000: Rate test1 test2 test3 testA testB test1 974/s -- -22% -56% -66% -71% test2 1256/s 29% -- -43% -56% -63% test3 2221/s 128% 77% -- -22% -35% testA 2852/s 193% 127% 28% -- -16% testB 3412/s 250% 172% 54% 20% -- 10_000: Rate test1 test2 test3 testA testB test1 79/s -- -24% -62% -72% -77% test2 103/s 31% -- -50% -63% -70% test3 205/s 161% 99% -- -26% -39% testA 277/s 252% 169% 35% -- -18% testB 339/s 331% 228% 65% 22% -- 100_000: Rate test1 test2 test3 testA testB test1 6.5/s -- -26% -67% -76% -81% test2 8.9/s 36% -- -55% -68% -74% test3 19.6/s 200% 122% -- -28% -42% testA 27.3/s 319% 209% 39% -- -19% testB 33.6/s 414% 279% 71% 23% --

    Results are listed from slower to faster regexp(s).

    Tests 1 and 3 are regexps from my original question, and the intermediate 2's regexp was supplied by ig. A and B are 1 and 2 respectively, but splitted on alternations (replace in phases).

    Possible quick conclusions:

    1. It's better to use backtracking than alternations.
    2. It's better to split a regexp with an alternation whenever is possible.
    3. Itīs better to replace a pattern by the same thing instead of doing some effort to avoid that.

    Comments?

      I wonder why testA consists of three substitutions. Don't the latter two cover all cases already?

        Good point! I just wanted to compare equivalent substitutions.

        Adding testC:

        'testC' => sub { my $f = $test; $f =~ s/\s+,/,/g; $f =~ s/,\s+/,/g; +}, 100_000: Rate test1 test2 test3 testA testC testB test1 6.5/s -- -28% -66% -76% -80% -81% test2 9.0/s 39% -- -53% -67% -73% -73% test3 19.1/s 196% 113% -- -29% -42% -43% testA 27.0/s 318% 201% 41% -- -18% -19% testC 33.1/s 412% 269% 73% 23% -- -1% testB 33.3/s 416% 272% 74% 23% 1% --

        C is 23% faster than A, even running twice over some of the commas in the test string. B doesn't touch a comma twice, but is only 1% faster than C.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://792470]
Approved by toolic
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (11)
As of 2014-07-31 15:56 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (249 votes), past polls