http://www.perlmonks.org?node_id=162881


in reply to Idiomatic optimizations

Ever since I read in Mastering Regular Expressions that perl makes a copy of the base string when doing a case-insensitive match, I've tried to use character classes instead of /i.

... Before submitting this post, though, I decided to actually benchmark some variations to see whether character classes were faster. To my surprise, it turns out that /i is about 50% faster in the test I used:

use strict;
use Benchmark qw(cmpthese);

my $foo = "abcdefghijklmnopqrstuvwxyz"x500;
my $re = "[Aa][Bb][Cc]";

cmpthese(1000000, {
    'i'       => sub { $foo =~ /abc/ig },
    'chars'   => sub { $foo =~ /[Aa][Bb][Cc]/og },
    'charvar' => sub { $foo =~ /$re/og },
});
yielding these results on my machine:
Benchmark: timing 1000000 iterations of chars, charvar, i...
     chars:  2 wallclock secs ( 1.97 usr +  0.00 sys =  1.97 CPU) @ 507614.21/s (n=1000000)
   charvar:  3 wallclock secs ( 2.04 usr + -0.01 sys =  2.03 CPU) @ 492610.84/s (n=1000000)
         i:  1 wallclock secs ( 1.31 usr +  0.00 sys =  1.31 CPU) @ 763358.78/s (n=1000000)

            Rate charvar   chars       i
charvar 492611/s      --     -3%    -35%
chars   507614/s      3%      --    -34%
i       763359/s     55%     50%      --
Results are similar for strings of various lengths. So was Mastering Regular Expressions incorrect, or has the problem just been fixed since it was written?

Re: Re: Idiomatic optimizations
by samtregar (Abbot) on Apr 30, 2002 at 07:59 UTC
    The problem with //i isn't (wasn't?) that it's slower on small strings. It's that it uses twice as much memory as an equivalent character class. And when you start matching against huge strings, that can really make a difference. Try your example against a 50MB string and I think you'll see what I mean. If not, you can justly castigate me for being too lazy to test my own assertions.

    Eagerly awaiting the second edition,
    -sam

      Always pass references, not data structures; hence the \ operator is an optimisation: sub(\@array) instead of sub(@array).
      Only import what you need from modules: use CGI qw(:standard);
      Also, I like shortcut operators:
      my $i ||=0 ; my $i =shift || 0;
      I also like the ?: operator instead of ifs:
      $i?$i=1:$i=0;
      Is !~ an optimisation over just negating the result of =~? I dunno, but I think !~ looks better.
        Just a few remarks (though i've got a sneaking suspicion i'm correcting typos here):

        > my $i ||=0 ;

        here $i will always turn out to be 0 (because of the my operator), so my $i=0; is more efficient.

        > $i?$i=1:$i=0;
        How about $i=$i?1:0; - that's also a bit more readable. (at least to my eyes).

        Joost.
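A further reason to prefer the rewritten form: ?: binds more tightly than assignment, so the original one-liner does not do what it appears to. The following is a minimal sketch of the difference; the sub names original_form and suggested_form are made up for illustration.

```perl
use strict;
use warnings;

# ?: binds tighter than assignment, so the original one-liner parses as
#   ($i ? ($i = 1) : $i) = 0;
# and $i ends up as 0 whatever it held before.
sub original_form {
    my ($i) = @_;
    $i ? $i = 1 : $i = 0;
    return $i;
}

# The suggested form: evaluate the conditional, then assign once.
sub suggested_form {
    my ($i) = @_;
    $i = $i ? 1 : 0;
    return $i;
}

for my $value (0, 5) {
    printf "input %d: original gives %d, suggested gives %d\n",
        $value, original_form($value), suggested_form($value);
}
```

For an input of 5 the original form yields 0 and the suggested form yields 1, so the rewrite is a matter of correctness as well as readability.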

        Don't forget that ?: can get dangerous, not unlike juggling running chainsaws. It's a great show, but you're liable to injure yourself something fierce:
        $foo = $a? $b? $c : $d? $e : $f : $g : $h;
        Sometimes an if is more verbose, but undeniably precise.

        Instead of getting carried away with ?:, you can sometimes compact it using the regular logical operators || and &&. It really depends on what you're working with.
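As a rough sketch of that idea, here are two small cases written both ways. The helper names and the 'anonymous' default are hypothetical, chosen only for illustration. Note that || and && test truth in the same way ?: does, so 0 and the empty string count as false in both versions.

```perl
use strict;
use warnings;

# Hypothetical helpers, each written with ?: and with a logical operator.

# Default value: ?: versus ||
sub label_ternary { my ($name) = @_; return $name ? $name : 'anonymous'; }
sub label_logical { my ($name) = @_; return $name || 'anonymous'; }

# Guarded lookup: ?: versus &&
sub first_ternary { my ($list) = @_; return $list ? $list->[0] : undef; }
sub first_logical { my ($list) = @_; return $list && $list->[0]; }

for my $name ('joost', '') {
    printf "ternary: %s, logical: %s\n",
        label_ternary($name), label_logical($name);
}

my $first = first_logical([ 'a', 'b' ]);
print "first item: $first\n";
```

Whether the shorter form is clearer depends on the reader; for anything nested, an if is probably still the safer choice.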

        $i?$i=1:$i=0;

        Puh-lease, use some whitespace! Here are some alternatives:

        $i ? $i = 1 : $i = 0;
        $i = $i ? 1 : 0;
        $i = !!$i || 0;
        $foo = $foo ? 1 : 0; # Single-letter variable names:
                             # easy to type, hard to read

        Always pass references, not data structures

        Most references are the root of data structures, so I think you meant "Always pass references instead of flattened hashes or lists". Note that you can't use this if the sub in question doesn't expect it.
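A small sketch of the difference, using made-up sub names (count_flattened, count_each) and sample data: passing arrays directly flattens them into a single list, while passing references keeps them separate, provided the sub is written to expect references.

```perl
use strict;
use warnings;

# Passing two arrays directly flattens them into one list in @_,
# so the sub cannot tell where the first ends and the second begins.
sub count_flattened {
    my @items = @_;
    return scalar @items;
}

# Passing references keeps each array intact, and only two scalars
# are handed over. The sub must be written to expect references.
sub count_each {
    my ($first, $second) = @_;
    return (scalar @$first, scalar @$second);
}

my @letters = qw(a b c);
my @numbers = (1, 2);

print "flattened: ", count_flattened(@letters, @numbers), "\n";

my ($n_letters, $n_numbers) = count_each(\@letters, \@numbers);
print "by reference: $n_letters and $n_numbers\n";
```

The flattened call sees five items in one list; the by-reference call sees three and two.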

        - Yes, I reinvent wheels.
        - Spam: Visit eurotraQ.
        

      Nah, no castigation here. When I whipped up my test, I had thought that the alphabet x 500 was a pretty big string, but now that I'm thinking clearly that's not very big at all.

      To test out a really big string, I replicated Romeo and Juliet 500 times, read the whole thing into a string, then ran almost the same regular expressions. I removed /o from the 'chars' sub, which actually made it a little faster. The string was about 70 MB. Here is my new test code:

      use strict;
      use Benchmark qw(cmpthese);

      local $/ = undef;
      open IN, "romeo-and-juliet-500-times.txt";
      my $text = <IN>;
      close IN;

      # Ten iterations is enough with a 70 MB string!
      cmpthese(10, {
          'i'     => sub { $text =~ /abc/ig },
          'chars' => sub { $text =~ /[Aa][Bb][Cc]/g },
      });
      To my surprise (again!), the /i version ran in about 1/3 of the time the character-class version took. Here is the output on my machine:
      Benchmark: timing 10 iterations of chars, i...
           chars: 40 wallclock secs (38.37 usr +  0.04 sys = 38.41 CPU) @  0.26/s (n=10)
               i: 12 wallclock secs (11.43 usr +  0.01 sys = 11.44 CPU) @  0.87/s (n=10)

            s/iter chars    i
      chars   3.84    -- -70%
      i       1.14  236%   --
      I'm amazed. Am I not testing the right thing? Or has /i really been cleaned up in recent versions of Perl? I'm running 5.6.1.