
Re: Idiomatic optimizations

by thelenm (Vicar)
on Apr 29, 2002 at 18:00 UTC (#162881)

in reply to Idiomatic optimizations

Ever since I read in Mastering Regular Expressions that perl makes a copy of the base string when doing a case-insensitive match, I've tried to use character classes instead of /i.

... Before submitting this post, though, I decided to actually benchmark some variations to see whether character classes were faster. To my surprise, it turns out that /i is about 50% faster in the test I used:

    use strict;
    use Benchmark qw(cmpthese);

    my $foo = "abcdefghijklmnopqrstuvwxyz"x500;
    my $re = "[Aa][Bb][Cc]";

    cmpthese(1000000, {
        'i'       => sub { $foo =~ /abc/ig },
        'chars'   => sub { $foo =~ /[Aa][Bb][Cc]/og },
        'charvar' => sub { $foo =~ /$re/og },
    });
yielding these results on my machine:
    Benchmark: timing 1000000 iterations of chars, charvar, i...
         chars:  2 wallclock secs ( 1.97 usr +  0.00 sys =  1.97 CPU) @ 507614.21/s (n=1000000)
       charvar:  3 wallclock secs ( 2.04 usr + -0.01 sys =  2.03 CPU) @ 492610.84/s (n=1000000)
             i:  1 wallclock secs ( 1.31 usr +  0.00 sys =  1.31 CPU) @ 763358.78/s (n=1000000)

                Rate charvar   chars       i
    charvar 492611/s      --     -3%    -35%
    chars   507614/s      3%      --    -34%
    i       763359/s     55%     50%      --
Results are similar for strings of various lengths. So was Mastering Regular Expressions incorrect, or has the problem just been fixed since it was written?

Re: Re: Idiomatic optimizations
by samtregar (Abbot) on Apr 30, 2002 at 07:59 UTC
    The problem with //i isn't (wasn't?) that it's slower on small strings. It's that it uses twice the memory of an equivalent character class. And when you start matching against huge strings, that can really make a difference. Try your example against a 50MB string and I think you'll see what I mean. If not, you can justly castigate me for being too lazy to test my own assertions.

    Eagerly awaiting the second edition,

      Always pass references not data structures, hence the \ operator is an optimisation: sub(\@array) instead of sub(@array)
      Only use what you need from modules: use CGI qw(:standard);
      Also I like shortcut operators:
      my $i ||=0 ; my $i =shift || 0;
      I also like the ? operator instead of ifs.
      Is !~ an optimisation over just negating the result of =~? I dunno, but I think !~ looks better.
        Just a few remarks (though I've got a sneaking suspicion I'm correcting typos here):

        > my $i ||=0 ;

        Here $i will always turn out to be 0 (because my declares a fresh, undefined variable), so my $i=0; is more efficient.

        > $i?$i=1:$i=0;
        How about $i=$i?1:0; - that's also a bit more readable. (at least to my eyes).
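A minimal sketch of the two remarks above. The helper names with_default and to_flag are hypothetical, made up for illustration only. It shows that ||= only earns its keep when the variable might already hold a value, and that the 1-or-0 normalization reads cleanly as a single assignment.

```perl
use strict;
use warnings;

# A freshly declared lexical is undef, so ||= always assigns the default here.
my $i ||= 0;

# ||= is useful when the variable may already hold a value.
# with_default() is a hypothetical helper, not from the thread.
sub with_default {
    my $n = shift;
    $n ||= 0;              # undef, '' and 0 all end up as 0
    return $n;
}

# Normalize any value to 1 or 0 with one assignment-style expression.
# to_flag() is also hypothetical.
sub to_flag {
    my $v = shift;
    return $v ? 1 : 0;
}

print "$i\n";
print with_default(undef), " ", with_default(5), "\n";
print to_flag("yes"), " ", to_flag(""), "\n";
```

Note that ||= tests truth, not definedness, so a legitimate 0 or empty string is replaced by the default as well.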


        Don't forget that ?: can get dangerous, not unlike juggling running chainsaws. It's a great show, but you're liable to injure yourself something fierce:
        $foo = $a? $b? $c : $d? $e : $f : $g : $h;
        Sometimes an if is more verbose, but undeniably precise.

        Instead of getting carried away with ?:, you can sometimes compact it using the regular logical operators || and &&. It really depends on what you're working with.
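As a rough illustration of both suggestions (the subs pick_name and classify are hypothetical, invented for this sketch): a simple "value or fallback" ternary collapses to ||, and a chain of nested conditions is spelled out with if/elsif.

```perl
use strict;
use warnings;

# pick_name() is a hypothetical helper for illustration.
sub pick_name {
    my ($given, $fallback) = @_;
    # Same result as: $given ? $given : $fallback
    return $given || $fallback;
}

# classify() is hypothetical too: a nested ?: written as an if/elsif chain.
sub classify {
    my $n = shift;
    if    ($n < 0)  { return "negative" }
    elsif ($n == 0) { return "zero" }
    else            { return "positive" }
}

print pick_name("", "anonymous"), "\n";
print pick_name("alice", "anonymous"), "\n";
print classify(-3), " ", classify(0), " ", classify(7), "\n";
```

The || form treats 0 and the empty string as "missing", so it only fits when those are not valid values.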


        Puh-lease, use some whitespace! Here are some alternatives:

        $i ? $i = 1 : $i = 0;
        $i = $i ? 1 : 0;
        $i = !!$i || 0;

        $foo = $foo ? 1 : 0; # Single-letter variable names:
                             # easy to type, hard to read

        > Always pass references not data structures

        Most references are the root of data structures, so I think you meant "Always pass references instead of flattened hashes or lists". Note that you can't use this if the sub in question doesn't expect it.
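A small sketch of the difference, with hypothetical subs count_flat and count_refs. Passing two arrays directly flattens them into one list in @_, while passing references keeps them separate, provided the sub is written to expect references.

```perl
use strict;
use warnings;

# Two arrays passed directly are flattened into a single list in @_.
# count_flat() is a hypothetical helper.
sub count_flat {
    my @all = @_;
    return scalar @all;
}

# With references, the sub receives two scalars and can tell the arrays apart.
# count_refs() is a hypothetical helper that expects two array references.
sub count_refs {
    my ($first, $second) = @_;
    return (scalar @$first, scalar @$second);
}

my @a = (1, 2, 3);
my @b = (4, 5);

print count_flat(@a, @b), "\n";
print join(",", count_refs(\@a, \@b)), "\n";
```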

        - Yes, I reinvent wheels.
        - Spam: Visit eurotraQ.

      Nah, no castigation here. When I whipped up my test, I had thought that the alphabet x 500 was a pretty big string, but now that I'm thinking clearly that's not very big at all.

      To test out a really big string, I replicated Romeo and Juliet 500 times, read the whole thing into a string, then ran almost the same regular expressions. I removed /o from the 'chars' sub, which actually made it a little faster. The string was about 70 MB. Here is my new test code:

          use strict;
          use Benchmark qw(cmpthese);

          local $/ = undef;
          open IN, "romeo-and-juliet-500-times.txt";
          my $text = <IN>;
          close IN;

          # Ten iterations is enough with a 70 MB string!
          cmpthese(10, {
              'i'     => sub { $text =~ /abc/ig },
              'chars' => sub { $text =~ /[Aa][Bb][Cc]/g },
          });
      To my surprise (again!), the /i version ran in about 1/3 the time of the character-class version. Here is the output on my machine:
          Benchmark: timing 10 iterations of chars, i...
               chars: 40 wallclock secs (38.37 usr +  0.04 sys = 38.41 CPU) @  0.26/s (n=10)
                   i: 12 wallclock secs (11.43 usr +  0.01 sys = 11.44 CPU) @  0.87/s (n=10)

                s/iter chars     i
          chars   3.84    --  -70%
          i       1.14  236%    --
      I'm amazed. Am I not testing the right thing? Or has /i really been cleaned up in recent versions of Perl? I'm running 5.6.1.
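One thing that might be worth checking, offered as a sketch rather than a diagnosis: in scalar context a //g match finds only the next match after pos(), so the benchmark subs may not scan the whole string on every call. Forcing list context makes each pattern walk the entire string and also confirms that both patterns find the same matches. The small $text below is a stand-in assumption, not the 70 MB file from the test.

```perl
use strict;
use warnings;

# Small stand-in string; the thread used a ~70 MB file instead.
my $text = "abc ABC Abc xyz" x 1000;

# Assigning to an empty list forces list context, so //g returns every
# match, and the outer scalar assignment receives the number of matches.
my $count_i     = () = $text =~ /abc/ig;
my $count_chars = () = $text =~ /[Aa][Bb][Cc]/g;

print "$count_i $count_chars\n";
```

Wrapping these counting expressions in the cmpthese subs would make every iteration do a full scan with either pattern.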
