http://www.perlmonks.org?node_id=638552

dragonchild has asked for the wisdom of the Perl Monks concerning the following question:

I have two very long (>64k) strings of equal lengths - $s1 and $s2. They are strings of bytes, meaning that any value from chr(0) to chr(255) is legal. $s2, however, will not have any chr(0). $s1 may or may not have any. What I need to do is look at each byte in $s1 and if it is chr(0), replace it with the corresponding byte in $s2. So, something like the following code:
sub foo { my ($s1, $s2) = @_; my @s1 = split //, $s1; my @s2 = split //, $s2; foreach my $idx ( 0 .. $#s1 ) { if ( $s1[$idx] eq chr(0) ) { $s1[$idx] = $s2[$idx]; } } return join '', @s1; }
foo() could return the resulting string or it could modify $s1 in place. If foo() returns $s1, I'm going to be doing $s1 = foo( $s1, $s2 ); in all cases.

Here's what I've got so far, including Benchmark harness. Whoever comes up with the fastest version earns a meter of beer from me whenever we see each other.

#!/usr/bin/perl use 5.6.0; use strict; use warnings FATAL => 'all'; use Benchmark qw( cmpthese ); my $s1 = join '', (do_rand(1) x 100_000); my $s2 = join '', (do_rand(0) x 100_000); cmpthese( -2, { 'split1' => sub { my $s3 = split1( $s1, $s2 ) }, 'substr1' => sub { my $s3 = substr1( $s1, $s2 ) }, }); sub split1 { my ($s1, $s2) = @_; my @s1 = split //, $s1; my @s2 = split //, $s2; foreach my $idx ( 0 .. $#s1 ) { if ( $s1[$idx] eq chr(0) ) { $s1[$idx] = $s2[$idx]; } } return join '', @s1; } sub substr1 { my ($s1, $s2) = @_; for my $idx ( 0 .. length($s1) ) { if ( substr($s1,$idx,1) eq chr(0) ) { substr($s1, $idx, 1) = substr($s2, $idx, 1); } } return $s1; } # This makes sure that $s1 has chr(0)'s in it and $s2 does not. sub do_rand { my $n = (shift) ? int(rand(255)) : int(rand(254)) + 1; return chr( $n ); } __END__
Update: It looks like there is a 2-way tie between avar and moritz. I went ahead and wrote an in-place version of moritz's code. Thanks to SuicideJunkie for fixing my stupidity in the test data. The script now looks like:
#!/usr/bin/perl use 5.6.0; use strict; use warnings FATAL => 'all'; #use Test::More no_plan => 1; use Benchmark qw( cmpthese ); my $s1 = do_rand(0, 100_000); my $s2 = do_rand(1, 100_000); my $expected = split1( \$s1, \$s2 ); cmpthese( -3, { 'avar2' => sub { my $s3 = $s1; avar2( \$s3, \$s2 ); # is( $s3, $expected, "avar2" ); }, 'moritz' => sub { my $s3 = $s1; moritz( \$s3, \$s2 ); # is( $s3, $expected, "moritz" ); }, }); sub split1 { my ($s1, $s2) = @_; my @s1 = split //, $$s1; my @s2 = split //, $$s2; foreach my $idx ( 0 .. $#s1 ) { if ( $s1[$idx] eq chr(0) ) { $s1[$idx] = $s2[$idx]; } } $$s1 = join '', @s1; } sub avar2 { my ($s1, $s2) = @_; use bytes; $$s1 =~ s/\0/substr $$s2, pos($$s1), 1/eg; } sub moritz { my ($s1, $s2) = @_; my $pos = 0; while ( 0 < ( $pos = index $$s1, "\000", $pos ) ) { substr( $$s1, $pos, 1 ) = substr( $$s2, $pos, 1 ); } } sub do_rand { my ($min, $len) = @_; my $n = ""; for (1 .. $len) { $n .= chr( rand(255-$min)+$min ) } return $n; } __END__
I'm going to keep it open until 24 hours have passed from the initial posting of this node. If no-one gets any faster, both moritz and avar have a meter of beer from me.

My criteria for good software:
  1. Does it work?
  2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

Replies are listed 'Best First'.
Re: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
by avar (Beadle) on Sep 12, 2007 at 14:19 UTC
    This offloads the work to the regex engine and is *much* faster:
    sh-3.1$ perl benchmark.pl Rate split1 substr1 subst split1 5.05/s -- -82% -100% substr1 27.7/s 449% -- -99% subst 3551/s 70273% 12719% -- ok 1 ok 2 1..2
    The function, could be done in-place which would be even faster:
    sub subst { my ($s1, $s2) = @_; my $s3 = $s1; { use bytes; $s3 =~ s/(\0)/substr $s2, $+[0]-1, 1/eg; } $s3; }
    The complete benchmark file (now with tests):
    #!/usr/bin/perl use 5.6.0; use strict; use warnings FATAL => 'all'; use Benchmark qw( cmpthese ); use Test::More 'no_plan'; my $s1 = join '', (do_rand(1) x 100_000); my $s2 = join '', (do_rand(0) x 100_000); cmpthese( -2, { 'split1' => sub { my $s3 = split1( $s1, $s2 ) }, 'substr1' => sub { my $s3 = substr1( $s1, $s2 ) }, 'subst' => sub { my $s3 = subst($s1, $s2) }, }); my $s30 = split1( $s1, $s2 ); my $s31 = substr1( $s1, $s2 ); my $s32 = subst( $s1, $s2 ); is($s30, $s31); is($s31, $s32); sub split1 { my ($s1, $s2) = @_; my @s1 = split //, $s1; my @s2 = split //, $s2; foreach my $idx ( 0 .. $#s1 ) { if ( $s1[$idx] eq chr(0) ) { $s1[$idx] = $s2[$idx]; } } return join '', @s1; } sub substr1 { my ($s1, $s2) = @_; for my $idx ( 0 .. length($s1) ) { if ( substr($s1,$idx,1) eq chr(0) ) { substr($s1, $idx, 1) = substr($s2, $idx, 1); } } return $s1; } sub subst { my ($s1, $s2) = @_; my $s3 = $s1; { use bytes; $s3 =~ s/(\0)/substr $s2, $+[0]-1, 1/eg; } $s3; } # This makes sure that $s1 has chr(0)'s in it and $s2 does not. sub do_rand { my $n = (shift) ? int(rand(255)) : int(rand(254)) + 1; return chr( $n ); }
      This is the in-place version:
      Rate split1 substr1 subst subst_i +nplace split1 2.93/s -- -72% -100% + -100% substr1 10.3/s 253% -- -99% + -99% subst 901/s 30682% 8623% -- + -24% subst_inplace 1179/s 40176% 11313% 31% + -- ok 1 ok 2 ok 3 1..3
      code:
      sub subst_inplace { my ($s1, $s2) = @_; use bytes; $$s1 =~ s/(\0)/substr $s2, $+[0]-1, 1/eg; }

        Drop the unneeded capturing parens for a likely boost in speed. Someone should add "use bytes" to ikegami's tr/// version also.

        - tye        

Re: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
by Corion (Patriarch) on Sep 12, 2007 at 14:16 UTC

    I prefer to let the C part of Perl to do the chore of the work. Especially, if you can chuck work down to the regex engine. Or, even better, if you can pretend to use the regex engine but don't fire it up but get a simple character scan to work for you:

    I split $s1 on \0 and then join it back with the corresponding substrings taken from $s2. That reduces the number of comparisons and substrings done in Perl.

    sub map_split { my $ofs = 0; join "", map { $ofs += length; $_ => substr $s2, $ofs++, 1 } split + /\0/, $s1, -1; };

    On this machine I get:

    Rate split1 substr1 using_str_bit_ops_and_ +tr map_split split1 1.85/s -- -94% -10 +0% -100% substr1 32.9/s 1676% -- -9 +5% -98% using_str_bit_ops_and_tr 714/s 38390% 2067% +-- -62% map_split 1898/s 102263% 5663% 16 +6% --

    Update: Weird - I would have imagined ikegami's method of using the binary AND of strings to be way faster than mine. And it's even faster than avar's, even though I think that avar's method does far less copying of data (and the use of @+ is far more elegant than my approach).

      I was curious - my "goal" when jotting down the code was to use the minimal number of Perl ops and to let Perl do as much work as possible in the C parts of Perl. So I returned a list from map instead of returning a string:

      This returns the list and joins all the fragments later:

      sub map_split { my $ofs = 0; join "", map { $ofs += length; $_ => substr $s2, $ofs++, 1 } split + /\0/, $s1, -1; };

      This does a partial fragment concatenation in the map and returns a single string, but still does the concatenation later, but of half as many elements - one additional op in the map call:

      sub map_split_join { my $ofs = 0; join "", map { $ofs += length; $_ . substr $s2, $ofs++, 1 } split +/\0/, $s1, -1; };

      I didn't expect the difference to be that big:

      Rate map_split_join subst map_split map_split_join 1697/s -- -7% -13% subst 1828/s 8% -- -6% map_split 1949/s 15% 7% --

      So, I can only recommend reading/listening to When Perl Is Not Quite Fast Enough. Undoubtedly, there is still lots of potential in optimizing this.

      Since there are not so many bytes to substitute it pays off to search for these rare cases instead of working "blindly" with bit ops.

      Creating $mask takes one pass through the string, and then there are three binary bit ops on the string(s). That makes four passes, while only 1/256 of the bytes being zero bytes.

      Thus a solution that actually looks for zeros (one pass) may have an overhead of 3*256 compared to the bit operations to be still as fast. (Rough estimation only)

Re: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
by moritz (Cardinal) on Sep 12, 2007 at 14:12 UTC
    Try the index function for finding the zero bytes in $s1.

    (udpate: fixed typo)

    Another update: now with complete code:

    sub windex { my ($s1, $s2) = @_; my $pos = -1; my $null = chr 0; my $len = length $s1; while( ($pos = index $s1, $null, $pos+1) >= 0 ){ substr $s1, $pos, 1, substr($s2, $pos, 1); } return $s1; }

    It's fast, but it won't win the beer:

    Rate split1 substr1 index split1 4.32/s -- -84% -100% substr1 27.7/s 541% -- -99% index 3757/s 86900% 13478% --

    But don't bother, I don't like beer.. ;-)

    Another update: fixed initial value of $pos. See Re: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer) and the follow ups for more details.

      I know I'm late in this thread but wouldn't bother if
      I won't win ;-)
      I wrote the thing in assembler (for IA32, works
      on gnu/gcc and on win32/msvc) ==> by_asm
      Results: [Athlon64/3200+ Win32/588/MsVC6] split1 1.37/s ikegami1 30.6/s substr1 31.4/s avar2 572/s mrm_6 603/s corion 630/s avar2_pos 745/s ikegami2 814/s ikegami3 988/s bart 995/s ikegami4 1378/s mrm_3 1596/s mrm_1 1746/s moritz 1759/s mrm_4 2007/s mrm_5 2695/s by_asm 7736/s [AthlonXP/2500+ Linux/588/gcc4] split1 3.25/s ikegami1 19.3/s substr1 22.2/s avar2 437/s mrm_6 468/s corion 483/s ikegami2 500/s ikegami3 567/s bart 577/s avar2_pos 605/s ikegami4 1102/s mrm_3 1129/s mrm_1 1527/s moritz 1562/s mrm_4 1683/s mrm_5 1818/s by_asm 5618/s [Core2q Q6600@3GHz Linux/588/gcc4] split1 7.62/s substr1 48.3/s ikegami1 70.1/s mrm_6 1129/s avar2 1481/s corion 1627/s avar2_pos 2535/s mrm_3 2575/s ikegami2 3158/s moritz 3188/s mrm_1 3338/s bart 3511/s ikegami3 3519/s ikegami4 3676/s mrm_5 3791/s mrm_4 4278/s by_asm 4524/s
      Note how the core2 starts to be bound by mem bandwith only!
      here wo go (by_asm is at the end) ==> ------------------------------------------------ use 5.8.0; use strict; use warnings FATAL => 'all'; use Benchmark qw( cmpthese ); use Test::More; # ==================== my $s1 = do_rand(0, 100_000); my $s2 = do_rand(1, 100_000); my $subs = { split1 => sub { split1 ( my $s3 = $s1, $s2 ); $s3 }, substr1 => sub { substr1 ( my $s3 = $s1, $s2 ); $s3 }, moritz => sub { moritz ( my $s3 = $s1, $s2 ); $s3 }, corion => sub { corion ( my $s3 = $s1, $s2 ); $s3 }, ikegami1 => sub { ikegami1 ( my $s3 = $s1, $s2 ); $s3 }, ikegami2 => sub { ikegami2 ( my $s3 = $s1, $s2 ); $s3 }, bart => sub { bart ( my $s3 = $s1, $s2 ); $s3 }, ikegami3 => sub { ikegami3 ( my $s3 = $s1, $s2 ); $s3 }, ikegami4 => sub { ikegami4 ( my $s3 = $s1, $s2 ); $s3 }, avar2 => sub { avar2 ( my $s3 = $s1, $s2 ); $s3 }, avar2_pos => sub { avar2_pos ( my $s3 = $s1, $s2 ); $s3 }, mrm_1 => sub { mrm_1 ( my $s3 = $s1, $s2 ); $s3 }, mrm_3 => sub { mrm_3 ( my $s3 = $s1, $s2 ); $s3 }, mrm_4 => sub { mrm_4 ( my $s3 = $s1, $s2 ); $s3 }, mrm_5 => sub { mrm_5 ( my $s3 = $s1, $s2 ); $s3 }, mrm_6 => sub { mrm_6 ( my $s3 = $s1, $s2 ); $s3 }, by_asm => sub { by_asm ( my $s3 = $s1, $s2 ); $s3 }, }; { plan 'tests' => scalar keys %{$subs}; my $expected; foreach my $subname ( keys %{$subs} ) { my $sub = $subs->{$subname}; if ( defined $expected ) { is( $sub->(), $expected, "$subname gets same value" ); } else { $expected = $sub->(); ok( defined $expected, "$subname gets some value" ); } } print("done.\n"); } cmpthese( -3, $subs ); # ==================== sub split1 { our $s1; local *s1 = \$_[0]; our $s2; local *s2 = \$_[1]; my @s1 = split //, $s1; my @s2 = split //, $s2; foreach my $idx ( 0 .. $#s1 ) { if ( $s1[$idx] eq chr(0) ) { $s1[$idx] = $s2[$idx]; } } $s1 = join '', @s1; } sub substr1 { our $s1; local *s1 = \$_[0]; our $s2; local *s2 = \$_[1]; for my $idx ( 0 .. length($s1) ) { if ( substr($s1, $idx,1) eq chr(0) ) { substr($s1, $idx, 1) = substr($s2, $idx, 1); } } } sub moritz { our $s1; local *s1 = \$_[0]; our $s2; local *s2 = \$_[1]; my $pos = 0; while ( -1 < ( $pos = index $s1, "\000", $pos ) ) { substr( $s1, $pos, 1 ) = substr( $s2, $pos, 1 ); } } sub ikegami1 { our $s1; local *s1 = \$_[0]; our $s2; local *s2 = \$_[1]; (my $mask = $s1) =~ s/[^\x00]/\xFF/g; $s1 = ($s1 & $mask) | ($s2 & ~$mask); } sub ikegami2 { our $s1; local *s1 = \$_[0]; our $s2; local *s2 = \$_[1]; (my $mask = $s1) =~ tr/\x00/\xFF/c; $s1 = ($s1 & $mask) | ($s2 & ~$mask); } sub bart { our $s1; local *s1 = \$_[0]; our $s2; local *s2 = \$_[1]; (my $mask = $s1) =~ tr/\x00/\xFF/c; $s1 = (($s1 ^ $s2) & $mask) ^ $s2; } sub ikegami3 { our $s1; local *s1 = \$_[0]; our $s2; local *s2 = \$_[1]; use bytes; (my $mask = $s1) =~ tr/\x00/\xFF/c; $s1 = (($s1 ^ $s2) & $mask) ^ $s2; } sub ikegami4 { our $s1; local *s1 = \$_[0]; our $s2; local *s2 = \$_[1]; use bytes; (my $mask = $s1) =~ tr/\x00/\xFF/c; $s1 ^= $s2; $s1 &= $mask; $s1 ^= $s2; } sub corion { our $s1; local *s1 = \$_[0]; our $s2; local *s2 = \$_[1]; my $ofs = 0; $s1 = join "", map { $ofs += length; $_ => substr $s2, $ofs++, 1 } +split /\0/, $s1, -1; } sub avar2 { our $s1; local *s1 = \$_[0]; our $s2; local *s2 = \$_[1]; use bytes; $s1 =~ s/(\0)/substr $s2, $+[0]-1, 1/eg; } sub avar2_pos { our $s1; local *s1 = \$_[0]; our $s2; local *s2 = \$_[1]; use bytes; $s1 =~ s/\0/substr $s2, pos($s1), 1/eg; } sub mrm_1 { our $s1; local *s1 = \$_[0]; our $s2; local *s2 = \$_[1]; # from [moritz]'s work use bytes; my $pos = 0; while ( -1 < ( $pos = index $s1, "\x00", $pos ) ) { substr( $s1, $pos, 1 ) = substr( $s2, $pos, 1 ); } } sub mrm_3 { our $s1; local *s1 = \$_[0]; our $s2; local *s2 = \$_[1]; # from moritz's, builds a separate loop of zeros use bytes; my @zeros = (); my $pos = 0; while ( -1 < ( $pos = index $s1, "\x00", $pos+1 ) ) { # ikegam + +i: Added necessary +1 push @zeros, $pos; } for ( @zeros ) { substr( $s1, $_, 1 ) = substr( $s2, $_, 1 ); } } sub mrm_4 { our $s1; local *s1 = \$_[0]; our $s2; local *s2 = \$_[1]; # from [bart]'s vec() { use bytes; my $pos = 0; while ( -1 < ( $pos = index $s1, "\x00", $pos ) ) { vec( $s1, $pos, 8 ) ||= vec( $s2, $pos, 8 ); } } } sub mrm_5 { our $s1; local *s1 = \$_[0]; our $s2; local *s2 = \$_[1]; # from moritz's, seeing if four-arg substr() is faster or slower t +han lvalue substr() use bytes; my $pos = 0; while ( -1 < ( $pos = index $s1, "\x00", $pos ) ) { substr( $s1, $pos, 1, substr( $s2, $pos, 1 ) ); } } sub mrm_6 { our $s1; local *s1 = \$_[0]; our $s2; local *s2 = \$_[1]; # from mrn_5, testing bytes::misc explicitly instead of importing use bytes (); my $pos = 0; while ( -1 < ( $pos = bytes::index( $s1, "\x00", $pos ) ) ) { bytes::substr( $s1, $pos, 1, bytes::substr( $s2, $pos, 1 ) ); } } use Inline C => qq{ // ==> inline void by_asm(SV* has_zeros, SV* no_zeros) { STRLEN spacer1, srclen, dstlen; char *src=SvPV(no_zeros, srclen), *spacer2=0, *dst=SvPV(has_zeros, ds +tlen); if( srclen < dstlen ) croak("block length mismatch!"); #ifdef _MSC_VER _asm mov edi, dst _asm mov esi, src _asm mov ecx, dstlen _asm xor eax, eax _asm cld start: _asm repne scasb _asm jne done _asm mov edx, dstlen _asm sub edx, ecx _asm mov ah, byte ptr [-1+esi+edx] _asm mov byte ptr [-1+edi], ah _asm jmp start done: ; #else __asm__ __volatile__( "xorl %%eax, %%eax \\n\\t" "cld \\n\\t" "start: \\n\\t" "repne \\n\\t" "scasb \\n\\t" "jne done \\n\\t" "movl %[l], %%edx \\n\\t" "subl %%ecx, %%edx \\n\\t" "movb -1(%%esi,%%edx), %%ah \\n\\t" "movb %%ah, -1(%%edi) \\n\\t" "jmp start \\n\\t" "done: \\n\\t" : /* no output reg */ : "S"(src),"D"(dst),"c"(dstlen),[l]"m"(dstlen) ); #endif } // <== inline }; # ==================== sub do_rand { my $min = shift; my $len = shift; { my $n = ""; for (1 .. $len) { $n .= chr( rand(255-$min)+$min ); } redo if $min == 0 && $n !~ /\X00/; return $n; } } ------------------------------------------------
      Too late for the meter of beer I guess ...

      Mirco
Re: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
by ikegami (Patriarch) on Sep 12, 2007 at 14:13 UTC

    Bit ops?

    sub using_str_bit_ops_and_s { my ($s1, $s2) = @_; (my $mask = $s1) =~ s/[^\x00]/\xFF/g; return ($s1 & $mask) | ($s2 & ~$mask); }

    tr should be faster.

    sub using_str_bit_ops_and_tr { my ($s1, $s2) = @_; (my $mask = $s1) =~ tr/\x00/\xFF/c; return ($s1 & $mask) | ($s2 & ~$mask); }

    Benchmarks:

    Rate split1 substr1 bit_ops map_split split1 1.25/s -- -95% -100% -100% substr1 25.5/s 1939% -- -96% -99% bit_ops 642/s 51313% 2421% -- -71% map_split 2230/s 178496% 8657% 247% --

    Corion's solution humbles my ass-kicking solution! I wonder if a higher density of zeros would boost my rating?

    Update: Added tr variant.
    Update: Added Benchmarks.

      Bit ops?
      sub using_str_bit_ops_and_s { my ($s1, $s2) = @_; (my $mask = $s1) =~ tr/\x00/\xFF/c; return ($s1 & $mask) | ($s2 & ~$mask); }
      I was thinking in the same direction, except I'd use xor. After all, ((A xor B ) and 0) xor B is B, and ((A xor B) and true) xor B is A, for any value of A and B.

      So:

      sub using_str_bit_xor { my ($s1, $s2) = @_; (my $mask = $s1) =~ tr/\x00/\xFF/c; return (($s1 ^ $s2) & $mask ) ^ $s2; }

      Benchmarks: (and yes I did use 64k byte strings):

      my $x = pack 'C*', map int rand(256), 1 .. 64*1024; my $y = pack 'C*', map int rand(256), 1 .. 64*1024; using_str_bit_ops_and_s($x, $y) eq using_str_bit_xor($x, $y) or die "Results are different??"; use Benchmark 'cmpthese'; cmpthese -3, { using_str_bit_ops_and_s => sub{ my $r = using_str_bit_ops_and_s($x, +$y) }, using_str_bit_xor=> sub{ my $r = using_str_bit_xor($x, $y) }, };
      Result:
      Rate using_str_bit_ops_and_s using_str_ +bit_xor using_str_bit_ops_and_s 686/s -- + -16% using_str_bit_xor 821/s 20% + --
      The speed gain is humble, but real.
Re: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
by kyle (Abbot) on Sep 12, 2007 at 14:36 UTC

    I've been testing the solutions posted so far along with my own. I post it here just for completeness since others are faster:

    sub kyle { my ($s1, $s2) = @_; my $out = $s1; while ( $s1 =~ m/\000/g ) { my $pos = pos; substr( $out, $pos, 1 ) = substr( $s2, $pos, 1 ); } return $out; }

    I changed the test script slightly so that it checks that the subs in question actually work (they do):

    Since moritz didn't actually post code, I wrote something based on the suggestion:

    sub moritz { my ($s1, $s2) = @_; my $pos = 0; while ( 0 < ( $pos = index $s1, "\000", $pos ) ) { substr( $s1, $pos, 1 ) = substr( $s2, $pos, 1 ); } return $s1; }

    This was the fastest on my machine until avar posted a second version. (cmpthese lines truncated because they're so ugly.)

    split1 5.94/s substr1 41.3/s ikegami_s 62.0/s ikegami_tr 2111/s corion 4342/s kyle 5608/s avar 5635/s moritz 6305/s avar2 6334/s 1..9 ok 1 - corion gets some value ok 2 - substr1 gets same value ok 3 - kyle gets same value ok 4 - split1 gets same value ok 5 - moritz gets same value ok 6 - ikegami_tr gets same value ok 7 - ikegami_s gets same value ok 8 - avar gets same value ok 9 - avar2 gets same value
      Do you have the complete file that produced that handy (with all the implementations). And if so could you post it please?:)

        Sure. This also incorporates the better testing from SuicideJunkie. With that change, it seems that moritz is back on top. The speed of a particular algorithm can be influenced pretty heavily by the input data. That is, it's influenced by what branches it has to take during execution...

Re: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
by SuicideJunkie (Vicar) on Sep 12, 2007 at 14:35 UTC

    There is an important flaw in your test. The entire 100_000 character string is composed of the exact same characters. There ARE NO NULLS most of the time!

    Try using this instead:

    #!/usr/bin/perl use 5.6.0; use strict; use warnings FATAL => 'all'; use Benchmark qw( cmpthese ); my $s1 = do_rand(0,100_000); my $s2 = do_rand(1,100_000); my $nulls = 0; foreach my $idx ( 0 .. length($s1) ) { if ( substr($s1,$idx,1) eq chr(0) ){ $nulls++; } } print "There are $nulls nulls in S1\n"; print "Sample data: [" . substr($s1,2,10) . "]\n"; cmpthese( -2, { 'split1' => sub { my $s3 = split1( $s1, $s2 ) }, 'substr1' => sub { my $s3 = substr1( $s1, $s2 ) }, }); sub split1 { my ($s1, $s2) = @_; my @s1 = split //, $s1; my @s2 = split //, $s2; foreach my $idx ( 0 .. $#s1 ) { if ( $s1[$idx] eq chr(0) ) { $s1[$idx] = $s2[$idx]; } } return join '', @s1; } sub substr1 { my ($s1, $s2) = @_; for my $idx ( 0 .. length($s1) ) { if ( substr($s1,$idx,1) eq chr(0) ) { substr($s1, $idx, 1) = substr($s2, $idx, 1); } } return $s1; } # This makes sure that $s1 has chr(0)'s in it and $s2 does not. sub do_rand { my $min = shift; my $len = shift; my $n = ""; for (1 .. $len) { $n .= chr( rand(255-$min)+$min ) } return $n; } __END__
      There is an important flaw in your test. The entire 100_000 character string is composed of the exact same characters. There ARE NO NULLS most of the time!

      That's only a flaw if the data that the algorithm is meant to work with has a different composition than the test data.

      Since we don't know this, we can only assume that dragonchild provided us with test data that looks like the "real" data.

      Update: I finally understood what you are saying... the testing was flawed, really.

        If that were true, then this would be close to the ultimate function:

        sub supertest { my $s1= shift; if (substr($s1,1,1) ne chr(0)) { return $s1; }else{ return shift; } }

        Seems pretty silly to me for that to be the case ;)

        On the "other" test set, my approach fares far more within my expectations, but here the number of elements returned seems to become significant over the number of string joins:

        Rate split1 substr1 map_split subst map_spl +it_join using_str_bit_ops_and_tr split1 1.74/s -- -94% -100% -100% + -100% -100% substr1 30.5/s 1650% -- -93% -94% + -94% -96% map_split 455/s 26002% 1392% -- -13% + -15% -34% subst 520/s 29765% 1607% 14% -- + -3% -25% map_split_join 534/s 30577% 1653% 18% 3% + -- -23% using_str_bit_ops_and_tr 692/s 39604% 2169% 52% 33% + 29% --

        So, you will need to always benchmark with real data! ;)

      Whoops! :-) Thanks for the correction.

      My criteria for good software:
      1. Does it work?
      2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
by diotalevi (Canon) on Sep 12, 2007 at 19:38 UTC

    [Updated.]

    I've been using Inline lately as a way to eval_xs(). Here are two straight up implementations in C and C++. I first did the C/libc implementation because it was so incredibly straightforward. I followed up with C++ because I wanted to see if I could regain support for unicode on the way. I failed that one - I don't have std::wstring in my g++ and I didn't want to have to look for it. I suspect the C++ stuff is slower than the C because std::string(<char*>...) copies the array.

    I noticed that moritz' code didn't pass tests when it was enabled. An off-by-one error I'm sure.

    # Rate avar2 dio_cpp moritz dio_c #avar2 215/s -- -42% -64% -90% #dio_cpp 374/s 73% -- -38% -82% #moritz 598/s 178% 60% -- -71% #dio_c 2084/s 868% 458% 248% -- # #Later, just the C version with everyone else's # Rate split1 substr1 ikegami_s avar avar2 ikega +mi_tr avar2_pos corion moritz avar2_pos_inplace dio_c #split1 0.994/s -- -88% -91% -99% -99% + -99% -99% -100% -100% -100% -100% #substr1 8.33/s 738% -- -28% -92% -93% + -94% -96% -96% -98% -99% -100% #ikegami_s 11.6/s 1069% 39% -- -89% -90% + -91% -94% -95% -97% -98% -99% #avar 102/s 10168% 1125% 778% -- -9% + -25% -46% -56% -73% -85% -96% #avar2 112/s 11194% 1247% 866% 10% -- + -17% -41% -52% -70% -84% -95% #ikegami_tr 136/s 13548% 1528% 1068% 33% 21% + -- -29% -42% -64% -80% -94% #avar2_pos 191/s 19090% 2189% 1542% 87% 70% + 41% -- -18% -49% -72% -92% #corion 233/s 23354% 2698% 1907% 128% 108% + 72% 22% -- -38% -66% -90% #moritz 374/s 37512% 4387% 3118% 266% 233% + 176% 96% 60% -- -46% -84% #avar2_pos_inplace 691/s 69381% 8188% 5844% 577% 515% + 409% 262% 196% 85% -- -70% #dio_c 2315/s 232762% 27677% 19822% 2168% 1962% +1606% 1113% 893% 519% 235% BEGIN { package diotalevi; local @diotalevi::ISA; my $c_src = <<'XS'; #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include <string.h> #include <string> using namespace std; MODULE = diotalevi PACKAGE = diotalevi void dio_c(dsv_rv, ssv_rv) SV *dsv_rv SV *ssv_rv INIT: SV *dsv; SV *ssv; STRLEN len; char *dpv; char *spv; char *ptr; char *ptr_end; PPCODE: if ( ! ( dsv_rv && ssv_rv && SvRV( dsv_rv ) && SvRV( ssv_rv ) +&& SvPOK(SvRV(dsv_rv)) && SvPOK(SvRV(ssv_rv)))) croak( "Blaaa" ); /* Fetch my SVs */ dsv = SvRV( dsv_rv ); ssv = SvRV( ssv_rv ); /* Fetch my (char*)s. */ dpv = SvPVX( dsv ); spv = SvPVX( ssv ); /* Operate only on the minimum length. */ len = SvCUR( dsv ) < SvCUR( ssv ) ? SvCUR( dsv ) : SvCUR( ssv ); /* Establish bounds. */ ptr = dpv; ptr_end = dpv + len; /* Do it. */ while ( 1 ) { ptr = (char*)memchr( ptr, '\0', ptr_end - ptr ); if ( ! ( ptr && ptr < ptr_end ) ) break; *ptr = *(ptr - dpv + spv); ++ ptr; } if ( G_VOID != GIMME_V ) { XPUSHs( SvRV( ST(0) ) ); } static void dio_cpp( dsv_rv, ssv_rv ) SV * dsv_rv SV * ssv_rv PPCODE: // Dereference the references. SV *dsv = SvRV( dsv_rv ); SV *ssv = SvRV( ssv_rv ); // Get mah lengths const STRLEN lend = SvCUR( dsv ); const STRLEN lens = SvCUR( ssv ); const STRLEN len = min( lend, lens ); // Make mah strings char *tgt = SvPVX( dsv ); const string dstr( tgt, lend ); const string sstr( (char const * const)SvPVX(ssv), lens ); // The Replacements string::size_type offset = 0; while ( 1 ) { offset = dstr.find( '\0', offset ); if ( offset == string::npos ) break; tgt[offset] = sstr[offset]; ++ offset; } if ( G_VOID != GIMME_V ) { XPUSHs( SvRV( ST(0) ) ); } XS Inline->bind( CPP => $c_src, NAME => 'diotalevi', XSMODE => 1, ); *main::dio_c = \&diotalevi::dio_c; *main::dio_cpp = \&diotalevi::dio_cpp; }

    ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

Re: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
by quidity (Pilgrim) on Sep 12, 2007 at 15:10 UTC

    Here is my dirty approach:

    use Inline C => qq{ SV* swap(SV* one, SV* two) { char *buf1, *buf2, *buf3; STRLEN idx, len; SV* ret; if (!SvPOK(one) || !SvPOK(two) || sv_len(one) > sv_len(two)) return newSVsv(&PL_sv_undef); len = sv_len(one); buf1 = SvPVX(one); buf2 = SvPVX(two); buf3 = malloc(len); for (idx=0; idx < len; idx++) { buf3[idx] = buf1[idx] ? buf1[idx] : buf2[idx]; } ret = newSVpv(buf3, len); free(buf3); return ret; } }; # [ your code, but with ] cmpthese( -5, { 'split1' => sub { my $s3 = split1( $s1, $s2 ) }, 'substr1' => sub { my $s3 = substr1( $s1, $s2 ) }, 'inline' => sub { my $s3 = swap($s1, $s2) }, });

    Which gives, er, some improvement...

              Rate  split1 substr1  inline
    split1  4.45/s      --    -88%   -100%
    substr1 37.6/s    746%      --    -99%
    inline  6955/s 156241%  18378%      --
    

    Although you have to make assumptions about your data, like it not being encoded in utf8. You can test for all this as part of the C, if you need to.

    If you need to be this specific about direct byte faffing, always think of Inline::C. Then wrap the dirtyness in a module.

      I'm a really sore^Hthirsty loser. So I wrote an XS function that did the replacement in-place by operating directly on the SV'S char vector. It's faster than your swap function but still slower than my s///-pos stuff, I don't really know why.
      use Inline C => q[ void avar_cee(SV *sv1, SV *sv2) { char *sv1p, *sv2p; STRLEN sv1len, sv2len; SV *sv1_sv; STRLEN i; if (!SvROK(sv1) || !SvPOK((SV*)SvRV(sv1)) || !SvPOK(sv2)) { croak("Usage: avar_cee(\$s1, $s2)"); } sv1_sv = (SV*)SvRV(sv1); sv1p = SvPV(sv1_sv, sv1len); sv2p = SvPV(sv2, sv2len); if (sv1len != sv2len) { croak("The given strings must be of the same length"); } for (i = 0; i < sv2len; i++) { if (sv1p[i] == '\0') { sv1p[i] = sv2p[i]; } } }
      It could be sped up with something like OpenMP by putting #pragma omp paralell for before the for loop but I didn't have gcc 4.2 to test it. I changed the test script so that it wouldn't be unfair to the in-place functions. Previously I was copying the data anyway so that testing would work which defeated the purpose of doing it in-place. Results:
      split1 5.07/s -- -82% -87% -99% -99% -99% + -99% -100% -100% -100% -100% -100% substr1 27.8/s 449% -- -31% -96% -96% -97% + -97% -98% -98% -99% -99% -99% ikegami_s 40.3/s 695% 45% -- -94% -95% -95% + -96% -97% -98% -98% -99% -99% avar 700/s 13700% 2414% 1636% -- -10% -18% + -30% -42% -62% -70% -80% -86% avar2 778/s 15243% 2695% 1830% 11% -- -8% + -22% -35% -58% -67% -78% -85% corion 848/s 16635% 2948% 2005% 21% 9% -- + -15% -29% -54% -63% -76% -83% ikegami_tr 999/s 19609% 3490% 2379% 43% 28% 18% + -- -17% -45% -57% -72% -80% avar2_pos 1201/s 23591% 4215% 2880% 72% 54% 42% + 20% -- -34% -48% -66% -76% moritz 1832/s 36032% 6481% 4445% 162% 135% 116% + 83% 53% -- -21% -49% -64% swap 2324/s 45742% 8250% 5666% 232% 199% 174% + 133% 93% 27% -- -35% -54% avar_c_inplace 3567/s 70267% 12717% 8751% 410% 359% 320% + 257% 197% 95% 53% -- -30% avar2_pos_inplace 5063/s 99781% 18093% 12464% 624% 551% 497% + 407% 322% 176% 118% 42% --
      benchmark.pl:
      #!/usr/bin/perl use 5.6.0; use strict; use warnings FATAL => 'all'; use Benchmark qw( cmpthese ); use Inline C => q[ void avar_c_inplace(SV *sv1, SV *sv2) { char *sv1p, *sv2p; STRLEN sv1len, sv2len; SV *sv1_sv; STRLEN i; if (!SvROK(sv1) || !SvPOK((SV*)SvRV(sv1)) || !SvPOK(sv2)) { croak("Usage: avar_cee(\$s1, $s2)"); } sv1_sv = (SV*)SvRV(sv1); sv1p = SvPV(sv1_sv, sv1len); sv2p = SvPV(sv2, sv2len); if (sv1len != sv2len) { croak("The given strings must be of the same length"); } for (i = 0; i < sv2len; i++) { if (sv1p[i] == '\0') { sv1p[i] = sv2p[i]; } } } SV* swap(SV* one, SV* two) { char *buf1, *buf2, *buf3; STRLEN idx, len; SV* ret; if (!SvPOK(one) || !SvPOK(two) || sv_len(one) > sv_len(two)) return newSVsv(&PL_sv_undef); len = sv_len(one); buf1 = SvPVX(one); buf2 = SvPVX(two); buf3 = malloc(len); for (idx=0; idx < len; idx++) { buf3[idx] = buf1[idx] ? buf1[idx] : buf2[idx]; } ret = newSVpv(buf3, len); free(buf3); return ret; } ]; my $s1 = do_rand(0, 100_000); my $s2 = do_rand(1, 100_000); #my $s1 = "a\0b\0c\0d\0"; #my $s2 = "aXbYcZdA"; my @z = $s1 =~ m/(\0)/g; my $num = @z; warn "\$s1 has $num zeroes"; my $cp0 = $s1; my $cp1 = $s1; my %subs = ( 'split1' => sub { my $s3 = split1( $s1, $s2 ) }, 'substr1' => sub { my $s3 = substr1( $s1, $s2 ) }, # 'kyle' => sub { my $s3 = kyle( $s1, $s2 ) }, 'moritz' => sub { my $s3 = moritz( $s1, $s2 ) }, 'corion' => sub { my $s3 = corion( $s1, $s2 ) }, 'ikegami_s' => sub { my $s3 = ikegami_s( $s1, $s2 ) }, 'ikegami_tr' => sub { my $s3 = ikegami_tr( $s1, $s2 ) }, 'avar' => sub { my $s3 = avar( $s1, $s2 ) }, 'avar2' => sub { my $s3 = avar2( $s1, $s2 ) }, 'avar2_pos' => sub { my $s3 = avar2_pos( $s1, $s2 ) }, 'avar2_pos_inplace' => sub { avar2_pos_inplace( \$cp0, $s2 ); $cp +0 }, 'avar_c_inplace' => sub { avar_c_inplace( \$cp1, $s2 ); $cp1 }, 'swap' => sub { swap( $s1, $s2 ) }, # 'bogus' => sub { "oh noes" }, ); cmpthese( -2, \%subs ); use Test::More; plan 'tests' => scalar keys %subs; my $correct; { my $tmp = $s1; avar2_pos_inplace(\$tmp, $s2); $correct = $tmp; } foreach my $subname ( keys %subs ) { my $sub = $subs{$subname}; ok($sub->() eq $correct, "$subname returned the correct value"); } sub split1 { my ($s1, $s2) = @_; my @s1 = split //, $s1; my @s2 = split //, $s2; foreach my $idx ( 0 .. $#s1 ) { if ( $s1[$idx] eq chr(0) ) { $s1[$idx] = $s2[$idx]; } } return join '', @s1; } sub substr1 { my ($s1, $s2) = @_; for my $idx ( 0 .. length($s1) ) { if ( substr($s1,$idx,1) eq chr(0) ) { substr($s1, $idx, 1) = substr($s2, $idx, 1); } } return $s1; } sub kyle { my ($s1, $s2) = @_; my $out = $s1; while ( $s1 =~ m/\000/g ) { my $pos = pos; substr( $out, $pos, 1 ) = substr( $s2, $pos, 1 ); } return $out; } sub moritz { my ($s1, $s2) = @_; my $pos = 0; while ( 0 < ( $pos = index $s1, "\000", $pos ) ) { substr( $s1, $pos, 1 ) = substr( $s2, $pos, 1 ); } return $s1; } sub ikegami_s { my ($s1, $s2) = @_; (my $mask = $s1) =~ s/[^\x00]/\xFF/g; return ($s1 & $mask) | ($s2 & ~$mask); } sub ikegami_tr { my ($s1, $s2) = @_; (my $mask = $s1) =~ tr/\x00/\xFF/c; return ($s1 & $mask) | ($s2 & ~$mask); } sub corion { my ($s1, $s2) = @_; my $ofs = 0; return join "", map { $ofs += length; $_ => substr $s2, $ofs++, 1 } + split /\0/, $s1, -1; } sub avar { my ($s1, $s2) = @_; my $s3 = $s1; { use bytes; $s3 =~ s/(\0)/substr $s2, $+[0]-1, 1/eg; } $s3; } sub avar2 { my ($s1, $s2) = @_; use bytes; $s1 =~ s/(\0)/substr $s2, $+[0]-1, 1/eg; return $s1; } sub avar2_pos { my ($s1, $s2) = @_; use bytes; $s1 =~ s/\0/substr $s2, pos($s1), 1/eg; return $s1; } sub avar2_pos_inplace { my ($s1, $s2) = @_; use bytes; $$s1 =~ s/\0/substr $s2, pos($$s1), 1/eg; } sub avar2_pos_inplace2 { my ($s1, $s2) = @_; use bytes; $$s1 =~ s/\0/substr $$s2, pos($$s1), 1/eg; } # This makes sure that $s1 has chr(0)'s in it and $s2 does not. sub do_rand { my $min = shift; my $len = shift; my $n = ""; for (1 .. $len) { $n .= chr( rand(255-$min)+$min ); } return $n; } #sub do_rand { # my $n = (shift) ? int(rand(255)) : int(rand(254)) + 1; # return chr( $n ); #} __END__

        You are calculating array offsets over and over. The fast way to do this in C is more like:

        int len; char* pBeg= SvPV(...,len); char* pSrc= pBeg + len; char* pDst= SvPV(....) + len; while( pBeg <= pSrc ) { if( ! *pDst ) { *pDst= *pSrc; } --pSrc; --pDst; }

        - tye        

        I just tested this snippet on gcc 4.2.1 and it had the same speed on a two-processor, multi-core machine as it did without the OpenMP annotations as well as when it was compiled by gcc 3.whatever.

        #pragma omp parallel for for( ; dsvc < dsvc_end; ++dsvc, ++ssvc ) { if ( ! *dsvc ) *dsvc = *ssvc; }

        Just to see if things changed if I simplified the for(;;) loop, this also had no effect. Or perhaps the openmp version got slower. It's been about a 1000 calls slower the last few times I tried it. I'm sure that's just because the loop has to do the pointer math all the time instead of just incrementing like the previous version. OpenMP just isn't getting enabled, I think.

        /* Do it. */ #pragma omp parallel for for( o = 0; o < len; ++o ) { if ( ! *( dsvc + o ) ) *(dsvc+o) = *(ssvc+o); }

        In the below chart I deleted a pile of uninteresting programs and kept only the ones I was testing and things that performed at least as well as my worst candidate.

        Rate moritz_inplace dio_openmp swap dio_noopenmp + avar_tye_c_inplace avar_c_inplace dio_c moritz_inplace 1948/s -- -64% -66% -68% + -69% -73% -83% dio_openmp 5344/s 174% -- -7% -13% + -15% -26% -54% swap 5745/s 195% 8% -- -6% + -9% -20% -50% dio_noopenmp 6124/s 214% 15% 7% -- + -3% -15% -47% avar_tye_c_inplace 6286/s 223% 18% 9% 3% + -- -12% -45% avar_c_inplace 7178/s 269% 34% 25% 17% + 14% -- -38% dio_c 11495/s 490% 115% 100% 88% + 83% 60% --

        ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

Re: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
by demerphq (Chancellor) on Sep 12, 2007 at 18:05 UTC

    You picked the wrong one. moritz'es solution is broken. Its easy to be fast if you don't have to be correct. :-) (Hint: what happens if the first character in the string is \0?)

    Seriously if you need to do this then use one of the XS variants. It will be fastest, however proving that it is might be difficult, especially when working with such large strings. Certainly I wouldnt trust Benchmark.pm with the inline modification variants, especially not the XS ones.

    sub moritz { my ($s1, $s2) = @_; my $pos = 0; while ( 0 < ( $pos = index $$s1, "\000", $pos ) ) { substr( $$s1, $pos, 1 ) = substr( $$s2, $pos, 1 ); } } my $s1="\0x\0z"; my $s2="ABCD"; moritz(\$s1,\$s2); print $s1 ne 'AxCz' && 'not ','ok';

    The code would be correct if it was

    sub moritz { my ($s1, $s2) = @_; my $pos = 0; while ( -1 < ( $pos = index $$s1, "\000", $pos ) ) { substr( $$s1, $pos, 1 ) = substr( $$s2, $pos, 1 ); } }

    But then its possible the benchmark results would change as the original isnt doing *anything* on a string that starts with "\0"

    ---
    $world=~s/war/peace/g

      moritz'es solution is broken.

      Actually, that's my bug. moritz didn't write any code until later but rather just suggested index. I'm the one who didn't think hard enough about how to use it.

        Ah, i just went with the name dragonchild used. My apologies to both of you. :-)

        ---
        $world=~s/war/peace/g

Re: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
by mr_mischief (Monsignor) on Sep 12, 2007 at 18:41 UTC
    Be aware that perl version and platform may matter.

    I have a couple of solutions derived from the work of moritz and avar that show some inconsistency.

    Update: The specific solutions below are incorrect, as dragonchild and ikegami have pointed out to me. The advice about times varying still applies.

    • Windows XP with Strawberry Perl 5.8.8 (Athlon XP 2400+, 1 GB RAM) are faster by about 50% than avar's avar2_pos_inplace for me consistently.
    • Mandriva Linux (x86, Athlon 1000, 512 MB RAM) with perl 5.8.7 they are consistently slower by 10-15%.
    • Using ActiveState 5.8.0 (build 804) on the same code as run against Strawberry shows avar2_pos_inplace beating my solutions by 5-20% which is a much wider range than the Linux perl.
    • v5.8.6 built for cygwin-thread-multi-64int on the XP 2400+ is showing mrm_1 and avar2_pos_inplace in a dead heat, swapping places back and forth between runs.

    In case anyone else wants to try my slight changes:

    sub mrm_1 { my ( $s1, $s2 ) = @_; use bytes; my $pos = 0; while (-1 < ( $pos = index $$s1, '\0', $pos ) ) { substr( $$s1, $pos, 1 ) = substr( $s2, $pos, 1 ); } } sub mrm_3 { my ( $s1, $s2 ) = @_; use bytes; my @zeros = (); my $pos = 0; while ( -1 < ( $pos = index $$s1, '\0', $pos ) ) { push @zeros, $pos; } for ( @zeros ) { substr( $$s1, $_, 1 ) = substr( $s2, $_, 1 ); } }

    Interestingly, building the extra loop of indexes in mrm_3 is within the margin of benchmarking error at least on some perls. Sometimes it beats mrm_1 and sometimes it doesn't. At least once on my Linux box they tied (exactly 1393/s each). mrm_2 is about mid-way through the results, so I didn't bother to show it.

    Perhaps more interesting is that Corion's code works for me most of the time, but sometimes tries to perform a substr() outside the string. I haven't tried to figure out exactly why. That goes to show the debugging price for being too clever.

    Update: fixed the index() error that demerphq points out above. The results seem to be coming up about the same. I'm guessing the chances of the test data have '\0' at index 0 hadn't bit very hard yet.

    Update 2: changed a comparison of the ranges between ActiveState's perl and perl on the Linux machine to state so. It did refer incorrectly to Strawberry.

      Okay, I already updated that a couple of times for corrections, and this is new data, so I'm replying to myself.

      Update: As the above node, the implementations are broken, but pass the tests where they should not have. The theme of the node, that different perl builds are doing drastically different things with the same code, stands.

      I downloaded, compiled, and installed 5.9.5 on my Linux box. I also have a few more tweaks I've tried. Here's some result summaries (the Linux box with 5.9.5 -- the first test listed -- is 30 seconds. The rest are still 2):

      • mrm_3, mrm_4, mrm_5, mrm_1, avar2_pos_inplace, and moritz are the tops in 5.9.5 on my 1Ghz, 512MB RAM Athlon with Mandriva 2006 community edition, in that order. They're only separated by 2%, and I ran this test at cmpthese(-30,...) instead of -2 for extra reliability.
      • Strawberry 5.8.8 has them as mrm_1, mrm_2, mrm_4, mrm_4, avar2_pos_inplace, and moritz.
      • AS 5.8.0 has avar2_pos_inplace, mrm_3, mrm_4, mrm_1, mrm_5, and moritz. It shows avar2_pos_inplace ahead by 5-20% the following place still.
      • cygperl 5.8.6 still shows avar2_pos_inplace in a dead heat with several of the mrm_ solutions. The top five change order on nearly every run. moritz's solution comes in sixth reliably.
      • perl 5.8.7 on the Linux box shows avar2_pos_inplace, mrm_1, mrm_4, mrm_5, mrm_3, then moritz. avar2_pos_inplace varies its lead from 4% to about 14% over mrm_1.

      I should note that moritz's solution is between 50% and 75% slower than the top pure-Perl solution in all of these tests, and the rest of the ones I've tested fall below that.

      I should also note that my Linux 5.8.7 does nearly twice as many iterations per second of every solution (of those faster than about 200 iterations per second anyway) than my 5.9.5 does, so I'm curious as to whether that's a development version thing or if my new perl just isn't compiled with as much optimization as the one that came with the distro. Switching to -O4 from -O2 for optimization and replacing some older x86-family lib references in the makefiles and rebuilding doesn't help much. I'm guessing the devel branch just isn't tuned at the source level as much as the stable branch, which makes sense.

      Here's my code for mrm_4 and mrm_5:

      sub mrm_4 { # from [bart]'s vec() my ($s1, $s2) = @_; use bytes; my $pos = 0; while ( -1 < ( $pos = index $$s1, '\0', $pos ) ) { vec( $$s1, $pos, 8 ) ||= vec( $s2, $pos, 8 ); } } sub mrm_5 { # from moritz's, seeing if four-arg substr() is # faster or slower than lvalue substr() my ( $s1, $s2 ) = @_; use bytes; my $pos = 0; while ( -1 < ( $pos = index $$s1, '\0', $pos ) ) { substr( $$s1, $pos, 1, substr( $s2, $pos, 1 ) ); } }
Re: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
by bart (Canon) on Sep 12, 2007 at 20:31 UTC
    I had the clever idea of using vec. The Perl code is quite beautiful IMO, but it's slow as a snail.
    sub using_vec { my ($s1, $s2) = @_; vec($s1, $_, 8) ||= vec($s2, $_, 8) for 0 .. length($s1)-1; return $s1; } sub using_str_bit_ops_and_s { my ($s1, $s2) = @_; (my $mask = $s1) =~ tr/\x00/\xFF/c; return ($s1 & $mask) | ($s2 & ~$mask); } my $x = pack 'C*', map int rand(256), 1 .. 64*1024; my $y = pack 'C*', map 1+int rand(255), 1 .. 64*1024; using_str_bit_ops_and_s($x, $y) eq using_vec($x, $y) or die "Results a +re different??"; use Benchmark 'cmpthese'; cmpthese -3, { using_str_bit_ops_and_s => sub{ my $r = using_str_bit_ops_and_s($x, +$y) }, using_vec=> sub{ my $r = using_vec($x, $y) }, };
    Results:
    Rate using_vec using_str_bit_o +ps_and_s using_vec 51.8/s -- + -92% using_str_bit_ops_and_s 672/s 1196% + --
    12 times slower than ikegami's code I use as a reference implementation.
      It's not vec() that's slowing this down. According to my measurements, this version is one of the fastest pure-Perl ones so far:
      sub mrm_4 { # from [bart]'s vec() my ($s1, $s2) = @_; use bytes; my $pos = 0; vec( $$s1, $pos, 8 ) ||= vec( $s2, $pos, 8 ) while -1 < ( $pos = i +ndex $$s1, '\0', $pos ); }

      It must be the copying and the for loop that are slowing it down compared to working on the referenced scalar and using index().

        This is very interesting. Your version rates at the same speed as moritz's and avar's once the bug is fixed - it's 0<, not -1<.

        One thing that is odd is that the specific problem I'm working has all of the chr(0)'s in groups of 3. So, I figured that I could use that and change the primary line to:

        vec( $$s1, $pos, 24 ) ||= vec( $$s2, $pos, 24 ) while 0 < ( $pos = index $$s1, chr(0)x3, $pos );
        Except, that slows it down by 20%. If I change it so that the 24's stay, but it goes back to being chr(0) without the x3, it's back to being the same speed. I wonder why that is. I also wonder why the knowledge of being able to work 3 bytes at a time doesn't speed things up at all.

        As for why this wasn't in the problem statement - I wanted to solve the general problem and was willing to pay a meter of beer to see the various solutions. That there's an additional constraint in what I'm actually using the solutions for doesn't change what I was willing to pay a meter of beer for. :-)


        My criteria for good software:
        1. Does it work?
        2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
by avar (Beadle) on Sep 13, 2007 at 02:39 UTC
    I updated the benchmark to include some of the new postings. speed comparison:
    split1 5.09/s moritz_r 26.8/s moritz_r_inplace 26.8/s substr1 27.8/s vec 39.7/s ikegami_s 41.0/s avar 763/s corion 809/s avar_inplace 919/s nijo 932/s ikegami_tr 940/s avar_pos_inplace 1367/s avar_pos 1400/s moritz 1673/s moritz_r_length 1688/s moritz_r_length_inplace 1773/s moritz_inplace 1786/s swap 3208/s avar_c_inplace 3224/s avar_tye_c_inplace 3239/s diotalevi 9907/s
    benchmark.pl:
    #!/usr/bin/perl use 5.6.0; use strict; use warnings FATAL => 'all'; use Benchmark qw( cmpthese ); #use blib; #use Avar; use Inline C => q[ void avar_c_inplace(SV *sv1, SV *sv2) { char *sv1p, *sv2p; STRLEN sv1len, sv2len; SV *sv1_sv; STRLEN i; if (!SvROK(sv1) || !SvPOK((SV*)SvRV(sv1)) || !SvPOK(sv2)) { croak("Usage: avar_cee(\$s1, $s2)"); } sv1_sv = (SV*)SvRV(sv1); sv1p = SvPV(sv1_sv, sv1len); sv2p = SvPV(sv2, sv2len); if (sv1len != sv2len) { croak("The given strings must be of the same length"); } for (i = 0; i < sv2len; i++) { if (sv1p[i] == '\0') { sv1p[i] = sv2p[i]; } } } void avar_tye_c_inplace(SV *sv1, SV *sv2) { char *sv1p, *sv2p; STRLEN sv1len, sv2len; SV *sv1_sv; STRLEN i; if (!SvROK(sv1) || !SvPOK((SV*)SvRV(sv1)) || !SvPOK(sv2)) { croak("Usage: avar_cee(\$s1, $s2)"); } sv1_sv = (SV*)SvRV(sv1); sv1p = SvPV(sv1_sv, sv1len); sv2p = SvPV(sv2, sv2len); if (sv1len != sv2len) { croak("The given strings must be of the same length"); } char *pSrc = sv2p + sv2len; char *pDst = sv1p + sv2len; while (sv2p <= pSrc) { if (!*pDst) { *pDst = *pSrc; } --pSrc; --pDst; } } SV* swap(SV* one, SV* two) { char *buf1, *buf2, *buf3; STRLEN idx, len; SV* ret; if (!SvPOK(one) || !SvPOK(two) || sv_len(one) > sv_len(two)) return newSVsv(&PL_sv_undef); len = sv_len(one); buf1 = SvPVX(one); buf2 = SvPVX(two); buf3 = malloc(len); for (idx=0; idx < len; idx++) { buf3[idx] = buf1[idx] ? buf1[idx] : buf2[idx]; } ret = newSVpv(buf3, len); free(buf3); return ret; } ]; BEGIN { package diotalevi; local @diotalevi::ISA; my $c_src = <<'XS'; #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include <string.h> MODULE = diotalevi PACKAGE = diotalevi PREFIX = diotalevi_ void diotalevi_diotalevi(dsv_rv, ssv_rv) SV *dsv_rv SV *ssv_rv INIT: SV *dsv; SV *ssv; STRLEN len; char *dsvc; char *ssvc; char *ptr; char *ptr_end; CODE: if ( ! ( dsv_rv && ssv_rv && SvRV( dsv_rv ) && SvRV( ssv_rv ) +&& SvPOK(SvRV(dsv_rv)) && SvPOK(SvRV(ssv_rv)))) croak( "Blaaa" ); /* Fetch my SVs */ dsv = SvRV( dsv_rv ); ssv = SvRV( ssv_rv ); /* Fetch my (char*)s. */ dsvc = SvPVX( dsv ); ssvc = SvPVX( ssv ); /* Operate only on the minimum length. */ len = SvCUR( dsv ) < SvCUR( ssv ) ? SvCUR( dsv ) : SvCUR( ssv ); /* Establish bounds. */ ptr = dsvc; ptr_end = dsvc + len; /* Do it. */ while ( 1 ) { ptr = memchr( ptr, '\0', ptr_end - ptr ); if ( ! ( ptr && ptr < ptr_end ) ) break; *ptr = *(ptr - dsvc + ssvc); ++ ptr; } XS Inline->bind( C => $c_src, NAME => 'diotalevi', XSMODE => 1, ); *main::diotalevi = \&diotalevi::diotalevi; } my $s1 = do_rand(0, 100_000); my $s2 = do_rand(1, 100_000); #my $s1 = "a\0b\0c\0d\0"; #my $s2 = "aXbYcZdA"; my @z = $s1 =~ m/(\0)/g; my $num = @z; warn "\$s1 has $num zeroes"; my %subs = ( 'split1' => sub { my $cp = $s1; my $s3 = split1( $s1, $s2 ) }, 'substr1' => sub { my $cp = $s1; my $s3 = substr1( $s1, $s2 ) } +, # 'kyle' => sub { my $cp = $s1; my $s3 = kyle( $s1, $s2 ) }, 'moritz' => sub { my $cp = $s1; my $s3 = moritz( $s1, $s2 ) }, 'moritz_inplace' => sub { my $cp = $s1; moritz_inplace( \$cp, +$s2 ); $cp }, 'moritz_r' => sub { my $cp = $s1; my $s3 = moritz_r( $s1, $s2 +) }, 'moritz_r_inplace' => sub { my $cp = $s1; moritz_r_inplace( \$ +cp, $s2 ); $cp }, 'moritz_r_length' => sub { my $cp = $s1; my $s3 = moritz_r_len +gth( $s1, $s2 ) }, 'moritz_r_length_inplace' => sub { my $cp = $s1; moritz_r_leng +th_inplace( \$cp, $s2 ); $cp }, 'corion' => sub { my $cp = $s1; my $s3 = corion( $s1, $s2 ) }, 'ikegami_s' => sub { my $cp = $s1; my $s3 = ikegami_s( $s1, $s2 ) + }, 'ikegami_tr' => sub { my $cp = $s1; my $s3 = ikegami_tr( $s1, $s2 +) }, 'avar' => sub { my $cp = $s1; my $s3 = avar( $s1, $s2 ) }, 'avar_inplace' => sub { my $cp = $s1; avar_inplace( \$cp, $s2 ); +$cp }, 'avar_pos' => sub { my $cp = $s1; my $s3 = avar_pos( $s1, $s2 ) } +, 'avar_pos_inplace' => sub { my $cp = $s1; avar_pos_inplace( \$cp, + $s2 ); $cp }, 'avar_c_inplace' => sub { my $cp = $s1; avar_c_inplace( \$cp, $s2 + ); $cp }, 'avar_tye_c_inplace' => sub { my $cp = $s1; avar_tye_c_inplace( \ +$cp, $s2 ); $cp }, 'diotalevi' => sub { my $cp = $s1; diotalevi(\$cp, \$s2); $cp }, 'swap' => sub { my $s3 = swap( $s1, $s2 ) }, 'vec' => sub { my $s3 = using_vec( $s1, $s2 ) }, 'nijo' => sub { my $s3 = nijo( $s1, $s2 ) }, ); cmpthese( -2, \%subs ); use Test::More; plan 'tests' => scalar keys %subs; my $correct; { my $tmp = $s1; avar_pos_inplace(\$tmp, $s2); $correct = $tmp; } foreach my $subname ( keys %subs ) { my $sub = $subs{$subname}; ok($sub->() eq $correct, "$subname returned the correct value"); } sub split1 { my ($s1, $s2) = @_; my @s1 = split //, $s1; my @s2 = split //, $s2; foreach my $idx ( 0 .. $#s1 ) { if ( $s1[$idx] eq chr(0) ) { $s1[$idx] = $s2[$idx]; } } return join '', @s1; } sub substr1 { my ($s1, $s2) = @_; for my $idx ( 0 .. length($s1) ) { if ( substr($s1,$idx,1) eq chr(0) ) { substr($s1, $idx, 1) = substr($s2, $idx, 1); } } return $s1; } sub kyle { my ($s1, $s2) = @_; my $out = $s1; while ( $s1 =~ m/\000/g ) { my $pos = pos; substr( $out, $pos, 1 ) = substr( $s2, $pos, 1 ); } return $out; } sub moritz { my ($s1, $s2) = @_; my $pos = 0; while ( -1 < ( $pos = index $s1, "\000", $pos ) ) { substr( $s1, $pos, 1 ) = substr( $s2, $pos, 1 ); } return $s1; } sub moritz_inplace { my ($s1, $s2) = @_; my $pos = 0; while ( -1 < ( $pos = index $$s1, "\000", $pos ) ) { substr( $$s1, $pos, 1 ) = substr( $s2, $pos, 1 ); } } sub moritz_r { my ($s1, $s2) = @_; while ((my $pos = rindex $s1, "\000") != -1) { substr( $s1, $pos, 1 ) = substr( $s2, $pos, 1 ); } return $s1; } sub moritz_r_inplace { my ($s1, $s2) = @_; while ((my $pos = rindex $$s1, "\000") != -1) { substr( $$s1, $pos, 1 ) = substr( $s2, $pos, 1 ); } } sub moritz_r_length { my ($s1, $s2) = @_; my $pos = length $s1; while (($pos = rindex $s1, "\000", $pos) != -1) { substr( $s1, $pos, 1 ) = substr( $s2, $pos, 1 ); } return $s1; } sub moritz_r_length_inplace { my ($s1, $s2) = @_; my $pos = length $$s1; while (($pos = rindex $$s1, "\000", $pos) != -1) { substr( $$s1, $pos, 1 ) = substr( $s2, $pos, 1 ); } } sub ikegami_s { my ($s1, $s2) = @_; (my $mask = $s1) =~ s/[^\x00]/\xFF/g; return ($s1 & $mask) | ($s2 & ~$mask); } sub ikegami_tr { my ($s1, $s2) = @_; (my $mask = $s1) =~ tr/\x00/\xFF/c; return ($s1 & $mask) | ($s2 & ~$mask); } sub corion { my ($s1, $s2) = @_; my $ofs = 0; return join "", map { $ofs += length; $_ => substr $s2, $ofs++, 1 } + split /\0/, $s1, -1; } sub avar_pos { my ($s1, $s2) = @_; use bytes; $s1 =~ s/\0/substr $s2, pos($s1), 1/eg; return $s1; } sub avar_pos_inplace { my ($s1, $s2) = @_; use bytes; $$s1 =~ s/\0/substr $s2, pos($$s1), 1/eg; } sub avar { my ($s1, $s2) = @_; use bytes; $s1 =~ s/\0/substr $s2, $-[0], 1/eg; return $s1; } sub avar_inplace { my ($s1, $s2) = @_; use bytes; $$s1 =~ s/\0/substr $s2, $-[0], 1/eg; } sub using_vec { my ($s1, $s2) = @_; vec($s1, $_, 8) ||= vec($s2, $_, 8) for 0 .. length($s1)-1; return $s1; } sub nijo { use bytes; my ($s1, $s2) = @_; my $idx = 0; my $len; my @chunks = split /\0/, $s1; foreach my $chunk (@chunks) { $len = length($chunk); substr ($s2, $idx, $len) = $chunk; $idx = $idx + 1 + $len; } return $s2; } # This makes sure that $s1 has chr(0)'s in it and $s2 does not. sub do_rand { my $min = shift; my $len = shift; my $n = ""; for (1 .. $len) { $n .= chr( rand(255-$min)+$min ); } return $n; } #sub do_rand { # my $n = (shift) ? int(rand(255)) : int(rand(254)) + 1; # return chr( $n ); #} __END__
Re: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
by NiJo (Friar) on Sep 12, 2007 at 22:30 UTC
    This trivial one is basically Moritz version implemented with 'split'. But it comes up as #3. For some reason 'index' is twice as fast in the rate limiting step.

    Some crazy ideas I won't write code for:
    - misusing internal variables, e. g. $RS
    - grep (most likely a slow 2 pass to get the list)
    - rindex (Moritz backwards)
    - unpack (Instead of split)
    - printf
    - vec
    - ikegami_tr with Bit::Vector
    sub nijo { use bytes; my ($s1, $s2) = @_; my $idx = 0; my $len; my @chunks = split /\0/, $s1; foreach my $chunk (@chunks) { $len = length($chunk); substr ($s2, $idx, $len) = $chunk; $idx = $idx + 1 + $len; } return $s2; }
    Rate split1 substr1 ikegami_s avar corion avar2 nijo ike +gami_tr moritz split1 3.07/s -- -84% -86% -99% -99% -99% -99% + -99% -100% substr1 19.4/s 532% -- -8% -95% -96% -96% -96% + -96% -98% ikegami_s 21.2/s 590% 9% -- -95% -95% -95% -96% + -96% -98% avar 410/s 13254% 2013% 1834% -- -8% -8% -20% + -25% -59% corion 445/s 14383% 2192% 1998% 8% -- -0% -13% + -19% -56% avar2 446/s 14428% 2199% 2004% 9% 0% -- -13% + -19% -56% nijo 514/s 16640% 2549% 2324% 25% 16% 15% -- + -6% -49% ikegami_tr 549/s 17784% 2730% 2490% 34% 23% 23% 7% + -- -46% moritz 1009/s 32749% 5098% 4658% 146% 127% 126% 96% + 84% -- 1..9 ok 1 - corion gets some value ok 2 - substr1 gets same value ok 3 - split1 gets same value ok 4 - moritz gets same value ok 5 - nijo gets same value ok 6 - ikegami_tr gets same value ok 7 - ikegami_s gets same value ok 8 - avar gets same value ok 9 - avar2 gets same value
[aside] Re: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
by graq (Curate) on Sep 13, 2007 at 07:16 UTC
    This is an excellent set of nodes. In fact, I would say that this is the best tutorial on (Perl) benchmarking that I have read.

    -=( Graq )=-

      It's what happens when you have a number of very smart people motivated by the best motivation in the world - peer recognition. (The beer doesn't hurt, either).

      My criteria for good software:
      1. Does it work?
      2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
        If future posters should learn anything from this thread it's that you should always motivate perl developers with interesting problems and free drugs:)
Re: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
by bduggan (Pilgrim) on Sep 13, 2007 at 20:18 UTC
    I've always wanted an opportunity to use Duff's device :
    void bduggan_duff(SV *sv1, SV *sv2) { /* slightly modified avar_c_inplace */ char *sv1p, *sv2p; STRLEN sv1len, sv2len; SV *sv1_sv; STRLEN i; if (!SvROK(sv1) || !SvPOK((SV*)SvRV(sv1)) || !SvPOK(sv2)) { croak("Usage: avar_cee(\$s1, $s2)"); } sv1_sv = (SV*)SvRV(sv1); sv1p = SvPV(sv1_sv, sv1len); sv2p = SvPV(sv2, sv2len); if (sv1len != sv2len) { croak("The given strings must be of the same length"); } i = 0; switch (sv2len % 8) { case 0 : do { if (sv1p[i] == '\0') sv1p[i] = sv2p[i]; i++; case 1 : if (sv1p[i] == '\0') sv1p[i] = sv2p[i]; i++; case 2 : if (sv1p[i] == '\0') sv1p[i] = sv2p[i]; i++; case 3 : if (sv1p[i] == '\0') sv1p[i] = sv2p[i]; i++; case 4 : if (sv1p[i] == '\0') sv1p[i] = sv2p[i]; i++; case 5 : if (sv1p[i] == '\0') sv1p[i] = sv2p[i]; i++; case 6 : if (sv1p[i] == '\0') sv1p[i] = sv2p[i]; i++; case 7 : if (sv1p[i] == '\0') sv1p[i] = sv2p[i]; i++; } while (i<sv2len); } } void bduggan_duff2(SV *sv1, SV *sv2) { /* slightly modified avar_tye_c_inplace */ char *sv1p, *sv2p; + + STRLEN sv1len, sv2len; SV *sv1_sv; STRLEN i; if (!SvROK(sv1) || !SvPOK((SV*)SvRV(sv1)) || !SvPOK(sv2)) { croak("Usage: avar_cee(\$s1, $s2)"); } sv1_sv = (SV*)SvRV(sv1); sv1p = SvPV(sv1_sv, sv1len); sv2p = SvPV(sv2, sv2len); if (sv1len != sv2len) { croak("The given strings must be of the same length"); } char *pSrc = sv2p + sv2len; char *pDst = sv1p + sv2len; switch (sv2len % 8) { case 0 : do { if (!*pDst) { *pDst = *pSrc; } --pSrc; --pDst; case 1 : if (!*pDst) { *pDst = *pSrc; } --pSrc; --pDst; case 2 : if (!*pDst) { *pDst = *pSrc; } --pSrc; --pDst; case 3 : if (!*pDst) { *pDst = *pSrc; } --pSrc; --pDst; case 4 : if (!*pDst) { *pDst = *pSrc; } --pSrc; --pDst; case 5 : if (!*pDst) { *pDst = *pSrc; } --pSrc; --pDst; case 6 : if (!*pDst) { *pDst = *pSrc; } --pSrc; --pDst; case 7 : if (!*pDst) { *pDst = *pSrc; } --pSrc; --pDst; } while (sv2p <= pSrc); }
    Sadly (or maybe not), they can't compete with memchr. With gcc 3.3.3 the first one seems to come in second place. With gcc 4.1.2, the second one does.
    -Brian
Re: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
by mwah (Hermit) on Sep 19, 2007 at 16:37 UTC
    After putting some stuff as Anonymous (#639826)
    I did some research on the topic.

    What's the status of the beer meter anyway?

    OK, there is no way in pure Perl to come even
    close to a tailored inline solution, so if its
    important one has to use it.
    I compiled several snippets from this thread
    into one benchmark ==> http://hobbit.chemie.uni-halle.de/project/meterofbeer/
    and added some results.

    One very interesting outcome for me was the
    revelation of how dead slow the Core2 architecture
    was at repeated assembly string opcodes
    like   repne scasb. By rewriting the scasb by a sequence of "mov"
    it will be blazingly fast on a Core2.

    There seems to be some error in the
    sub "corion" ==>http://hobbit.chemie.uni-halle.de/project/meterofbeer/beerbench.pl
    maybe somebody can fix it.

    This is on a Core2/Q6600@3GHz (more results in the other link):
                 Rate
    split1     7.35/s
    ikegami1   36.6/s
    substr1    45.9/s
    mrm_6      1169/s
    avar2      1536/s
    corion     1662/s
    avar2_pos  2701/s
    mrm_3      2819/s
    ikegami2   3137/s
    bart       3480/s
    ikegami3   3488/s
    mrm_1      3508/s
    ikegami4   3655/s
    moritz     3719/s
    mrm_5      4271/s
    mrm_4      4346/s
    rep_scasb  4495/s
    inline_c  11310/s
    cmp_movb  11563/s
    Interesting stuff!

    Thanks & bye

    Mirco
      After looking through the entries here
      I found one application of memchr() (by diotalevi)

      After researching into this a bit, I
      found out this is by far the fastest
      thing on any tested platform.

      Why is that?

      By looking into the memchr() sources,
      ist can be seen that it's massively optimized
      for DWORD aligned machine word sized
      access into memory.

      memchr() is, by its assembly code, a nice piece
      of optimized code, especially the positional detection
      and extraction of singe characters.

      One can't beat this with a few lines of assembly.

      I wonder if Perl uses the underlying memchr()
      anywhere in its codebase (regex)?

      (I updated the Benchmark sources and results.)

      Regards
      Mirco