Beefy Boxes and Bandwidth Generously Provided by pair Networks Frank
There's more than one way to do things
 
PerlMonks  

Benchmarking "Are all these characters in this sentence?"

by RMGir (Parson)
on Aug 28, 2008 at 22:52 UTC ( #707621=perlmeditation: print w/ replies, xml ) Need Help??

(Edit: Updated based on comments below, and cleaned up. I've broken out the tests from the benchmark, done the tests using Test::More, and factored out the list of test cases.

I've also added a few more tests from tassilo, although some of them failed the tests and are commendted out for now. )

Another Edit: repellent wins!! unpack is the way to go...

_Another_ Another Edit: tye wins! Not building a list is even better... See tye's post. I have updated this post w/ the tye results.

_Another_^3 Edit: BrowserUK wins! Well, after the fix so it isn't so destructive of the testArray.

A variant of BrowserUK's approach that uses substr comes in second, but a long ways behind chop.

_Another_^4 Edit: BrowserUK wins again!

buk2 is by far the fastest, but fails a "0", "0" test, so buk2_len which is a bit slower is the real winner, still 10 times faster than the competition for some test sets.

For REALLY fast routines being benchmarked, copying args is clearly a very bad idea. However, other routines like tye0 to tye2 which were the fastest for a while don't benefit as much from removing the argument copies (see the _opt variants in the results).

_Another_^5 Edit: BrowserUK wins again (again)!

buk3 is by far the fastest and passes all the tests

In check if all certain chars are found in a sentence, tallulah asked for improvements on a routine trying to check whether a given sentence contains all of a set of specified characters.

Many monks suggested solutions, with way more than one way to do it.

Here are the results of some benchmarks on all of the suggested approaches on my Macbook Pro, using the provided 5.8.8 build.

I was quite surprised by the results, actually... I'd have bet good money that the slice approach was going to beat anything else, unless Tanktalus's List::MoreUtils approaches won. I promise, I was going to post this no wonder which implementation won :)

Where a solution was regex based, I also added a variant using "study", to see how that impacted performance.

If anyone wants to add a few test cases then run the tests, invoke the script with a -t argument.

Detailed results behind readmore, to fix Meditations page formatting:

Short                                                                   
                                                    Rate tallulah_OriginalPost
tallulah_OriginalPost                              105/s              --
Tanktalus_AllRegex_Study                           127/s             22%
Tanktalus_AllRegex                                 130/s             24%
varian_hash                                        146/s             40%
RMGir_slice                                        207/s             98%
moritz_BuildRegex_WithStudy                        243/s            133%
moritz_BuildRegex                                  261/s            149%
ysth_loookahead                                    350/s            234%
Tanktalus_AllIndex                                 411/s            293%
unpack_allindex                                    444/s            324%
unpack_allrindex                                   448/s            328%
tassilo_listutils_r                                581/s            455%
tye2                                               655/s            525%
tye1                                               661/s            531%
RMGir_index                                        667/s            537%
repellent_unpack                                   770/s            635%
repellent_unpack_opt                               792/s            657%
tye2_opt                                           807/s            671%
tye1_opt                                           815/s            678%
tye0                                               829/s            692%
tye0_opt                                           830/s            693%
buk_substr                                        1152/s           1000%
buk                                               1829/s           1648%
buk2_len                                          4443/s           4144%
buk2                                              4610/s           4305%
buk3                                              5144/s           4814%
                                                                        
                                                                        
LongShort                                                               
                                                    Rate     varian_hash
varian_hash                                       4.90/s              --
RMGir_slice                                       5.88/s             20%
tallulah_OriginalPost                              168/s           3319%
Tanktalus_AllRegex_Study                           173/s           3422%
Tanktalus_AllRegex                                 209/s           4171%
moritz_BuildRegex_WithStudy                        243/s           4867%
moritz_BuildRegex                                  396/s           7983%
ysth_loookahead                                    575/s          11640%
Tanktalus_AllIndex                                 731/s          14822%
unpack_allrindex                                   799/s          16201%
unpack_allindex                                    821/s          16650%
tassilo_listutils_r                                914/s          18549%
RMGir_index                                       1017/s          20652%
tye2                                              1018/s          20671%
tye1                                              1046/s          21234%
repellent_unpack                                  1158/s          23518%
repellent_unpack_opt                              1207/s          24515%
tye2_opt                                          1386/s          28180%
tye0_opt                                          1399/s          28442%
tye0                                              1414/s          28746%
tye1_opt                                          1440/s          29269%
buk_substr                                        1761/s          35815%
buk                                               2849/s          58012%
buk2_len                                          9489/s         193482%
buk2                                              9686/s         197486%
buk3                                             11005/s         224395%
                                                                        
                                                                        
ShortLong                                                               
                                                    Rate tallulah_OriginalPost
tallulah_OriginalPost                             3.20/s              --
Tanktalus_AllRegex                                4.42/s             38%
Tanktalus_AllRegex_Study                          4.42/s             38%
moritz_BuildRegex_WithStudy                       5.36/s             67%
moritz_BuildRegex                                 5.41/s             69%
varian_hash                                       8.18/s            156%
ysth_loookahead                                   9.26/s            189%
RMGir_index                                       14.4/s            351%
tassilo_listutils_r                               15.9/s            396%
RMGir_slice                                       15.9/s            396%
Tanktalus_AllIndex                                16.0/s            401%
repellent_unpack_opt                              16.8/s            425%
repellent_unpack                                  17.1/s            435%
unpack_allrindex                                  20.0/s            525%
unpack_allindex                                   20.2/s            531%
tye2_opt                                          25.5/s            696%
tye0_opt                                          25.7/s            703%
tye0                                              25.7/s            704%
tye1_opt                                          26.2/s            719%
tye2                                              26.9/s            739%
tye1                                              27.4/s            755%
buk_substr                                        64.0/s           1900%
buk                                               67.9/s           2021%
buk2_len                                           888/s          27649%
buk2                                               939/s          29247%
buk3                                              1109/s          34553%
                                                                        
                                                                        
LongLong                                                                
                                                    Rate tallulah_OriginalPost
tallulah_OriginalPost                             3.20/s              --
Tanktalus_AllRegex_Study                          4.46/s             40%
Tanktalus_AllRegex                                4.50/s             41%
varian_hash                                       4.59/s             43%
moritz_BuildRegex                                 5.41/s             69%
moritz_BuildRegex_WithStudy                       5.41/s             69%
RMGir_slice                                       6.80/s            112%
ysth_loookahead                                   9.40/s            194%
RMGir_index                                       12.9/s            302%
repellent_unpack_opt                              15.1/s            372%
repellent_unpack                                  15.6/s            387%
tassilo_listutils_r                               16.0/s            399%
Tanktalus_AllIndex                                16.0/s            401%
unpack_allindex                                   20.4/s            537%
unpack_allrindex                                  20.6/s            543%
tye1_opt                                          25.9/s            708%
tye0                                              26.0/s            711%
tye2_opt                                          26.0/s            711%
tye0_opt                                          26.2/s            719%
tye2                                              27.3/s            752%
tye1                                              27.3/s            752%
buk_substr                                        63.6/s           1889%
buk                                               67.3/s           2002%
buk2_len                                           852/s          26537%
buk2                                               896/s          27908%
buk3                                              1017/s          31690%
                                                                        
                                                                        
VeryLong                                                                
                                                    Rate     varian_hash
varian_hash                                       8.41/s              --
RMGir_slice                                       10.2/s             21%
tallulah_OriginalPost                              189/s           2152%
Tanktalus_AllRegex_Study                           205/s           2332%
Tanktalus_AllRegex                                 245/s           2810%
moritz_BuildRegex_WithStudy                        346/s           4019%
moritz_BuildRegex                                  508/s           5942%
ysth_loookahead                                    800/s           9411%
Tanktalus_AllIndex                                1090/s          12861%
unpack_allrindex                                  1184/s          13980%
tassilo_listutils_r                               1207/s          14245%
unpack_allindex                                   1230/s          14521%
tye2_opt                                          1267/s          14963%
tye1_opt                                          1268/s          14974%
tye0                                              1279/s          15106%
tye2                                              1280/s          15118%
RMGir_index                                       1280/s          15118%
tye0_opt                                          1292/s          15264%
tye1                                              1297/s          15323%
repellent_unpack                                  1506/s          17801%
repellent_unpack_opt                              1527/s          18058%
buk_substr                                        2634/s          31219%
buk                                               2830/s          33543%
buk2_len                                         24093/s         286343%
buk2                                             25048/s         297687%
buk3                                             28768/s         341917%

I'd be interested in seeing how these numbers change for 5.10, if anyone has an installation handy. The benchmark code is behind the second readmore tag.
#!/usr/bin/perl -w use strict; use List::MoreUtils qw(all); use Benchmark qw(cmpthese); #use Test::More qw(no_plan); require Test::More; my @shortTestCases=( # sentence wantedChars result [ "abxcd zwe rrv", "0", 0 ], [ "0", "0", 1 ], [ "abxcd zwe rrv", "xxv", 1 ], [ "abxcd zwe rrv", "xxvq", 0 ], [ "abxcd zwe rrv", "", 1 ], [ "The quick brown fox jumps over the lazy dog", "abcdefghijklmnopqrstuvwxyz", 1 ], [ "The quick brown fox jumps over the lazy dog", "abcdefghijklmnopqrstuvwxyzT", 1 ], [ "The quick brown fox jumps over the lazy dog", "abcdefghijklmnopqrstuvwxyzTU", 0 ], [ "The quick brown fox jumps over the lazy dog", "a", 1 ], [ "The quick brown fox jumps over the lazy dog", "", 1 ], ); # Long sentence, short wantedChars my @longShortTestCases = ( [ "The quick brown fox jumps over the lazy dog" x 100, "", 1 ], [ "The quick brown fox jumps over the lazy dog" x 100, "a", 1 ], [ "The quick brown fox jumps over the lazy dog" x 100, "abcdefghijklmnopqrstuvwxyzT", 1 ], [ "The quick brown fox jumps over the lazy dog" x 100, "abcdefghijklmnopqrstuvwxyzTU", 0 ], ); # Short sentence, long wantedChars my @shortLongTestCases = ( [ "The quick brown fox jumps over the lazy dog", "abcdefghijklmnopqrstuvwxyzT"x100, 1 ], [ "The quick brown fox jumps over the lazy dog", "abcdefghijklmnopqrstuvwxyzTU"x100, 0 ], ); # Long sentence, long wantedChars my @longLongTestCases = ( [ "The quick brown fox jumps over the lazy dog" x 100, "abcdefghijklmnopqrstuvwxyzT"x100, 1 ], [ "The quick brown fox jumps over the lazy dog" x 100, "abcdefghijklmnopqrstuvwxyzTU"x100, 0 ], ); # VERY long test case my $alphabet=(join '' => 'a' .. 'z', 'A' .. 'Z'); my @veryLongTestCase = ( [ random(10000, $alphabet), $alphabet, 1 ], ); my @testCases = ( @shortTestCases, @shortLongTestCases, @longShortTestCases, @longLongTestCases, @veryLongTestCase ); sub random { my $num = shift; my $wantedChars = shift; my @chars = ('a' .. 'z', 'A' .. 'Z'); my $result; # make sure result will work do { $result = join '' => map $chars[rand @chars], 1 .. $num; } until Tanktalus_AllIndex($result, $wantedChars); return $result; } sub test_routine { my ($testFn, $testName)=@_; foreach(@testCases) { my ($sentence, $wantedLetters, $expectedResult) = @$_; ok (!!($testFn->($sentence, $wantedLetters)) == !!$expectedResu +lt); } } sub benchmark_routine { my ($testFn, $testName, $testCases)=@_; foreach(@$testCases) { my ($sentence, $wantedLetters, $expectedResult) = @$_; $testFn->($sentence, $wantedLetters) for 1..20; } } # [id://707122] sub tallulah_OriginalPost { my ($sentence, $wantedLetters)=@_; my $flag=0; my @a = split '',$wantedLetters; for( my $i=0; $i<$#a+1; $i++ ) { if($sentence !~ /$a[$i]/) { $flag=1;last; } } return !$flag; } # [id://707123] sub moritz_BuildRegex { my ($sentence, $wantedLetters)=@_; my $re = '^' . join '', map "(?=.*?$_)", map quotemeta, split m//, $wantedLetters; if ($sentence =~ m/$re/) { return 1; } return 0; } # [id://707123] sub moritz_BuildRegex_WithStudy { my ($sentence, $wantedLetters)=@_; my $re = '^' . join '', map "(?=.*?$_)", map quotemeta, split m//, $wantedLetters; study $sentence; if ($sentence =~ m/$re/) { return 1; } return 0; } # [id://707124] sub RMGir_index { my ($sentence, $wantedLetters)=@_; # don't need this variable (or any of them, in # fact -- they're just here for clarity. # we could work straight out of @_ if we wanted # this terser # Also, the $[ check is just pedantic - if someone # changes $[, shoot them. my $foundLetters=scalar (grep index($sentence,$_)>=$[, split //,$wantedLetters); return length($wantedLetters)==$foundLetters; } # [id://707222] sub Tanktalus_AllRegex { my ($sentence, $letters) = @_; return 1 unless length($letters); # all we're doing is checking for each letter. all { $sentence =~ $_ } split //, $letters; } # [id://707222] sub Tanktalus_AllRegex_Study { my ($sentence, $letters) = @_; return 1 unless length($letters); study $sentence; # all we're doing is checking for each letter. all { $sentence =~ $_ } split //, $letters; # same as above, but with index which I think is less readable. #all { index($sentence, $_) >= $[ } split //, $letters; } # [id://707222] sub Tanktalus_AllIndex { my ($sentence, $letters) = @_; return 1 unless length($letters); # same as above, but with index which I think is less readable. all { index($sentence, $_) >= $[ } split //, $letters; } # JavaFan's looks about equivalent to OP approach # Doesn't have same repeated letter semantics specified in # OP post. # [id://707176] sub oshalla_scan { my ($sentence, $wanted) = @_ ; while (length($wanted)) { return 0 if ($sentence !~ m/([$wanted])/g) ; $wanted =~ s/$1// ; } ; return 1; } # [id://707231] sub varian_hash { my ($sentence, $wantedLetters)=@_; my %required = map {$_ => 1} split //,$wantedLetters; map delete $required{$_}, split //, $sentence; if (keys %required) { return 0; } else { return 1; } } # [id://707314] sub RMGir_slice { my ($sentence, $wantedLetters)=@_; my %required; @required{split //,$wantedLetters}=(); delete @required{split //, $sentence}; if (keys %required) { return 0; } else { return 1; } } # [tassilo]'s test cases sub makeSubRef { return @_[1,0]; } my %testSubroutines=( makeSubRef(\&tallulah_OriginalPost, "tallulah_OriginalPost"), makeSubRef(\&moritz_BuildRegex, "moritz_BuildRegex"), makeSubRef(\&moritz_BuildRegex_WithStudy, "moritz_BuildRegex_WithStudy"), makeSubRef(\&RMGir_index, "RMGir_index"), makeSubRef(\&Tanktalus_AllRegex, "Tanktalus_AllRegex"), makeSubRef(\&Tanktalus_AllRegex_Study, "Tanktalus_AllRegex_Study"), makeSubRef(\&Tanktalus_AllIndex, "Tanktalus_AllIndex"), makeSubRef(\&varian_hash, "varian_hash"), makeSubRef(\&RMGir_slice, "RMGir_slice"), tassilo_listutils_r => sub { return 1 unless length($_[1]); all { rindex($_[0], $_) >= 0 } split //, $_[1]; }, repellent_unpack => sub { my ($sentence, $wantedLetters)=@_; my $foundLetters=scalar (grep index($sentence,$_)>=$[, unpack "(a)*", $wantedLetters); return length($wantedLetters)==$foundLetters; }, repellent_unpack_opt => sub { length($_[1])==scalar (grep index($_[0],$_)>=$[, unpack "(a)*", $_[1]); }, unpack_allindex => sub { return 1 unless length($_[1]); my ($sentence, $wantedLetters)=@_; all { index($sentence, $_) >= $[ } unpack "(a)*", $wantedLetters; }, unpack_allrindex => sub { return 1 unless length($_[1]); my ($sentence, $wantedLetters)=@_; all { rindex($sentence, $_) >= $[ } unpack "(a)*", $wantedLetters; }, tye2_opt => sub { while( $_[1] =~ /(.)/gs ) { return 0 if -1 == index($_[0],$1); } return 1; }, tye1_opt => sub { -1 == index($_[0],$1) && return 0 while( $_[1] =~ /(.)/gs ); return 1; }, tye0_opt => sub { -1 == index( $_[0], $1 ) && return 0 while( $_[1] =~ /(.)/gs ); return 1; }, tye2 => sub { my( $sentence, $wantedLetters )= @_; while( $wantedLetters =~ /(.)/gs ) { return 0 if -1 == index($sentence,$1); } return 1; }, tye1 => sub { my( $sentence, $wantedLetters )= @_; -1 == index($sentence,$1) && return 0 while( $wantedLetters =~ /(.)/gs ); return 1; }, tye0 => sub { -1 == index( $_[0], $1 ) && return 0 while( $_[1] =~ /(.)/gs ); return 1; }, # FAILS the "0" "0" test.... buk => sub { my( $s, $w ) = @_; my $c; 1+index $s, $c or return 0 while $c = chop $w; 1; }, buk_substr => sub { my( $s, $w ) = @_; 1+index $s, substr($w,$_,1) or return 0 foreach 0..length($w); 1; }, # FAILS the "0" "0" test.... buk2 => sub { local $_; 1+index $_[0], $_ or return while $_ = chop $_[1]; 1; }, buk2_len => sub { local $_; 1+index $_[0], $_ or return while length($_ = chop $_[1]); 1; }, buk3 => sub { 1+index $_[0], chop $_[ 1 ] or return for 1 .. length $_[ 1 ]; 1; }, buk4 => sub { ( -1 != index $_[0], chop $_[ 1 ] ) || return for 1 .. length +$_[ 1 ]; 1; }, ysth_loookahead => sub { my ($sentence, $wantedLetters) = @_; $wantedLetters =~ s/(.)/(?=.*?\Q$1\E)/sg; $sentence =~ /^$wantedLetters/s; } # # These 2 routines need more debugging - they fail the tests # tassilo_xor => sub { # return 1 unless length($_[1]); # my $copy = $_[0]; # for (split //, $_[1]) { # my $mask = ($_) x length $copy; # $copy ^= $mask; # $copy =~ tr/\000//d; # $copy ^= ( ($_) x length $copy ); # }; # length($copy) == 0; # }, # tassilo_tr => sub { # return 1 unless length($_[1]); # my $copy = $_[0]; # eval "\$copy =~ tr/$_[1]//d"; # length($copy) == 0; # } ); if(@ARGV && $ARGV[0] eq "-t") { use Test::More; plan tests => ((scalar keys %testSubroutines) * scalar @testCases) +; print "Testing routines...\n"; foreach my $name(sort keys %testSubroutines) { print "Testing $name\n"; test_routine($testSubroutines{$name}, $name); } exit(0); } print "Running benchmarks...\n"; my $testsRef = \@shortTestCases; my $benchmark_routines={ map { ($_, eval qq[sub { benchmark_routine(\$testSubroutines{"$_"}, "$_", \$testsRef); } ]) } keys %testSubroutines }; print "Short \n"; cmpthese(-1, $benchmark_routines); $testsRef = \@longShortTestCases; print "\n\n"; print "LongShort \n"; cmpthese(-1, $benchmark_routines); $testsRef = \@shortLongTestCases; print "\n\n"; print "ShortLong \n"; cmpthese(-1, $benchmark_routines); $testsRef = \@longLongTestCases; print "\n\n"; print "LongLong \n"; cmpthese(-1, $benchmark_routines); $testsRef = \@veryLongTestCase; print "\n\n"; print "VeryLong \n"; cmpthese(-1, $benchmark_routines);

Mike

Comment on Benchmarking "Are all these characters in this sentence?"
Download Code
Re: Benchmarking "Are all these characters in this sentence?"
by moritz (Cardinal) on Aug 28, 2008 at 23:05 UTC
    I see two interesting variables that you could change - the length of the string (and not just up to about 20, but perhaps up to 20k characters), and the number of strings that each set of characters is tested on.

    Currently the comparison is just a random point of data in the big, two-dimensional benchmark space ;-)

      Well, to get really thorough, we should also have some utf-8 string tests. But I haven't done anything with utf-8 in perl, so I'll leave writing those cases to someone else.

      I've added a few 3k-ish character search strings, and a few 3k-ish character character sets... Doing the long cases moves the results more in favour of Tanktalus_AllIndex:

      Short sentence and search set cases
      
      tallulah_OriginalPost            2054/s
      Tanktalus_AllRegex               2511/s
      Tanktalus_AllRegex_Study         2522/s
      moritz_BuildRegex_WithStudy      2595/s
      moritz_BuildRegex                2715/s
      varian_hash                      2983/s
      RMGir_slice                      4035/s
      Tanktalus_AllIndex               8219/s
      RMGir_index                     12107/s
      
      
      Long sentence and Short search set cases
      
      varian_hash                      97.2/s
      RMGir_slice                       115/s
      moritz_BuildRegex_WithStudy      3172/s
      tallulah_OriginalPost            3230/s
      Tanktalus_AllRegex_Study         3319/s
      Tanktalus_AllRegex               4054/s
      moritz_BuildRegex                4250/s
      Tanktalus_AllIndex              13032/s
      RMGir_index                     17612/s
      
      Short sentence and Long search set cases
      moritz_BuildRegex_WithStudy      54.1/s
      moritz_BuildRegex                54.6/s
      tallulah_OriginalPost            63.6/s
      Tanktalus_AllRegex               86.9/s
      Tanktalus_AllRegex_Study         87.1/s
      varian_hash                       161/s
      RMGir_index                       285/s
      RMGir_slice                       319/s
      Tanktalus_AllIndex                320/s
      
      Long sentence and Long search set cases
      moritz_BuildRegex_WithStudy      54.1/s
      moritz_BuildRegex                54.6/s
      tallulah_OriginalPost            63.6/s
      varian_hash                      86.9/s
      Tanktalus_AllRegex_Study         87.7/s
      Tanktalus_AllRegex               87.7/s
      RMGir_slice                       135/s
      RMGir_index                       250/s
      Tanktalus_AllIndex                319/s
      
      Here's the benchmark code with the added data points:

      Mike
Re: Benchmarking "Are all these characters in this sentence?"
by Jenda (Abbot) on Aug 29, 2008 at 08:57 UTC

    I tried a little modified version of the RMGir_index one and the results are ... random. Especially since you do a lot more work in the benchmark than just the test you want to compare, the results depend on too many things and the actual different code seems to play only small part of it. I think you should at least change the benchmark_routine() to something like

    sub benchmark_routine { my ($testFn, $testName)=@_; foreach(@testCases) { my ($sentence, $wantedLetters, $expectedResult) = @$_; $testFn->($sentence, $wantedLetters) for (1..20); die "$testName test failed ($sentence, $wantedLetters)" unless (($testFn->($sentence, $wantedLetters))==$expectedRes +ult); } }
    to give a little more weight to the tested subroutine. Not that it helped too much in this case.

    sub RMGir_indexNot { my ($sentence, $wantedLetters)=@_; # don't need this variable (or any of them, in # fact -- they're just here for clarity. # we could work straight out of @_ if we wanted # this terser # Also, the $[ check is just pedantic - if someone # changes $[, shoot them. return !(grep index($sentence,$_)<$[, split //,$wantedLetters); }
      Good point. My assumption when I wrote the code was that the testFn routines were going to be fairly expensive, so benchmark_routine wasn't unacceptable overhead.

      I should have written separate test and benchmark routines, rather than let the benchmark run and check my test cases.

      Re-running the tests with the 1..20 fix you suggested does alter the percentages between tests, but doesn't change the relative rankings much.

      Short
      
      tallulah_OriginalPost             107/s
      Tanktalus_AllRegex_Study          124/s
      Tanktalus_AllRegex                130/s
      varian_hash                       152/s
      RMGir_slice                       207/s
      moritz_BuildRegex_WithStudy       243/s
      moritz_BuildRegex                 261/s
      Tanktalus_AllIndex                452/s
      RMGir_index                       678/s
      
      
      LongShort                   
      
      varian_hash                      4.76/s
      RMGir_slice                      5.77/s
      tallulah_OriginalPost             166/s
      Tanktalus_AllRegex_Study          169/s
      Tanktalus_AllRegex                207/s
      moritz_BuildRegex_WithStudy       241/s
      moritz_BuildRegex                 395/s
      Tanktalus_AllIndex                731/s
      RMGir_index                      1000/s
      
      
      ShortLong
      
      tallulah_OriginalPost            3.15/s
      Tanktalus_AllRegex_Study         4.27/s
      Tanktalus_AllRegex               4.31/s
      moritz_BuildRegex_WithStudy      5.22/s
      moritz_BuildRegex                5.26/s
      varian_hash                      7.84/s
      RMGir_index                      13.9/s
      RMGir_slice                      15.3/s
      Tanktalus_AllIndex               15.8/s
      
      
      LongLong
      
      tallulah_OriginalPost            3.17/s
      Tanktalus_AllRegex_Study         4.24/s
      varian_hash                      4.35/s
      Tanktalus_AllRegex               4.35/s
      moritz_BuildRegex                5.17/s
      moritz_BuildRegex_WithStudy      5.22/s
      RMGir_slice                      6.60/s
      RMGir_index                      12.5/s
      Tanktalus_AllIndex               15.5/s
      
      

      Mike
Re: Benchmarking "Are all these characters in this sentence?"
by repellent (Priest) on Aug 29, 2008 at 22:18 UTC
    unpack()ing the letters, instead of split()ting them, may boost the performance slightly:
    sub RMGir_index { my ($sentence, $wantedLetters)=@_; ## comment removed ## my $foundLetters=scalar (grep index($sentence,$_)>=$[, unpack "(a)*", $wantedLetters); return length($wantedLetters)==$foundLetters; }
      repellent++ !!

      I never thought of that.

      Happily, JUST before I saw that you posted that, I posted a new version of the code that makes adding variants easy, so I added your variant, as well as variants using unpack + List::MoreUtils, including a new variant from tassilo.

      Your unpack approach wins!! Thanks for pointing that option out...

      The parent post is updated with the new code and the new results.


      Mike

      Rather than change the tiny operation used to generate a list of letters, just avoid generating a list of letters at all:

      sub tye2 { my( $sentence, $wantedLetters )= @_; while( $wantedLetters =~ /(.)/gs ) { return 0 if -1 == index($sentence,$1); } return 1; }

      If you want to go for ugly code for the sake of micro-optimizations, then

      sub tye1 { my( $sentence, $wantedLetters )= @_; -1 == index($sentence,$1) && return 0 while( $wantedLetters =~ /(.)/gs ); return 1; }

      Or get even uglier to the point of risking improper behavior in some cases:

      sub tye0 { -1 == index( $_[0], $1 ) && return 0 while( $_[1] =~ /(.)/gs ); return 1; }

      - tye        

        Very nice!

        Except for the "VeryLong" test case, one of those wins all the other benchmarks. For the short charset cases, tye0 wins, while for the others, tye1 and tye2 are tied :) and slightly better than tye0.

        And on the VeryLong case, repellent's unpack win, but the 2nd place results look like a wash between several of the approaches, and yours are quite competitive.

        I've updated the parent node with your subs, thanks!


        Mike
Re: Benchmarking "Are all these characters in this sentence?" (quickest yet?)
by BrowserUk (Pope) on Aug 30, 2008 at 17:32 UTC
      This IS fast, but if I run this in my test harness, for some reason it destroys the @testArray contents.

      Aha, found it. $_ isn't localized, so using it that way trashes the array. @#$%@#$.

      buk => sub { my( $s, $w ) = @_; my $c; 1+index $s, $c or return 0 while $c = chop $w; 1; }, works just fine.


      Mike
Re: Benchmarking "Are all these characters in this sentence?"
by rhesa (Vicar) on Aug 30, 2008 at 19:01 UTC
    Here are the results using perl, v5.10.0 built for x86_64-linux. Interesting changes in ranking.

    I'm surprised tye's /./gs solution took such a hit. The unpack and index solutions did get a bit faster, but not much. Weren't regexes supposed to have gotten a bit faster in 5.10?

    Short Rate tallulah_OriginalPost tallulah_OriginalPost 114/s -- Tanktalus_AllRegex 160/s 40% Tanktalus_AllRegex_Study 161/s 41% varian_hash 190/s 66% moritz_BuildRegex_WithStudy 293/s 157% moritz_BuildRegex 313/s 174% RMGir_slice 345/s 203% tye2 426/s 273% tye1 426/s 273% tye0 570/s 399% unpack_allindex 598/s 424% unpack_allrindex 599/s 425% repellent_unpack 710/s 522% Tanktalus_AllIndex 710/s 522% RMGir_index 881/s 671% tassilo_listutils_r 1037/s 808% buk 1280/s 1021%
      I'm having a lot of fun with this... I was so sure slicing was going to beat everything else, how wrong!

      I expected 5.10.0 to shuffle things, but not this way, thanks!


      Mike
Re: Benchmarking "Are all these characters in this sentence?"
by ysth (Canon) on Aug 31, 2008 at 17:22 UTC
    Re: chop, the "buk" solution fails on this test:
    [ "abxcd zwe rrv", "0", 0 ],
    because 0 is false.

    On a more conceptual level, it fails in that the return value of chop on an empty string is not documented. But in practice, this would be the fix:

    1+index $s, $c or return 0 while length($c = chop $w);
      Good point!

      I added that test case, as well as a

      [ "0", "0", 1],
      to test for those pesky perl falsehood values.

      buk and buk2 both pass this new test and fail yours. (Edit: Corrected, thanks ysth)

      Adding length makes buk2 correct, and doesn't penalize the speed much:

      buk2_len                    4610/s                 4007%
      buk2                        4791/s                 4168%  
      

      Mike

        Try it this way. I believe it addresses both the real and conceptual errors and runs even faster:

        buk3 => sub { 1+index $_[0], chop $_[ 1 ] or return for 1 .. length $_[ 1 ]; 1; },

        Who says optimisation can't also result in correct code. Such fun :)


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Benchmarking "Are all these characters in this sentence?"
by ysth (Canon) on Aug 31, 2008 at 17:23 UTC
Re: Benchmarking "Are all these characters in this sentence?"
by martin (Pilgrim) on Sep 05, 2008 at 14:33 UTC
    Just for completeness, try this one, too:
    martin => sub { # args: sentence, wantedchars '' eq $_[1] || '' ne $_[0] && $_[1] =~ /^[\Q$_[0]\E]*\z/; },
    Putting the whole sentence into a character class will kind of suck with veeery long sentences but medium-sized cases should benefit from the one-pass approach.

    Update: fixed the order of subroutine parameters

      You're checking whether only the wantedchars are used.

      That wouldn't work. The question isn't whether the sentence consists ONLY of characters in the wantedchars, just whether or not ALL the wantedchars are present.


      Mike
        You're checking whether only the wantedchars are used.

        Not quite. It's supposed to be the other way round. The idea is to check whether $wantedchars only contains characters from $sentence.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://707621]
Approved by GrandFather
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (5)
As of 2014-04-21 05:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (490 votes), past polls