http://www.perlmonks.org?node_id=724947


in reply to Re: Fast, Efficient Union and Intersection on arrays
in thread Fast, Efficient Union and Intersection on arrays

It seems counterproductive to grep the b array twice. But it's actually faster than using one for loop.

I did a benchmark, and there is a 20% speed boost for doing two greps over just one for loop with short arrays (less than 10). The boost shrinks as the arrays grow in size. At array size ~50 boost is only about 8%.

use strict; use warnings; use Benchmark qw(cmpthese clearallcache); my @a = (1..50); my @b = (3..80); #two_greps(); #one_for(); #exit; print "Range ($a[0],$a[$#a]) vs ($b[0],$b[$#b])\n"; cmpthese( 100_000, { two_greps => \&two_greps, one_for => \&one_for, }); clearallcache(); @a = (1..5); @b = (3..8); print "Range ($a[0],$a[$#a]) vs ($b[0],$b[$#b])\n"; cmpthese( 100_000, { two_greps => \&two_greps, one_for => \&one_for, }); sub two_greps { my %hash; @hash{@a} = (1) x @a; my @union = @a, grep { !$hash{$_} } @b; my @intersection = grep { $hash{$_} } @b; # print "union @union\n"; # print "interstection @intersection\n"; } sub one_for { my %hash; @hash{@a} = (1) x @a; my @union = @a; my @intersection; foreach (@b) { if( exists $hash{$_} ) { push @intersection, $_ } else { push @union, $_ } } # print "union @union\n"; # print "interstection @intersection\n"; } __DATA__ Range (1,50) vs (3,80) Rate one_for two_greps one_for 5829/s -- -8% two_greps 6318/s 8% -- Range (1,5) vs (3,8) Rate one_for two_greps one_for 57143/s -- -17% two_greps 68823/s 20% --

Thanks for posting this thought provoking code.


TGI says moo

Replies are listed 'Best First'.
Re^3: Fast, Efficient Union and Intersection on arrays
by admiral_grinder (Pilgrim) on Nov 20, 2008 at 21:13 UTC
    I remember one time getting a noticeable speed improvement by using the 'for (...;...;...)' over a foreach loop for a large array. I wondering if that would have helped here.

      Once again, expectations are shattered: the c style for loop is the slowest.

      sub one_cfor { # my (@a, @b); # initialize @a and @b here my %hash; @hash{@a} = (1) x @a; my @union = @a; my @intersection; for (my $i=0; $i<$#b; $i++ ) { my $bval = $b[$i]; if( exists $hash{$bval} ) { push @intersection, $bval } else { push @union, $bval } } # print "union @union\n"; # print "interstection @intersection\n"; } ################ Range (1,50) vs (3,80) Rate one_cfor one_for two_greps one_cfor 4438/s -- -24% -30% one_for 5834/s 31% -- -8% two_greps 6312/s 42% 8% -- Range (1,5) vs (3,8) Rate one_cfor one_for two_greps one_cfor 50000/s -- -13% -29% one_for 57637/s 15% -- -18% two_greps 70323/s 41% 22% --


      TGI says moo

Re^3: Fast, Efficient Union and Intersection on arrays
by wanradt (Scribe) on Nov 22, 2008 at 10:44 UTC

    But there is flaw in grep-algorithm: it does not union correctly, if first array is shorter than second. If you consider this, then the for-algorithm will about 6-7% faster.

    Nġnda, WK

      Good catch on the bum algorithm! I was able to fix the error by adding some parentheses. However, I still get a good advantage for the two grep method.

      sub two_greps { my %hash; @hash{@a} = (1) x @a; my @union = (@a, grep { !$hash{$_} } @b); my @intersection = ( grep { $hash{$_} } @b); } __DATA__ Range (1,3) vs (3,9) Rate one_cfor one_for two_greps one_cfor 55648/s -- -18% -23% one_for 68027/s 22% -- -5% two_greps 71891/s 29% 6% -- Range (1,5) vs (3,8) Rate one_cfor one_for two_greps one_cfor 50378/s -- -13% -15% one_for 58140/s 15% -- -2% two_greps 59277/s 18% 2% --

      Update: the reason for the error is that = has a higher precedence than ,. Therefore the bad version was like doing: (my @union = @a), grep { !$hash{$_} } @b;. The grep was executed, but the results were discarded.

      A more serious error with the greps is that we test the value in the hash, not its existence. So 0 would never be in the intersection of a set, and always in the union. Two greps still wins by a small amount after that is fixed, too.

      sub two_greps { my %hash; @hash{@a} = (1) x @a; my @union = (@a, grep { ! exists $hash{$_} } @b); my @intersection = ( grep { exists $hash{$_} } @b); } Range (1,3) vs (0,9) Rate one_cfor one_for two_greps one_cfor 50787/s -- -20% -25% one_for 63371/s 25% -- -7% two_greps 68027/s 34% 7% -- Range (1,5) vs (3,8) Rate one_cfor one_for two_greps one_cfor 50813/s -- -12% -18% one_for 57670/s 13% -- -7% two_greps 62150/s 22% 8% --


      TGI says moo