Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Re: Fast, Efficient Union and Intersection on arrays

by moritz (Cardinal)
on Nov 20, 2008 at 16:45 UTC ( [id://724920]=note: print w/replies, xml ) Need Help??


in reply to Fast, Efficient Union and Intersection on arrays

If you know that each array itself doesn't contain duplicate elements, you could try something along these lines:
my (@a, @b); # initialize @a and @b here my %hash: @hash{@a} = (1) x @a; my @union = @a, grep { !$hash{$_} } @b; my @intersection = grep { $hash{$_} } @b;

That has the advantage of using only one hash, which might buy you some performance. But I don't know if it is actually faster than anything else.

(If you provide a small Benchmark with lists of typical sizes you might get much better answers; for example for a large number of small integers it might be more memory efficient to use vec than a hash, which might in turn result in some performance benefit.)

Replies are listed 'Best First'.
Re^2: Fast, Efficient Union and Intersection on arrays
by TGI (Parson) on Nov 20, 2008 at 18:43 UTC

    It seems counterproductive to grep the b array twice. But it's actually faster than using one for loop.

    I did a benchmark, and there is a 20% speed boost for doing two greps over just one for loop with short arrays (less than 10). The boost shrinks as the arrays grow in size. At array size ~50 boost is only about 8%.

    use strict; use warnings; use Benchmark qw(cmpthese clearallcache); my @a = (1..50); my @b = (3..80); #two_greps(); #one_for(); #exit; print "Range ($a[0],$a[$#a]) vs ($b[0],$b[$#b])\n"; cmpthese( 100_000, { two_greps => \&two_greps, one_for => \&one_for, }); clearallcache(); @a = (1..5); @b = (3..8); print "Range ($a[0],$a[$#a]) vs ($b[0],$b[$#b])\n"; cmpthese( 100_000, { two_greps => \&two_greps, one_for => \&one_for, }); sub two_greps { my %hash; @hash{@a} = (1) x @a; my @union = @a, grep { !$hash{$_} } @b; my @intersection = grep { $hash{$_} } @b; # print "union @union\n"; # print "interstection @intersection\n"; } sub one_for { my %hash; @hash{@a} = (1) x @a; my @union = @a; my @intersection; foreach (@b) { if( exists $hash{$_} ) { push @intersection, $_ } else { push @union, $_ } } # print "union @union\n"; # print "interstection @intersection\n"; } __DATA__ Range (1,50) vs (3,80) Rate one_for two_greps one_for 5829/s -- -8% two_greps 6318/s 8% -- Range (1,5) vs (3,8) Rate one_for two_greps one_for 57143/s -- -17% two_greps 68823/s 20% --

    Thanks for posting this thought provoking code.


    TGI says moo

      I remember one time getting a noticeable speed improvement by using the 'for (...;...;...)' over a foreach loop for a large array. I wondering if that would have helped here.

        Once again, expectations are shattered: the c style for loop is the slowest.

        sub one_cfor { # my (@a, @b); # initialize @a and @b here my %hash; @hash{@a} = (1) x @a; my @union = @a; my @intersection; for (my $i=0; $i<$#b; $i++ ) { my $bval = $b[$i]; if( exists $hash{$bval} ) { push @intersection, $bval } else { push @union, $bval } } # print "union @union\n"; # print "interstection @intersection\n"; } ################ Range (1,50) vs (3,80) Rate one_cfor one_for two_greps one_cfor 4438/s -- -24% -30% one_for 5834/s 31% -- -8% two_greps 6312/s 42% 8% -- Range (1,5) vs (3,8) Rate one_cfor one_for two_greps one_cfor 50000/s -- -13% -29% one_for 57637/s 15% -- -18% two_greps 70323/s 41% 22% --


        TGI says moo

      But there is flaw in grep-algorithm: it does not union correctly, if first array is shorter than second. If you consider this, then the for-algorithm will about 6-7% faster.

      Nġnda, WK

        Good catch on the bum algorithm! I was able to fix the error by adding some parentheses. However, I still get a good advantage for the two grep method.

        sub two_greps { my %hash; @hash{@a} = (1) x @a; my @union = (@a, grep { !$hash{$_} } @b); my @intersection = ( grep { $hash{$_} } @b); } __DATA__ Range (1,3) vs (3,9) Rate one_cfor one_for two_greps one_cfor 55648/s -- -18% -23% one_for 68027/s 22% -- -5% two_greps 71891/s 29% 6% -- Range (1,5) vs (3,8) Rate one_cfor one_for two_greps one_cfor 50378/s -- -13% -15% one_for 58140/s 15% -- -2% two_greps 59277/s 18% 2% --

        Update: the reason for the error is that = has a higher precedence than ,. Therefore the bad version was like doing: (my @union = @a), grep { !$hash{$_} } @b;. The grep was executed, but the results were discarded.

        A more serious error with the greps is that we test the value in the hash, not its existence. So 0 would never be in the intersection of a set, and always in the union. Two greps still wins by a small amount after that is fixed, too.

        sub two_greps { my %hash; @hash{@a} = (1) x @a; my @union = (@a, grep { ! exists $hash{$_} } @b); my @intersection = ( grep { exists $hash{$_} } @b); } Range (1,3) vs (0,9) Rate one_cfor one_for two_greps one_cfor 50787/s -- -20% -25% one_for 63371/s 25% -- -7% two_greps 68027/s 34% 7% -- Range (1,5) vs (3,8) Rate one_cfor one_for two_greps one_cfor 50813/s -- -12% -18% one_for 57670/s 13% -- -7% two_greps 62150/s 22% 8% --


        TGI says moo

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://724920]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others sharing their wisdom with the Monastery: (5)
As of 2024-04-23 09:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found