perrin has asked for the
wisdom of the Perl Monks concerning the following question:
We all know the classic perl solutions for finding the unique elements in a list, as described in perlfaq4's "How can I remove duplicate elements from a list or array?" For your benefit:
undef %saw;
@out = grep(!$saw{$_}++, @in);
I have multiple large sets, which I need to find the union of. I'm wondering if there's anything better than this standard approach, where better means faster. (But little tweaks with "exists" or hashslices that don't change the basic algorithm aren't enough to be interesting.) It seems unlikely, but I thought maybe the fact that I can get the individual sets sorted ahead of time might allow some sneaky trick. Any ideas?
Re: better union of sets algorithm? by Zaxo (Archbishop) on Mar 11, 2005 at 05:25 UTC 
@saw{@in} = ();
@out = keys %saw;
The recentlyseenhere,
undef @saw{@in};
may be as fast, but it doesn't really sing out its intent.
 [reply] [d/l] [select] 
Re: better union of sets algorithm? by sgifford (Prior) on Mar 11, 2005 at 05:44 UTC 
If they're sorted in advance and you have a constant number of lists, you should be able to do it in linear time (instead of the n lg n that using a hash provides). The basic technique is to think of the lists of queues. Now search through all of these queues and find the lowest item at the queue head. Print it. If there are any identical items, they will also be at the queue heads; remove all of these items. Now find the next lowest item, and repeat.
Update: This assumes you can get the lists sorted for free, or at least for cheaper than the n lg n it would take to process the hashes. If that's the case, here's some simple code that demonstrates the idea, using an object to make the idea clearer:
#!/usr/bin/perl
use warnings;
use strict;
my @a1 = sort qw(foo bar blarn schmark floogle foo blarn);
my @a2 = sort qw(flirp schnirp blarn floogle florn flimple flange);
my @a3 = sort qw(bar bar floogle bar florn bar bar);
my $sml = SortedMultiList>new(\@a1,\@a2,\@a3);
my $last;
while (defined(my $next = $sml>shift_lowest))
{
print $next,"\n"
if (!defined($last) or ($last ne $next));
$last = $next;
}
package SortedMultiList;
sub new
{
my $class = shift;
my $self = [@_];
bless $self, $class;
}
sub shift_lowest
{
my $self = shift;
my($lowest_arr,$lowest);
foreach my $a (@$self)
{
next unless (@$a);
if (!defined($lowest) or ($a>[0] lt $lowest))
{
$lowest_arr = $a;
$lowest = $a>[0];
}
}
return $lowest_arr
? shift(@$lowest_arr)
: undef;
}
 [reply] [d/l] [select] 

You can do slightly better than what you just suggested for a mergesort of k lists (when k is not supersmall). If you have k sorted lists, pop the first item off each and put them into a heap, with some indication of which list they came from. Take off the lowest item off the heap, if it's not a duplicate of the item we last saw, put it into the output list. Pop the list this item came from and put it into the heap. Repeat.
This gives you an O(n log k) mergesort of k lists (where n is the total number of elements in all the lists). Blindly searching through the first elements of each list as you suggest gives an O(nk) algorithm. Of course, when the number of lists is constant, they are both still O(n), but the heap version will have a lower constant (unless k is very small).
What I fail to understand is why you say that perrin's hash algorithm takes O(n log n) time. No, a hash dereferencing operation takes amortized constant time, and you are just doing a dereference for each element in each list. That's a linear time algorithm.
I'm afraid perrin won't be able to better than the linear time algorithm he presents (other than small tweaks of the constant, which he explicitly expressed disinterest in). Think about it, how could you remove duplicates in the general case without at least looking at each element? Yes, there are surely some obscene special cases where you might do better at first, but it's a safe bet that you're going to copy the result into an array somewhere which takes O(n) anyway...
 [reply] 

100000: 446368.754 inserts/sec, 1050330.051 lookups/sec
200000: 424977.633 inserts/sec, 821176.759 lookups/sec
400000: 398651.394 inserts/sec, 805369.896 lookups/sec
800000: 399463.754 inserts/sec, 787853.768 lookups/sec
1600000: 390381.775 inserts/sec, 766870.015 lookups/sec
3200000: 377860.715 inserts/sec, 720909.739 lookups/sec
Then again, my recollections of algorithmic amortizations are hazy at best, so feel free to convince me that I'm wrong. :)
 [reply] [d/l] [select] 



 


instead of the n lg n that using a hash provides
You are wrong here. The expected amortized time for hash operations is O(k), where k is the length of the string (it's the time that's needed to compute the hash value). Note the absense of a dependency on the size of a hash. This means that if you do N hash operations (insertions, deletions, fetches) and all your strings are bounded in length by some constant, your expected time is O(N).
Now, an individual insertion in a hash may take O(n) (where n is the size of the hash). But this typically happens only after Ω(n) insertions (due to the hash filling up and a new hash needs to be build), so that amortizes out.
An individual search may take Ω(n) time because a
fraction of your keys hash to the same bucket. That would lead to a worse case quadratic performance. But the chances of that happening are so small, it doesn't have an impact on the expected running times. And ever since 5.8.1, Perl has some safety mechanisms (turned on by default) that even prevents you from constructing an input sequence that hashes the keys to the same bucket, because that hash function uses a different values on each run.
So, for all practical purposes, hash operations are done in constant time.
 [reply] 
Re: better union of sets algorithm? by BrowserUk (Pope) on Mar 11, 2005 at 06:39 UTC 
If your values are integers in a sane range (ie. reasonably low) you can save a little time, upto around 60% or so, by using a bit vector instead of a hash:
#! perl slw
use strict;
use Benchmark qw[ cmpthese ];
our $B = 32;
our $N = 1000;
our $S = 100;
our $R = $N * $S;
our @sets = map{ [ map int rand $R, 1 .. $N ] } 1 .. $S;
our( @hUniq, @vUniq );
cmpthese 2, {
hash => q[
my %seen;
undef @hUniq;
@hUniq = grep{ $seen{ $_ }++ } @$_ for @sets;
],
vec => q[
my $vector = '';
undef @vUniq;
@vUniq = grep{ vec( $vector, $_, $B )++ } @$_ for @sets;
],
};
print 'H:'. @hUniq, ' V: '. @vUniq;
__END__
P:\test>438536
Rate hash vec
hash 2.69/s  23%
vec 3.47/s 29% 
H:954 V: 954
P:\test>438536 S=20
Rate hash vec
hash 16.4/s  41%
vec 27.5/s 68% 
H:639 V: 639
P:\test>438536 S=20 N=100
Rate hash vec
hash 264/s  21%
vec 336/s 27% 
H:61 V: 61
P:\test>438536 S=20 N=2000
Rate hash vec
hash 7.33/s  36%
vec 11.5/s 56% 
H:1407 V: 1407
P:\test>438536 S=20 N=2000 B=16
Rate hash vec
hash 7.06/s  40%
vec 11.8/s 67% 
H:1399 V: 1399
If you could build and manipulate your sets as bit vectors, by storing your uniq values in an hash as you get them and setting bits in the vectors to represent them, ORing the bit vectors will get your union. You then use the setbit positions as indexes into your array of unique values to reconstitute the union set.
It works for any type of values and is very fast provided you can build and work with your sets that way.
Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco.
 [reply] [d/l] 

Only if they are nonnegative integers....
(Well, theoretically, for any set for which you have a fast, 1to1 mapping to the set of nonnegative integers, this method will work. For instance, if all your members are negative integers, multiplying all members with 1 gives you positive integers to put inside the bit string)
 [reply] 

 [reply] [d/l] [select] 

I had the same thought when reading through Mastering Algorithms with Perl, but I'm actually working with strings here. I can't think of a cheap way to map strings to bits that would work for this. It may be worth trying to keep track of integers for these since I could probably do it cheaply as they are added to my database and then the union could be done this way.
 [reply] 

Yes. The mapping is the crux of the issue.
If your doing the unions (or intersections, sym.diffs), on a regular basis, then it can be worth the effort of building a uniq index (offline) and replacing your sets of strings with bitvectors mapped against that index.
You then hold and maintain your sets as bitvectors and all the set manipulations become easy and efficient, except adding (and to lesser extent displaying), which requires mapping.
The index doesn't need to be ordered in any way, just unique. though ordering them does allow for the use of a binary chop for lookups when adding (or displaying).
Whether the offline work of mapping can be amortised to effect an overall saving depends on how often your sets change and how often you need to do the unions.
Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco.
 [reply] 
Re: better union of sets algorithm? by demerphq (Chancellor) on Mar 11, 2005 at 08:43 UTC 
I think that you might try something like this:

Modify the keys such that they store both the real key and the set they come from, something like "$key$set" might work. The values used for $set should be something like "A", "B" etc.. IE, sortable.

Sort the keys from the sets together. (If the lists are already sorted skip this set and just use a merge instead of a scan in the following step.)

Scan the set. If there are duplicate keys from the same set they will be together. Ignore one of them. Count the dupe keys with different set values (they will be inorder, you only need to worry about dupes). If the number of keys is the same as the number of sets then you have a union.
Whether this is faster will probably come down to data size. For really high volumes a hash approach wont work as it wont be sufficient memory efficient. (Remember a hash will normally have the same number of buckets as the next power of two larger than the number of keys in the hash.) For integers I would use BrowserUk's approach, for strings I would probably use a hash unless the data volume was really high (or i could be guaranteed the data was already in sorted form) and then i would go with something like the above.
 [reply] 

Thanks, but I think you're talking about an intersection, not a union. I want to get a unique list of all items that are in any set.
 [reply] 

so, for example (to clarify):
my @a = qw( 1 2 3 3 2 );
my @b = qw( 1 4 5 5 4 );
my @c = qw( 1 6 7 7 6 );
should give back
1 2 3 4 5 6 7
right? basically you could make one big list and find the uniques from that?

It's sad that a family can be torn apart by such a such a simple thing as a pack of wild dogs
 [reply] [d/l] [select] 


 [reply] 


Re: better union of sets algorithm? by JediWizard (Deacon) on Mar 11, 2005 at 15:58 UTC 
This comes from a standard utility library I wrote for myself years ago. uniqStrings will work faster, but if your data contains references they will be broken. uniqValues will work even if your list contains references.
#___________________________________
# Function: uniqStrings()
# Author: Dave Mueller
# Returns: @uniq
# Parameters: @array
# Description: Returns a unique list of all Strings found in @array
# Note: references wil be broken . . .
#___________________________________
sub uniqStrings
{
my %temp = ();
@temp{@_} = ();
return keys %temp;
}
#___________________________________
# Function: uniqValues()
# Author: Dave Mueller
# Returns: @uniq
# Parameters: @array
# Description: Returns a unique list of all values found in @array
#___________________________________
sub uniqValues
{
my %temp = ();
@temp{@_} = (1 .. scalar(@_));
return map({$_[$_]} values %temp);
}
A truely compassionate attitude towards other does not change, even if they behave negatively or hurt you —His Holiness, The Dalai Lama
 [reply] [d/l] 

Thanks, but that's the same as the one I quoted from perlfaq4.
 [reply] 
Re: better union of sets algorithm? by injunjoel (Priest) on Mar 11, 2005 at 20:14 UTC 
 [reply] [d/l] [select] 

Thanks, but that's an intersection, not a union.
 [reply] 
Re: better intersection of sets algorithm? by gam3 (Curate) on Mar 11, 2005 at 20:23 UTC 
This Code is for intersection for union code see Re^3: better union of sets algorithm?
It seems that you can get a lot of improvement.
Here are two tests I ran. The first is using two sets each with about half of /usr/dict/words and the third having the word zoo in it.
Union wineglasses zoo
Union wineglasses zoo
(warning: too few iterations for a reliable count)
s/iter sorted unsorted
sorted 4.34  78%
unsorted 0.956 354% 
Intersection wineglasses zoo
Intersection wineglasses zoo
The data for this run is
@a1 = qw (a b c d g);
@a2 = qw (a b c e d);
@a3 = qw (a b c f g);
Intersection a b c
Intersection a b c
Rate unsorted sorted
unsorted 205/s  93%
sorted 2775/s 1251% 
Intersection a b c
Intersection a b c
So even with a small data set the sorted method is faster.
For both methods the sets must have unique elements (as all
set do).
sub intersection_1
{
my $count = shift;
$count;
my %saw;
grep($count == $saw{$_}++, @_);
}
# This code is based on the SortedMultiList code but works
# for an intersection of more than two sets.
sub intersection_2
{
my $self = [@_];
my ($lowest_arr, $lowest, $next);
my @ret = ();
my $elements = scalar @_;
while (1) {
my $count = 0;
$lowest = $next;
if (!defined($lowest)) {
# find min element in queue
foreach my $a (@$self)
{
return @ret unless (@$a);
if (!defined($lowest) or ($a>[0] lt $lowest))
{
$lowest = $a>[0];
}
}
}
foreach my $a (@$self)
{
return @ret unless (@$a);
if ($a>[0] eq $lowest)
{
shift @$a;
$count++;
} else {
if (!defined($next) or ($a>[0] lt $next))
{
$next = $a>[0];
}
}
}
if ($count == $elements) {
push(@ret, $lowest);
}
$lowest = $next;
undef $next;
}
}
Here is the Benchmark code. How the argments are passed may have some effect on the speed. intersection_2 modifies the arrays so I do the copies to make the test more fair.
cmpthese 10, {
unsorted => q[
my @x1 = @a1;
my @x2 = @a2;
my @x3 = @a3;
my @out = intersection_1(3, @x1, @x2, @x3);
],
sorted => q[
my @x1 = @a1;
my @x2 = @a2;
my @x3 = @a3;
my @out = intersection_2(\@x1, \@x2, \@x3);
]
};
The SortedMultiList code is at Re: better union of sets algorithm?.
A picture is worth a thousand words, but takes 200K.
 [reply] [d/l] 

Thanks, but it looks like your code is computing the intersection of the sets, not the union.
 [reply] 

Yes, you are correct. How embaracing.
The timing remains about the same though.
Union_1 96235
Union_2 96235
(warning: too few iterations for a reliable count)
s/iter sorted unsorted
sorted 4.39  66%
unsorted 1.48 196% 
Union_1 96235
Union_2 96235
Union a b c d g e f
Union a b c d e f g
Rate unsorted sorted
unsorted 202/s  92%
sorted 2511/s 1144% 
Union a b c d g e f
Union a b c d e f g
sub union_1
{
my $count = shift;
$count;
my %saw;
grep(0 == $saw{$_}++, @_);
}
sub union_2
{
my $self = [@_];
my ($lowest_arr, $lowest, $next);
my @ret = ();
my $elements = scalar @_;
while (1) {
my $count = 0;
my $work = 0;
my $lowest;
foreach my $a (@$self)
{
next unless (@$a);
if (!defined($lowest) or ($a>[0] le $lowest))
{
$lowest = $a>[0];
}
}
foreach my $a (@$self)
{
next unless (@$a);
if ($a>[0] eq $lowest)
{
shift @$a;
$work++;
}
}
return @ret unless $lowest;
push(@ret, $lowest);
}
}
A picture is worth a thousand words, but takes 200K.
 [reply] [d/l] 

