Beefy Boxes and Bandwidth Generously Provided by pair Networks RobOMonk
Perl Monk, Perl Meditation
 
PerlMonks  

better union of sets algorithm?

by perrin (Chancellor)
on Mar 11, 2005 at 05:10 UTC ( #438536=perlquestion: print w/ replies, xml ) Need Help??
perrin has asked for the wisdom of the Perl Monks concerning the following question:

We all know the classic perl solutions for finding the unique elements in a list, as described in perlfaq4's "How can I remove duplicate elements from a list or array?" For your benefit:
undef %saw; @out = grep(!$saw{$_}++, @in);
I have multiple large sets, which I need to find the union of. I'm wondering if there's anything better than this standard approach, where better means faster. (But little tweaks with "exists" or hash-slices that don't change the basic algorithm aren't enough to be interesting.) It seems unlikely, but I thought maybe the fact that I can get the individual sets sorted ahead of time might allow some sneaky trick. Any ideas?

Comment on better union of sets algorithm?
Download Code
Re: better union of sets algorithm?
by Zaxo (Archbishop) on Mar 11, 2005 at 05:25 UTC

    I think that assigning a slice to the empty list is probably as good as it gets.

    @saw{@in} = (); @out = keys %saw;
    The recently-seen-here, undef @saw{@in};
    may be as fast, but it doesn't really sing out its intent.

    After Compline,
    Zaxo

Re: better union of sets algorithm?
by sgifford (Prior) on Mar 11, 2005 at 05:44 UTC
    If they're sorted in advance and you have a constant number of lists, you should be able to do it in linear time (instead of the n lg n that using a hash provides). The basic technique is to think of the lists of queues. Now search through all of these queues and find the lowest item at the queue head. Print it. If there are any identical items, they will also be at the queue heads; remove all of these items. Now find the next lowest item, and repeat.

    Update: This assumes you can get the lists sorted for free, or at least for cheaper than the n lg n it would take to process the hashes. If that's the case, here's some simple code that demonstrates the idea, using an object to make the idea clearer:

    #!/usr/bin/perl use warnings; use strict; my @a1 = sort qw(foo bar blarn schmark floogle foo blarn); my @a2 = sort qw(flirp schnirp blarn floogle florn flimple flange); my @a3 = sort qw(bar bar floogle bar florn bar bar); my $sml = SortedMultiList->new(\@a1,\@a2,\@a3); my $last; while (defined(my $next = $sml->shift_lowest)) { print $next,"\n" if (!defined($last) or ($last ne $next)); $last = $next; } package SortedMultiList; sub new { my $class = shift; my $self = [@_]; bless $self, $class; } sub shift_lowest { my $self = shift; my($lowest_arr,$lowest); foreach my $a (@$self) { next unless (@$a); if (!defined($lowest) or ($a->[0] lt $lowest)) { $lowest_arr = $a; $lowest = $a->[0]; } } return $lowest_arr ? shift(@$lowest_arr) : undef; }
      You can do slightly better than what you just suggested for a mergesort of k lists (when k is not super-small). If you have k sorted lists, pop the first item off each and put them into a heap, with some indication of which list they came from. Take off the lowest item off the heap, if it's not a duplicate of the item we last saw, put it into the output list. Pop the list this item came from and put it into the heap. Repeat.

      This gives you an O(n log k) mergesort of k lists (where n is the total number of elements in all the lists). Blindly searching through the first elements of each list as you suggest gives an O(nk) algorithm. Of course, when the number of lists is constant, they are both still O(n), but the heap version will have a lower constant (unless k is very small).

      What I fail to understand is why you say that perrin's hash algorithm takes O(n log n) time. No, a hash dereferencing operation takes amortized constant time, and you are just doing a dereference for each element in each list. That's a linear time algorithm.

      I'm afraid perrin won't be able to better than the linear time algorithm he presents (other than small tweaks of the constant, which he explicitly expressed disinterest in). Think about it, how could you remove duplicates in the general case without at least looking at each element? Yes, there are surely some obscene special cases where you might do better at first, but it's a safe bet that you're going to copy the result into an array somewhere which takes O(n) anyway...

      blokhead

        You can do slightly better than what you just suggested for a mergesort of k lists (when k is not super-small).
        Yes, but it's much harder to code off the top of my head. :)

        What I fail to understand is why you say that perrin's hash algorithm takes O(n log n) time. No, a hash dereferencing operation takes amortized constant time, and you are just doing a dereference for each element in each list. That's a linear time algorithm.
        My understanding has always been that hash operations over n items take lg n time, since in general a hash will divide up n items into n / lg n buckets, then do a linear search through the appropriate bucket.

        Some experiments seem to bear this out: as I increase the number of elements in a hash, the number of operations per second I can do decreases. This wouldn't happen if hash operations took constant time.

        100000: 446368.754 inserts/sec, 1050330.051 lookups/sec 200000: 424977.633 inserts/sec, 821176.759 lookups/sec 400000: 398651.394 inserts/sec, 805369.896 lookups/sec 800000: 399463.754 inserts/sec, 787853.768 lookups/sec 1600000: 390381.775 inserts/sec, 766870.015 lookups/sec 3200000: 377860.715 inserts/sec, 720909.739 lookups/sec

        Then again, my recollections of algorithmic amortizations are hazy at best, so feel free to convince me that I'm wrong. :)

      instead of the n lg n that using a hash provides
      You are wrong here. The expected amortized time for hash operations is O(k), where k is the length of the string (it's the time that's needed to compute the hash value). Note the absense of a dependency on the size of a hash. This means that if you do N hash operations (insertions, deletions, fetches) and all your strings are bounded in length by some constant, your expected time is O(N).

      Now, an individual insertion in a hash may take O(n) (where n is the size of the hash). But this typically happens only after Ω(n) insertions (due to the hash filling up and a new hash needs to be build), so that amortizes out.

      An individual search may take Ω(n) time because a fraction of your keys hash to the same bucket. That would lead to a worse case quadratic performance. But the chances of that happening are so small, it doesn't have an impact on the expected running times. And ever since 5.8.1, Perl has some safety mechanisms (turned on by default) that even prevents you from constructing an input sequence that hashes the keys to the same bucket, because that hash function uses a different values on each run.

      So, for all practical purposes, hash operations are done in constant time.

Re: better union of sets algorithm?
by BrowserUk (Pope) on Mar 11, 2005 at 06:39 UTC

    If your values are integers in a sane range (ie. reasonably low) you can save a little time, upto around 60% or so, by using a bit vector instead of a hash:

    #! perl -slw use strict; use Benchmark qw[ cmpthese ]; our $B ||= 32; our $N ||= 1000; our $S ||= 100; our $R ||= $N * $S; our @sets = map{ [ map int rand $R, 1 .. $N ] } 1 .. $S; our( @hUniq, @vUniq ); cmpthese -2, { hash => q[ my %seen; undef @hUniq; @hUniq = grep{ $seen{ $_ }++ } @$_ for @sets; ], vec => q[ my $vector = ''; undef @vUniq; @vUniq = grep{ vec( $vector, $_, $B )++ } @$_ for @sets; ], }; print 'H:'. @hUniq, ' V: '. @vUniq; __END__ P:\test>438536 Rate hash vec hash 2.69/s -- -23% vec 3.47/s 29% -- H:954 V: 954 P:\test>438536 -S=20 Rate hash vec hash 16.4/s -- -41% vec 27.5/s 68% -- H:639 V: 639 P:\test>438536 -S=20 -N=100 Rate hash vec hash 264/s -- -21% vec 336/s 27% -- H:61 V: 61 P:\test>438536 -S=20 -N=2000 Rate hash vec hash 7.33/s -- -36% vec 11.5/s 56% -- H:1407 V: 1407 P:\test>438536 -S=20 -N=2000 -B=16 Rate hash vec hash 7.06/s -- -40% vec 11.8/s 67% -- H:1399 V: 1399

    If you could build and manipulate your sets as bit vectors, by storing your uniq values in an hash as you get them and setting bits in the vectors to represent them, ORing the bit vectors will get your union. You then use the set-bit positions as indexes into your array of unique values to reconstitute the union set.

    It works for any type of values and is very fast provided you can build and work with your sets that way.


    Examine what is said, not who speaks.
    Silence betokens consent.
    Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco.
      Only if they are non-negative integers....

      (Well, theoretically, for any set for which you have a fast, 1-to-1 mapping to the set of non-negative integers, this method will work. For instance, if all your members are negative integers, multiplying all members with -1 gives you positive integers to put inside the bit string)

        Only if they are non-negative integers...
        The positive integers can be mapped 1-to-1 on the entire set of integers (both sets have the same cardinality). You can construct a mapping so that the bit index is unique for any integer. For example:
        my $index = ( $integer >= 0 ) ? 2 * $index : -2 * $index - 1;
        Then $index is non-negative even for positive integers, and non-negative odd for negative integers.

        BTW, there is an interesting idea for Bottle Golf at Aleph-0 on Mathworld.

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

      I had the same thought when reading through Mastering Algorithms with Perl, but I'm actually working with strings here. I can't think of a cheap way to map strings to bits that would work for this. It may be worth trying to keep track of integers for these since I could probably do it cheaply as they are added to my database and then the union could be done this way.

        Yes. The mapping is the crux of the issue.

        If your doing the unions (or intersections, sym.diffs), on a regular basis, then it can be worth the effort of building a uniq index (offline) and replacing your sets of strings with bitvectors mapped against that index.

        You then hold and maintain your sets as bitvectors and all the set manipulations become easy and efficient, except adding (and to lesser extent displaying), which requires mapping.

        The index doesn't need to be ordered in any way, just unique. though ordering them does allow for the use of a binary chop for lookups when adding (or displaying).

        Whether the offline work of mapping can be amortised to effect an overall saving depends on how often your sets change and how often you need to do the unions.


        Examine what is said, not who speaks.
        Silence betokens consent.
        Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco.
Re: better union of sets algorithm?
by demerphq (Chancellor) on Mar 11, 2005 at 08:43 UTC

    I think that you might try something like this:

    1. Modify the keys such that they store both the real key and the set they come from, something like "$key|$set" might work. The values used for $set should be something like "A", "B" etc.. IE, sortable.
    2. Sort the keys from the sets together. (If the lists are already sorted skip this set and just use a merge instead of a scan in the following step.)
    3. Scan the set. If there are duplicate keys from the same set they will be together. Ignore one of them. Count the dupe keys with different set values (they will be inorder, you only need to worry about dupes). If the number of keys is the same as the number of sets then you have a union.

    Whether this is faster will probably come down to data size. For really high volumes a hash approach wont work as it wont be sufficient memory efficient. (Remember a hash will normally have the same number of buckets as the next power of two larger than the number of keys in the hash.) For integers I would use BrowserUk's approach, for strings I would probably use a hash unless the data volume was really high (or i could be guaranteed the data was already in sorted form) and then i would go with something like the above.

    ---
    demerphq

      Thanks, but I think you're talking about an intersection, not a union. I want to get a unique list of all items that are in any set.
        so, for example (to clarify):

        my @a = qw( 1 2 3 3 2 ); my @b = qw( 1 4 5 5 4 ); my @c = qw( 1 6 7 7 6 );


        should give back
        1 2 3 4 5 6 7


        right? basically you could make one big list and find the uniques from that?
        --------------
        It's sad that a family can be torn apart by such a such a simple thing as a pack of wild dogs

        Oh, yeah, i did mean intersection. Union is even easier. Use the same procedure and ignore duplicate lines that are adjacent. This means you never have more than two lines in memory at once.

        ---
        demerphq

Re: better union of sets algorithm?
by JediWizard (Deacon) on Mar 11, 2005 at 15:58 UTC

    This comes from a standard utility library I wrote for myself years ago. uniqStrings will work faster, but if your data contains references they will be broken. uniqValues will work even if your list contains references.

    #_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_ # Function: uniqStrings() # Author: Dave Mueller # Returns: @uniq # Parameters: @array # Description: Returns a unique list of all Strings found in @array # Note: references wil be broken . . . #_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_ sub uniqStrings { my %temp = (); @temp{@_} = (); return keys %temp; } #_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_ # Function: uniqValues() # Author: Dave Mueller # Returns: @uniq # Parameters: @array # Description: Returns a unique list of all values found in @array #_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_ sub uniqValues { my %temp = (); @temp{@_} = (1 .. scalar(@_)); return map({$_[$_]} values %temp); }

    A truely compassionate attitude towards other does not change, even if they behave negatively or hurt you

    —His Holiness, The Dalai Lama

      Thanks, but that's the same as the one I quoted from perlfaq4.
Re: better union of sets algorithm?
by injunjoel (Priest) on Mar 11, 2005 at 20:14 UTC
    Greetings all,
    Final Update!
    So after my first mis-reading of the question (I computed the intersection not the union) and my subsequent code offering, I figured someone out there had to have already written this functionality, and I was right! So after a little research I would suggest List::Compare.

    Two week in the lab can save you two hours in the library...


    -InjunJoel
    "I do not feel obliged to believe that the same God who endowed us with sense, reason and intellect has intended us to forego their use." -Galileo
      Thanks, but that's an intersection, not a union.
Re: better intersection of sets algorithm?
by gam3 (Curate) on Mar 11, 2005 at 20:23 UTC
    This Code is for intersection for union code see Re^3: better union of sets algorithm?

    It seems that you can get a lot of improvement. Here are two tests I ran. The first is using two sets each with about half of /usr/dict/words and the third having the word zoo in it.

    Union wineglasses zoo
    Union wineglasses zoo
                (warning: too few iterations for a reliable count)
             s/iter   sorted unsorted
    sorted     4.34       --     -78%
    unsorted  0.956     354%       --
    Intersection wineglasses zoo
    Intersection wineglasses zoo
    
    The data for this run is
    @a1 = qw (a b c d g);
    @a2 = qw (a b c e d);
    @a3 = qw (a b c f g);
    
    Intersection a b c
    Intersection a b c
               Rate unsorted   sorted
    unsorted  205/s       --     -93%
    sorted   2775/s    1251%       --
    Intersection a b c
    Intersection a b c
    
    So even with a small data set the sorted method is faster. For both methods the sets must have unique elements (as all set do).
    sub intersection_1 { my $count = shift; $count--; my %saw; grep($count == $saw{$_}++, @_); } # This code is based on the SortedMultiList code but works # for an intersection of more than two sets. sub intersection_2 { my $self = [@_]; my ($lowest_arr, $lowest, $next); my @ret = (); my $elements = scalar @_; while (1) { my $count = 0; $lowest = $next; if (!defined($lowest)) { # find min element in queue foreach my $a (@$self) { return @ret unless (@$a); if (!defined($lowest) or ($a->[0] lt $lowest)) { $lowest = $a->[0]; } } } foreach my $a (@$self) { return @ret unless (@$a); if ($a->[0] eq $lowest) { shift @$a; $count++; } else { if (!defined($next) or ($a->[0] lt $next)) { $next = $a->[0]; } } } if ($count == $elements) { push(@ret, $lowest); } $lowest = $next; undef $next; } }
    Here is the Benchmark code. How the argments are passed may have some effect on the speed. intersection_2 modifies the arrays so I do the copies to make the test more fair.
    cmpthese -10, {
        unsorted => q[
            my @x1 = @a1;
            my @x2 = @a2;
            my @x3 = @a3;
            my @out = intersection_1(3, @x1, @x2, @x3);
        ],
        sorted => q[
            my @x1 = @a1;
            my @x2 = @a2;
            my @x3 = @a3;
            my @out = intersection_2(\@x1, \@x2, \@x3);
        ]
    };
    
    The SortedMultiList code is at Re: better union of sets algorithm?.
    A picture is worth a thousand words, but takes 200K.
      Thanks, but it looks like your code is computing the intersection of the sets, not the union.
        Yes, you are correct. How embaracing. The timing remains about the same though.
        Union_1 96235
        Union_2 96235
                    (warning: too few iterations for a reliable count)
                 s/iter   sorted unsorted
        sorted     4.39       --     -66%
        unsorted   1.48     196%       --
        Union_1 96235
        Union_2 96235
        Union a b c d g e f
        Union a b c d e f g
                   Rate unsorted   sorted
        unsorted  202/s       --     -92%
        sorted   2511/s    1144%       --
        Union a b c d g e f
        Union a b c d e f g
        
        sub union_1 { my $count = shift; $count--; my %saw; grep(0 == $saw{$_}++, @_); } sub union_2 { my $self = [@_]; my ($lowest_arr, $lowest, $next); my @ret = (); my $elements = scalar @_; while (1) { my $count = 0; my $work = 0; my $lowest; foreach my $a (@$self) { next unless (@$a); if (!defined($lowest) or ($a->[0] le $lowest)) { $lowest = $a->[0]; } } foreach my $a (@$self) { next unless (@$a); if ($a->[0] eq $lowest) { shift @$a; $work++; } } return @ret unless $lowest; push(@ret, $lowest); } }
        A picture is worth a thousand words, but takes 200K.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://438536]
Approved by cLive ;-)
Front-paged by deibyz
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others imbibing at the Monastery: (8)
As of 2014-04-19 22:43 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (485 votes), past polls