Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Idiom: hashes as sets

by w-ber (Hermit)
on Jul 02, 2008 at 18:21 UTC ( #695171=perlmeditation: print w/ replies, xml ) Need Help??

Dear monks,

recently I have found myself wanting to use sets, i.e. collections of unique values. I could use a CPAN package for it, but more often than not I simply write

my %set = (); my @new_items = ...; @set{@new_items} = (1) x @new_items;

Deleting items from the set is of course

delete @set{@other_items};

I don't know how common the idiom is. It may be inefficient with large sets and arrays, and the syntax is a bit peculiar for adding things to the set, but it saves conceptual effort and reduces running time compared to sorting and removing duplicate items later. You can also always reduce the set (hash) to an array with [keys %set].

Are there better ways to do it? Here, better can be any one of conceptually or syntactically clearer, less memory wasted in storing the items, or less memory wasted in adding items to the set.

A further advantage of the above idiom is that you can compute set difference and symmetric difference easily:

# Difference. sub difference { my %diff = %{ $_[0] }; delete @diff{ keys %{ $_[1] } }; \%diff; } # Symmetric difference. sub symmetric_difference { my %symm = (); my @set1 = keys %{ difference($_[0], $_[1]) }; my @set2 = keys %{ difference($_[1], $_[0]) }; @symm{@set1, @set2} = (1) x (@set1 + @set2); \%symm; }

An obvious problem with the idiom is auto-vivication, but as long as you're careful and consistent, it won't bite you.

--
print "Just Another Perl Adept\n";

Comment on Idiom: hashes as sets
Select or Download Code
Re: Idiom: hashes as sets
by perrin (Chancellor) on Jul 02, 2008 at 18:39 UTC
    One tweak to save memory: use undef instead of 1 for the value and use exists to check for an item in the set. It makes things harder to read though.

      I agree, and note what this means for inserting a list:

      my %set; @set{qw(I agree)} = (); print sort keys %set; # Iagree

      Just didn't want you to say

      @set{@list} = (undef) x scalar @list;

      even though that would work.

        I don't understand the reply. Could you, Narveson, please rephrase?
      use !undef and get the best of both worlds :)(tye learn me this)
        Whoa, 1 and undef are on par , optimization?
        #!/usr/bin/perl -- use Devel::Size qw(size total_size); use strict; use warnings; { my %h; for ( 1 .. 20_000 ){ $h{1}{$_}=1; $h{'$_'}{$_}=$_; $h{'undef'}{$_}=undef; $h{'!undef'}{$_}= !undef; $h{'!!undef'}{$_}= !!undef; $h{'!!!undef'}{$_}= !!!undef; $h{'sv_yes'}{$_}= !0; $h{'sv_no'}{$_}= !1; } printf "%10s %10s %10s\n", 'value', 'size', 'total_size'; printf "%10s %10d %10d\n",$_, size($h{$_}),total_size($h{$_}) for +sort keys %h; }
        ActivePerl Build 813, perl, v5.8.7 built for MSWin32-x86-multi-thread
        value size total_size !!!undef 660026 1420026 !!undef 660026 1400026 !undef 660026 1420026 $_ 660026 1328920 1 660026 980026 sv_no 660026 1400026 sv_yes 660026 1420026 undef 660026 900026
        strawberry-perl-5.10.0.1-1.exe , v5.10.0 built for MSWin32-x86-multi-thread
        value size total_size !!!undef 660014 1540014 !!undef 660014 1540014 !undef 660014 1540014 $_ 660014 1616018 1 660014 980014 sv_no 660014 1540014 sv_yes 660014 1540014 undef 660014 980014
Re: Idiom: hashes as sets
by dragonchild (Archbishop) on Jul 02, 2008 at 21:27 UTC
    You're going to run into problems with non-scalar items, such as objects and hashrefs and listrefs. This is why you want to use a CPAN module. This is also true with the idiomatic (but wrong) unique'ing function that uses a hash. Much better is to use the uniq() function from List::MoreUtils which has been fixed to handle these issues.

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

      Ok, you've piqued my curiosity ... I've looked both at the perl and the XS versions of uniq, and can't discern the difference between it and the idiomatic my %seen; @uniq = grep {not $seen{$_}++} @random_values. Though, I have to grant that it took me an inordinate amount of time to figure out the difference in the XS code between the scalar context codepath and the list context codepath, so I could easily be missing a subtle nuance somewhere. Care to be more explicit in which way the idiomatic method is wrong?

        Here's uniq from List::MoreUtils:

        sub uniq (@) { my %h; map { $h{$_}++ == 0 ? $_ : () } @_; }

        I think this is equivalent to your "my %h; @uniq = grep { ! $h{$_}++ } @in;". What's important about both of them is that they don't do this:

        sub bad_uniq { my %h; @h{@in} = (); return keys %h; }

        ...which may be more succinctly (idiomatically) written as "keys %{{ map { $_ => 1 } @in }}".

        The difference is that keys will not return what was in the list to begin with but rather whatever those things come out as after being stringified. The solutions using map and grep both will stringify the stuff you feed them, but it's the stuff that's returned, and not the stringy leftovers that were used to tell which were duplicates.

        Update with a demonstration:

        use List::MoreUtils 'uniq'; use Data::Dumper; my $aref = [ 1, 2 ]; my @aref_duplicated = ( $aref, $aref, $aref ); my @u1 = uniq( @aref_duplicated ); my @u2 = keys %{{map {$_=>1} @aref_duplicated}}; print Data::Dumper->Dump( [ \@u1, \@u2 ], [ '*from_uniq', '*from_keys' ] ); __END__ @from_uniq = ( [ 1, 2 ] ); @from_keys = ( 'ARRAY(0x8153c28)' );

      This is why you want to use a CPAN module if the items in your set are not strings or numbers.

      If they are and you do not need any advanced set operations, then a module would be an overkill.

        Unless, of course, the following reasons apply:
        • I want to future-proof myself against changing my requirements for membership.
        • I want to future-proof myself against changing my requirements for operations performed on my set.
        • I want to not worry about bugs in my code, but would rather use battle-tested code.

        My criteria for good software:
        1. Does it work?
        2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: Idiom: hashes as sets
by TGI (Vicar) on Jul 03, 2008 at 01:30 UTC

    This is a great idea. In fact, its such a great idea that its been done many times before. :) You've definitely hit on a useful technique.

    This is pretty similar to recipe 4.8 in the Perl Cookbook 1st edition.

    foreach $e (@a, @b) { $count{$e}++ } foreach $e (keys %count) { push(@union, $e); if ($count{$e} == 2) { push @isect, $e; } else { push @diff, $e; } }

    Here's my implementation with each operation calculated separatly.

    use strict; use warnings; my @one = qw( 1 2 3 5 6 8 10 ); my @two = qw( 2 4 6 8 10 ); print "One = @one\n"; print "Two = @two\n"; print "\n"; my %union; @union{@one, @two} = (); my @union = sort keys %union; print "Union = @union\n"; my (%one, %two); @one{@one} = (); @two{@two} = (); my %intersection = map { exists $one{$_} && exists $two{$_} ? ($_ => undef) : () } @one, @two; my @intersection = sort keys %intersection; print "Intersection = @intersection\n"; my %one_minus_two = map { exists $two{$_} ? () : ($_ => undef) } @one; my @one_minus_two = sort keys %one_minus_two; print "One - Two = @one_minus_two\n"; my %symdiff = map { 1 == ( exists($one{$_}) + exists($two{$_}) ) ? ($_ => undef) : () } @one, @two; my @symdiff = sort keys %symdiff; print "Symdiff = @symdiff\n";

    If I was doing a whole lot of set manipulation I'd look for a library that provides a complete suite of set theoretic operations--like Set::Object or Set::Scalar. For simple uniqueness, I'd probably stick with List::MoreUtils--some of its other functions like 'any' and 'all' offer such great readability improvements to my code that I have come to rely on it.


    TGI says moo

Re: Idiom: hashes as sets
by sundialsvc4 (Monsignor) on Jul 08, 2008 at 18:53 UTC

    (Nod...) It seems pretty common to me.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://695171]
Approved by friedo
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (9)
As of 2014-09-17 21:58 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (100 votes), past polls