http://www.perlmonks.org?node_id=513460

eff_i_g has asked for the wisdom of the Perl Monks concerning the following question:

Pearls of Perl,

When collecting data that you only want unique values from, would you:

1. Use a hash; the keys being what you're interested in and the values being arbitrary;

2. Use an array; filtering it through a routine that makes it unique; or

3. ?

What is "best practice"?

Thank you :)

Replies are listed 'Best First'.
Re: To Hash or to Array--Uniqueness is the question.
by ptum (Priest) on Dec 01, 2005 at 22:51 UTC
    I will generally use a hash, with the keys reflecting the unique data elements. As I load the hash, I'll increment the value each time I encounter the key -- that way I have a unique set of keys, yet I can know (if I bother to check) about duplicate values in my original data set. This technique is good for when I am handling data that I need to be unique but might contain duplicates -- I can produce a warning or take some other action when I encounter the duplicate key (yet my main process, using the resulting hash keys, is unhampered by duplicates).
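    A minimal sketch of the counting-hash technique described above (the @input data here is made up for illustration):

```perl
use strict;
use warnings;

# Keys are the unique items; values count how often each was seen,
# so duplicates can be detected and warned about as they arrive.
my @input = qw(apple pear apple plum pear apple);

my %seen;
for my $item (@input) {
    # post-increment returns the old value, so this warns on the
    # second and later occurrences only
    warn "duplicate: $item\n" if $seen{$item}++;
}

my @unique    = keys %seen;                         # the unique set (unordered)
my $dup_count = grep { $seen{$_} > 1 } keys %seen;  # how many keys had duplicates
print "unique: ", scalar @unique, ", with duplicates: $dup_count\n";
```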

    No good deed goes unpunished. -- (attributed to) Oscar Wilde
      Thanks everyone! And to clarify, I am just capturing strings. I like your idea of counting ptum, it makes me feel better about using a hash and not "wasting" the value.
Re: To Hash or to Array--Uniqueness is the question.
by samtregar (Abbot) on Dec 01, 2005 at 22:39 UTC
    #1. Any time you're thinking about uniqueness, a hash should be the first option.

    -sam

Re: To Hash or to Array--Uniqueness is the question.
by Nkuvu (Priest) on Dec 01, 2005 at 23:25 UTC

    I'm surprised no one has mentioned order yet. If you want unique values but also want to preserve the order in which they appeared, a simple hash isn't going to work. Personally, I would store the items in both structures. It may not be the most efficient way, but: set the value in the hash (keeping a count if you like, as mentioned previously) and push onto an array if the hashed item didn't already exist. Something like:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my (%found, @ordered_uniques);

    while (my $line = <DATA>) {
        chomp $line;
        next if $line !~ /\w/;    # Skip non-word lines
        if (not exists $found{$line}) {
            push @ordered_uniques, $line;
        }
        $found{$line}++;
    }

    print "Ordered unique values:\n";
    print join(", ", @ordered_uniques);
    print "\n\nUnordered unique values:\n";
    print join(", ", keys %found);

    __DATA__
    one
    foo
    bar
    one
    baz
    two
    quack
    baz
    which outputs the expected:
    Ordered unique values:
    one, foo, bar, baz, two, quack

    Unordered unique values:
    bar, baz, one, quack, foo, two

    So "best practice" depends on your application.

      I also use this type of solution under certain circumstances. The Tie::IxHash module (a tied hash) takes care of storing it both ways: it keeps track of the order in which keys were inserted into the hash.

      I use the most powerful debugger available: print!
      A simple hash can be made to maintain order, and since the foregoing example uses zero extra lines (just a couple of inline additions), the overhead of interfacing with a module seems far too high. One should not reach for the nearest module the minute the going gets the slightest bit non-trivial; that should come later in the thought process, perhaps. It also seems appropriate to encourage people to think rather than be mentally lazy.
      my %h = ();
      while (<$fh>) {
          chomp;
          $h{$_} ||= $.;    # record the line number first seen
      }
      close $fh;
      print join ", ", sort { $h{$a} <=> $h{$b} } keys %h;

      -M

      Free your mind

        While I agree that this solution works, it only works in the very isolated situation where you are reading the keys from a file, so that you get $. effectively for free (not really, but Perl is going to maintain $. anyway).

        But in general this solution does not work directly. If the hash is being populated in a different way than by reading the keys from a file then $. is lost.

        Sure you could pick some other variable to be the order count for the keys, something like:

        my %h;
        my $h_order = 0;
        while ( my $key = get_key_from_somewhere() ) {
            if ( ! exists $h{$key} ) {
                $h{$key} = $h_order++;
            }
        }
        But then $h_order will need to be passed around with %h if any other values are going to be assigned, not to mention always performing the additional logic whenever assignments occur. If %h is going to be used in some function that i don't have control of the internals then the order is lost.

        I still stand by my Tie::IxHash solution because with it, other code that uses %h doesn't need to know anything special about it, it just operates like it was set up to operate.

        As for the overhead of tying, it is rather small: perhaps a 5-10% decrease in performance compared to an untied hash. But the untied hash doesn't give the needed behaviour, so by implementing that functionality yourself you lose the performance you gained by not using the module.
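        For reference, a minimal sketch of the Tie::IxHash approach discussed here (Tie::IxHash is a CPAN module, not core Perl), using the sample data from earlier in the thread:

```perl
use strict;
use warnings;
use Tie::IxHash;    # CPAN module; assumed installed

# The tied hash remembers insertion order, so keys() returns the keys
# in first-seen order. Re-assigning an existing key keeps its position.
tie my %h, 'Tie::IxHash';
$h{$_}++ for qw(one foo bar one baz two quack baz);

print join(", ", keys %h), "\n";    # one, foo, bar, baz, two, quack
```

        Code that receives %h needs no special knowledge of the tie; it reads and writes the hash as usual.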


        I use the most powerful debugger available: print!
        On the contrary, you're sorting the keys of the hash, but you lost the order of the input. Given the sample data I used in my script (one foo bar one baz two quack baz) the sort method prints out bar, baz, one, quack, foo, two. Of course if one wanted the sorted order then the use of the array is pointless. But if one is looking for the order of the input data, an array (or a module) is, as far as I know, necessary.
      Order is my general rule of thumb. I usually ask myself, "Does the order matter?" If I answer yes, I usually go with an array, if I answer no, I usually go with a hash.
Re: To Hash or to Array--Uniqueness is the question.
by ikegami (Patriarch) on Dec 01, 2005 at 22:51 UTC

    Hash inserts are O(1) (or close to), while uniquely inserting into an array is O(N) (unless you insert everything and remove dups later).

    The only reason you'd want something other than a hash is if you wanted something other than the string representation to be unique. And even then, it would be best if you could derive a unique string from the object and still use a hash.

    For example, case-insensitive uniqueness:

    $hash{lc($_)} = $_;

    For example, unique object id:

    $hash{$obj->{id}} = $obj;
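    Expanding the case-insensitive example above into a runnable sketch (the @words data is invented; note this variant keeps the first spelling seen, whereas the assignment above keeps the last):

```perl
use strict;
use warnings;

# Derive a unique string (the lowercased form) from each item and use
# it as the hash key; the item itself passes through unchanged.
my @words = qw(Perl perl PERL Monk monk);

my %seen;
my @unique = grep { !$seen{ lc $_ }++ } @words;

print join(", ", @unique), "\n";    # Perl, Monk
```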
Re: To Hash or to Array--Uniqueness is the question.
by sauoq (Abbot) on Dec 01, 2005 at 22:45 UTC
    When collecting data that you only want unique values from, would you:

    That depends on what your "values" are. If they are just strings, or can be represented in a canonical form as strings, then go ahead and use a hash. If the values in question are something more complex, a hash isn't going to work for you, as the keys of a hash must be strings.

    -sauoq
    "My two cents aren't worth a dime.";
    
      Tie::RefHash can be used to allow references to be used as keys. I use it to allow the use of filehandles as a key, without losing their magic.

        Sure, but in the context of this question—which is about dealing with uniqueness—that's not likely to be very helpful. Unless, of course, the uniqueness of the reference itself is sufficient. But if you have, say, my $a = [1]; my $b = [1]; and want to insert them into a store only if an identical structure doesn't already exist in that store... Tie::RefHash probably won't help because you'll need to do a 'deep' compare.

        -sauoq
        "My two cents aren't worth a dime.";
        
Re: To Hash or to Array--Uniqueness is the question.
by gjb (Vicar) on Dec 01, 2005 at 22:43 UTC

    Or 3. use Set::Scalar which implements a set.

    Just my 2 cents, -gjb-
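    A minimal sketch of the Set::Scalar suggestion (Set::Scalar is a CPAN module, not core Perl; the sample data is invented):

```perl
use strict;
use warnings;
use Set::Scalar;    # CPAN module; assumed installed

# Inserting a duplicate member is a no-op, and set operations
# (union, intersection, difference) come for free.
my $set = Set::Scalar->new;
$set->insert($_) for qw(one foo bar one baz);

print $set->size, "\n";                      # 4
print join(", ", sort $set->members), "\n";  # bar, baz, foo, one
```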

Re: To Hash or to Array--Uniqueness is the question.
by davido (Cardinal) on Dec 02, 2005 at 06:54 UTC

    Using a hash for uniqueness is the canonical means of maintaining a unique set with Perl. However, the 'values' needn't be "arbitrary". There are all sorts of uses for the values, such as:

    • A count of how many times a particular entry has been seen.
    • An anonymous arrayref (or in other words, a HoA - hash of arrays), where the array pertaining to each hash entry contains a list of positions at which a particular entry was found.
    • A priority for a given entry (priority, sort order, timestamp, or whatever other info might be interesting).

    Your second option, "Use an array; filtering it through a routine that makes it unique", will probably still employ a hash as the means of filtering; the array would simply keep the entries in some specific order. But as mentioned above, you can keep track of order by simply using the filter hash's values.
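    A sketch of that last point, keeping input order in the filter hash's values (sample data borrowed from elsewhere in the thread):

```perl
use strict;
use warnings;

# Store the first-seen position as the hash value, then sort the
# keys by that value to recover the original input order.
my @input = qw(one foo bar one baz two quack baz);

my (%pos, $n);
for my $item (@input) {
    $pos{$item} = $n++ unless exists $pos{$item};
}

my @ordered = sort { $pos{$a} <=> $pos{$b} } keys %pos;
print join(", ", @ordered), "\n";    # one, foo, bar, baz, two, quack
```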


    Dave

Re: To Hash or to Array--Uniqueness is the question.
by CountZero (Bishop) on Dec 02, 2005 at 09:02 UTC
    Never reinvent the wheel: use the List::MoreUtils module, which has a uniq function.
    uniq LIST

    Returns a new list by stripping duplicate values in LIST. The order of elements in the returned list is the same as in LIST. In scalar context, returns the number of unique elements in LIST.

    my @x = uniq 1, 1, 2, 2, 3, 5, 3, 4;    # returns 1 2 3 5 4
    my $x = uniq 1, 1, 2, 2, 3, 5, 3, 4;    # returns 5

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: To Hash or to Array--Uniqueness is the question.
by kbrint (Sexton) on Dec 02, 2005 at 08:47 UTC
    my (@items, @unique, %seen);
    for (@items) {
        push @unique, $_ unless $seen{$_}++;
    }
    This has a few benefits:
    • Deterministic ordering.
    • Works for objects and passes them through to @unique without stringification.
Re: To Hash or to Array--Uniqueness is the question.
by vennirajan (Friar) on Dec 02, 2005 at 10:23 UTC

    It's based on your requirement. I always prefer to use a hash, because once you have collected the data there are a lot of handy things you can do with it. For example, you can count the occurrences of each datum ( $hash{$data} += 1 ), so the value tells you the number of occurrences. You can also manipulate the key-value pairs (sort the data by occurrence count, etc.), and later assign the unique data to an array ( @array = keys %hashDataStructure ).

    So, using a hash for uniqueness is always preferable.
    Regards,
    S.Venni Rajan.
    "A Flair For Excellence."
        --- BK Systems.

Re: To Hash or to Array--Uniqueness is the question.
by Ryszard (Priest) on Dec 02, 2005 at 08:28 UTC
    Warning, untested code:
    my %stathash;
    while (<FH>) {
        chomp;               # drop the newline so keys compare cleanly
        $stathash{$_}++;
    }
    Has the extra advantage of counting the number of hits for each unique value.

    You can then do some groovy stuff, like pulling out records which occur n times, records which appear in one set and not another (if you use two hashes for two datasets), or records which appear in both sets (again, two hashes, two datasets).

    I regularly do this with sets of about 500k records to determine where my data integrity issues lie; it's pretty damn fast.
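    A sketch of the two-hash comparison described above (the %a and %b datasets are made up; values are occurrence counts, as in the snippet earlier in this reply):

```perl
use strict;
use warnings;

# Two datasets, each reduced to a counting hash.
my (%a, %b);
$a{$_}++ for qw(red green blue red);
$b{$_}++ for qw(green blue yellow);

my @only_in_a  = grep { !exists $b{$_} } keys %a;   # records unique to set a
my @in_both    = grep {  exists $b{$_} } keys %a;   # records shared by both sets
my @seen_twice = grep { $a{$_} >= 2 } keys %a;      # records occurring n >= 2 times in a

print "only in a: @only_in_a\n";
```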