Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Re: To Hash or to Array--Uniqueness is the question.

by Nkuvu (Priest)
on Dec 01, 2005 at 23:25 UTC ( #513481=note: print w/ replies, xml ) Need Help??


in reply to To Hash or to Array--Uniqueness is the question.

I'm surprised no one has mentioned order yet. If you want unique values but you also want to preserve the order of the values, a simple hash isn't going to work. I would store the items into both structures, personally. Maybe not the most efficient way to do it, but set the value in the hash (keeping count if you like, as mentioned previously) and pushing onto an array if the hashed item didn't already exist. Something like:

#!/usr/bin/perl use strict; use warnings; my (%found, @ordered_uniques); while (my $line = <DATA>) { chomp $line; next if $line !~ /\w/; # Skip non-word lines if (not exists $found{$line}) { push @ordered_uniques, $line; } $found{$line}++; } print "Ordered unique values:\n"; print (join ", ", @ordered_uniques); print "\n\nUnordered unique values:\n"; print (join ", ", keys %found); __DATA__ one foo bar one baz two quack baz
which outputs the expected:
Ordered unique values: one, foo, bar, baz, two, quack Unordered unique values: bar, baz, one, quack, foo, two

So "best practice" depends on your application.


Comment on Re: To Hash or to Array--Uniqueness is the question.
Select or Download Code
Re^2: To Hash or to Array--Uniqueness is the question.
by shemp (Deacon) on Dec 02, 2005 at 01:26 UTC
    I also use this type of solution under certain circumstances. The Tie::IxHash module (tied hash) takes care of storing it both ways, it keeps track of the insertion order into the hash

    I use the most powerful debugger available: print!
Re^2: To Hash or to Array--Uniqueness is the question.
by Moron (Curate) on Dec 02, 2005 at 16:19 UTC
    A simple hash can be made to maintain order and since zero extra lines are used in the foregoing example (just a couple of inline additions), the overheads of interfacing to a module seem way too much. One should not reach for the nearest module the minute the going gets the slightest bit non-trivial - rather later in the thought process perhaps - it also seems appropriate to encourage people to think rather than be mentally lazy.
    my %h = (); while (<$fh>) { $h{ $_ } ||= $.; } close $fh; print sort { $h{a} <=> $h{b} } keys %h;

    -M

    Free your mind

      On the contrary, you're sorting the keys of the hash, but you lost the order of the input. Given the sample data I used in my script (one foo bar one baz two quack baz) the sort method prints out bar, baz, one, quack, foo, two. Of course if one wanted the sorted order then the use of the array is pointless. But if one is looking for the order of the input data, an array (or a module) is, as far as I know, necessary.
        No, he's using the values of the hash to store the intended order, and sorting on the values instead of the keys. The downside is that instead of just keeping the order around, we're spending time sorting to get the order back.
      While i agree that this solution works, it only works in the very isolated situation where you are reading the keys from a file and that you get $. effectively for free (not really, but Perl is going to manipulate $. anyway)

      But in general this solution does not work directly. If the hash is being populated in a different way than by reading the keys from a file then $. is lost.

      Sure you could pick some other variable to be the order count for the keys, something like:

      my %h; my $h_order = 0; while ( my $key = get_key_from_somewhere() ) { if ( ! exists $h{$_} ) { $h{$_} = $h_order++; } }
      But then $h_order will need to be passed around with %h if any other values are going to be assigned, not to mention always performing the additional logic whenever assignments occur. If %h is going to be used in some function that i don't have control of the internals then the order is lost.

      I still stand by my Tie::IxHash solution because with it, other code that uses %h doesn't need to know anything special about it, it just operates like it was set up to operate.

      As for the overhead of tieing, it is rather small. Perhaps a 5-10% decrease in performance over using an untied hash, but the untied hash doesn't give the needed usage, so in implementing the functionality you need, you are losing the performance gain from not using the module.


      I use the most powerful debugger available: print!
        I agree that if $. isn't available you have to use a separate variable, which is hardly a big issue. But my criticism of tying was more a question of avoiding over-engineering than anything else. In addition, non-core modules are effectively prohibited for all the large corporate production environments I've ever worked in. Why 'effectively' rather than explicitly is another tale.

        -M

        Free your mind

Re^2: To Hash or to Array--Uniqueness is the question.
by choedebeck (Beadle) on Dec 02, 2005 at 19:19 UTC
    Order is my general rule of thumb. I usually ask myself, "Does the order matter?" If I answer yes, I usually go with an array, if I answer no, I usually go with a hash.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://513481]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others pondering the Monastery: (4)
As of 2014-11-22 21:50 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My preferred Perl binaries come from:














    Results (125 votes), past polls