
delete duplicate hash value's

by juo (Curate)
on Dec 06, 2002 at 01:51 UTC (#217964=perlquestion)
juo has asked for the wisdom of the Perl Monks concerning the following question:

Does anybody have an idea for an easy way to delete a hash key the moment its values duplicate those stored under another key? In the example below you can see that Pieter Test and Sally Cummings have the same values. When all the values under two keys are the same, one of the keys can be removed, either Sally Cummings or Pieter Test.

%students = ();
# e-mail values appear truncated in the original post
$students{"Nick Plato"} = {
    "year"  => "2",
    "GPA"   => "2.5",
    "major" => "Phys. Ed.",
    "email" => "nplato",
};
$students{"Mary Pitts"} = {
    "year"  => "4",
    "GPA"   => "4",
    "major" => "Economics",
    "email" => "mpitts",
};
$students{"Sally Cummings"} = {
    "year"  => "1",
    "GPA"   => "3.3",
    "major" => "Undecided",
    "email" => "scummings",
};
$students{"Pieter Test"} = {
    "year"  => "1",
    "GPA"   => "3.3",
    "major" => "Undecided",
    "email" => "scummings",
};

Replies are listed 'Best First'.
Re: delete duplicate hash value's
by BrowserUk (Pope) on Dec 06, 2002 at 02:28 UTC

    If you're not choosy about which duplicates get deleted (or rather, which remain), you could try something like this.

    Effectively a variation on the standard idiom used for weeding duplicates from an array, it creates another hash from the values of the second level hashes and uses that to determine if a duplicate has been seen yet.

    my %seen;
    for my $key (keys %students) {
        my $value_key = "@{[values %{$students{$key}}]}";
        if (exists $seen{$value_key}) {
            delete $students{$key};
        }
        else {
            $seen{$value_key}++;
        }
    }
    undef %seen;
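
    For reference, the array idiom mentioned above is commonly written with grep and a %seen hash. This is only a minimal sketch; the contents of @list are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical input list containing repeats.
my @list = qw(a b a c b);

# Keep an element only the first time it is seen:
# $seen{$_}++ is false (0) on first sight and true afterwards.
my %seen;
my @unique = grep { !$seen{$_}++ } @list;

print "@unique\n";    # prints "a b c"
```

    The hash-of-hashes version above applies the same test, but to a string built from each record's values instead of to the element itself.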


      In creating its $value_key, this solution assumes that values() returns the values in the same order, relative to the key names, for each of the student records. I don't think that Perl guarantees that behavior.
        I don't think that Perl guarantees that behavior.

        In general, hashes are iterated in bucket order (and Perl does guarantee that values are returned in the same order as keys). Bucket order is a function of the hashing algorithm used, which is fixed in Perl. Even with the hash randomisation fix for the "algorithm complexity attack" on Perl's hashes--which changes the initialisation values used by the hashing algorithm--the ordering is guaranteed to remain the same for any given run of the program, which is all that is required of the code above.

        Essentially, if a hash contains the same keys, keys (and therefore values) will return them in the same order, regardless of the order they were inserted in. This can be demonstrated to be so:

        #! perl -sw
        use 5.010;
        use strict;
        use List::Util qw[ shuffle ];

        our $I ||= 1e6;

        sub genHash {
            my %hash;
            @hash{ shuffle 'a'..'d' } = 1 .. 4;
            return \%hash;
        }

        my $datum = join ' ', keys %{ genHash() };
        warn $datum . "\n";

        for my $i ( 1 .. $I ) {
            my $test = join ' ', keys %{ genHash() };
            die "test failed after $i iters: $datum vs. $test\n" unless $datum eq $test;
        }

        say "Test passed for $I iterations";
        __END__
        C:\test>junk2
        c a b d
        Test passed for 1000000 iterations

        However, there is a caveat to this that obviously did not occur to me back in the day. Whilst the iteration order is independent of the insertion order, it is dependent upon the number of buckets in the hash.

        That is, if the hashes being compared contain the same keys--and have never contained any other keys--their iteration orders will be the same. But, if the hashes have different numbers of buckets; if for example, one of them has previously contained more keys some of which have subsequently been deleted; then their iteration orderings will differ:

        @hashA{ 'a'..'d' } = 1..4;;
        @hashB{ 'a'..'j' } = 1 .. 10;;
        delete $hashB{ $_ } for 'e' .. 'j';;

        print scalar %hashA;;
        4/8
        print scalar %hashB;;
        4/16

        print join ' ', keys %hashA;;
        c a b d
        print join ' ', keys %hashB;;
        a d c b

        So, whilst this is unlikely to have affected the OP's application, for general use it would be better to take the values in sorted key order. It would also be better to use a join delimiter that is not going to occur in the values being concatenated:

        my $value_key = join $;, @{ $students{ $key } }{ sort keys %{ $students{ $key } } };
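
        Put together, the sorted-key approach might look like the sketch below. The helper names record_signature and delete_duplicate_records are hypothetical, the sample e-mail values are placeholders, and the signature here also includes the field names, which is a small variation on the line above. As far as I know, perl 5.18 and later give each hash its own iteration order, so two hashes holding the same keys can no longer be expected to agree; sorting the keys sidesteps that as well.

```perl
use strict;
use warnings;

# Hypothetical sample data modelled on the question; e-mail values are placeholders.
my %students = (
    'Nick Plato'     => { year => 2, GPA => '2.5', major => 'Phys. Ed.', email => 'nplato' },
    'Mary Pitts'     => { year => 4, GPA => '4',   major => 'Economics', email => 'mpitts' },
    'Sally Cummings' => { year => 1, GPA => '3.3', major => 'Undecided', email => 'scummings' },
    'Pieter Test'    => { year => 1, GPA => '3.3', major => 'Undecided', email => 'scummings' },
);

# Build a signature from field names and values in sorted field order, so the
# result does not depend on the iteration order of the inner hash.
sub record_signature {
    my ($record) = @_;
    return join $;, map { ( $_ => $record->{$_} ) } sort keys %$record;
}

# Delete every record whose signature has already been seen. Sorting the outer
# keys makes the survivor predictable: the first name in sort order is kept.
sub delete_duplicate_records {
    my ($records) = @_;
    my %seen;
    for my $name ( sort keys %$records ) {
        my $signature = record_signature( $records->{$name} );
        delete $records->{$name} if $seen{$signature}++;
    }
    return $records;
}

delete_duplicate_records( \%students );
print "$_\n" for sort keys %students;
```

        With this sample data, Pieter Test sorts before Sally Cummings, so Sally Cummings is the record that gets deleted.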

Re: delete duplicate hash value's
by Enlil (Parson) on Dec 06, 2002 at 02:22 UTC
    How are you initially populating the hash? Surely you are not just typing all the hash values into the script to populate it. It is at this point that I would be ensuring that "duplicate" things never make it into the hash. For instance, if it were just a comma-delimited ASCII file, I would probably have another hash that collects the information that needs to be unique, concatenate it all together to form the second hash's keys, and then check against that before allowing the data to be placed in the initial %students hash. I would do something similar if it were coming from a database, or another source.
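
    A rough sketch of that load-time check, assuming one "name,year,GPA,major,email" record per line with no embedded commas. The input lines, the field list and the load_unique_students helper are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical input: one comma-delimited record per line.
my @lines = (
    'Nick Plato,2,2.5,Phys. Ed.,nplato',
    'Mary Pitts,4,4,Economics,mpitts',
    'Sally Cummings,1,3.3,Undecided,scummings',
    'Pieter Test,1,3.3,Undecided,scummings',
);

# Field names for everything after the student name (assumed layout).
my @fields = qw(year GPA major email);

# Build the hash of hashes, skipping any record whose non-name fields
# have already been stored under another name.
sub load_unique_students {
    my @input = @_;
    my ( %students, %seen );
    for my $line (@input) {
        my ( $name, @values ) = split /,/, $line;
        my $signature = join $;, @values;
        next if $seen{$signature}++;    # duplicate: never stored
        @{ $students{$name} }{@fields} = @values;
    }
    return \%students;
}

my $students = load_unique_students(@lines);
print "$_\n" for sort keys %$students;
```

    Here the first record seen wins, so Sally Cummings is kept and Pieter Test never enters the hash.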

    If you give us an idea of how the hash is initially populated, I think some monk will be better able to answer your question. Otherwise you can just take all the hash values and compare them against each other for each student and delete the ones that match exactly.
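
    The brute-force comparison could be sketched roughly as follows. The helpers same_record and delete_matching_records are hypothetical names, the sample data is cut down for brevity, and the code assumes flat records with defined values.

```perl
use strict;
use warnings;

# Hypothetical, cut-down sample data.
my %students = (
    'Nick Plato'     => { year => 2, GPA => '2.5', major => 'Phys. Ed.' },
    'Mary Pitts'     => { year => 4, GPA => '4',   major => 'Economics' },
    'Sally Cummings' => { year => 1, GPA => '3.3', major => 'Undecided' },
    'Pieter Test'    => { year => 1, GPA => '3.3', major => 'Undecided' },
);

# True if two flat hashes have the same keys and the same string values.
sub same_record {
    my ( $x, $y ) = @_;
    return 0 unless scalar( keys %$x ) == scalar( keys %$y );
    for my $field ( keys %$x ) {
        return 0 unless exists $y->{$field} && $x->{$field} eq $y->{$field};
    }
    return 1;
}

# Compare every pair of records; for each exact match, delete the name
# that comes later in sort order.
sub delete_matching_records {
    my ($records) = @_;
    my @names = sort keys %$records;
    for my $i ( 0 .. $#names ) {
        next unless exists $records->{ $names[$i] };
        for my $j ( $i + 1 .. $#names ) {
            next unless exists $records->{ $names[$j] };
            delete $records->{ $names[$j] }
                if same_record( $records->{ $names[$i] }, $records->{ $names[$j] } );
        }
    }
    return $records;
}

delete_matching_records( \%students );
print "$_\n" for sort keys %students;
```

    This compares every pair, so the work grows with the square of the number of students; for large hashes a %seen-style lookup is likely to be cheaper.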

