PerlMonks  

Find duplicate values in hash

by moff (Novice)
on Apr 10, 2009 at 17:01 UTC ( [id://756878]=perlquestion )

moff has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have a file which looks like this:
    Retire a document	Dokument deaktivieren
    Remove a document from the knowledge base	Dokument aus der Knowledge Base entfernen
    Promote document retirement	Dokument deaktivieren
    Document Expired	Dokument abgelaufen
English and translated strings are separated by a tab character. I found this and managed to put together the following code to read the file into a hash:
    use strict;
    use warnings;

    open (FILE, "extract_tab.txt") or die "$!\n";

    my %hash;
    while (my $line = <FILE>) {
        chomp($line);
        my ($enu, $deu) = split /\t/, $line;
        $hash{$enu} = $deu;
    }

    for my $key (keys %hash) {
        print "$key\n";
    }

    for my $value (values %hash) {
        print "$value\n";
    }
As far as I can tell, the hash is created correctly (otherwise please tell me), and the keys are populated by the English strings and the values by the translated ones.

I now want to identify those keys that have identical values, and print them out. Something like "English strings 'Retire a document' and 'Promote document retirement' have the same translation ('Dokument deaktivieren')". I have been trying along the lines of the answers provided to this post, but I am getting stuck.

    #my %seen;
    for my $key (keys %hash) {
        my $value_key = "@{[values %{$hash{$key}}]}";
        #print "$value_key\n";
    }
is giving me this error:
    Can't use string ("Dokument aus der Knowledge Base ") as a HASH ref while "strict refs" in use at hashtest.pl line 18, <FILE> line 4.
Could someone help me understand? Or any hints for a different approach if it makes things easier? Thanks.

Replies are listed 'Best First'.
Re: Find duplicate values in hash
by JavaFan (Canon) on Apr 10, 2009 at 17:12 UTC
    Something like (untested):
        my %reverse;
        while (my ($key, $value) = each %hash) {
            push @{$reverse{$value}}, $key;
        }
        while (my ($key, $value) = each %reverse) {
            next unless @$value > 1;
            local $" = "' and '";
            print "English strings '@$value' have the same translation ('$key')\n";
        }
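A minimal sketch of what the `push @{$reverse{$value}}, $key;` line does, assuming a small in-memory hash instead of the input file. The helper name `group_by_value` and the sorted iteration order are illustrative assumptions, not part of the code above.

```perl
use strict;
use warnings;

# Group the keys of a hash by their value.
# group_by_value is a hypothetical helper name, not from the thread.
sub group_by_value {
    my (%hash) = @_;
    my %reverse;
    for my $key (sort keys %hash) {
        my $value = $hash{$key};
        # The first time a value is seen, $reverse{$value} does not exist.
        # Dereferencing it as an array with @{...} makes Perl create an
        # empty anonymous array there (autovivification); push then
        # appends $key to that array.
        push @{ $reverse{$value} }, $key;
    }
    return %reverse;
}

my %reverse = group_by_value(
    'Retire a document'           => 'Dokument deaktivieren',
    'Promote document retirement' => 'Dokument deaktivieren',
    'Document Expired'            => 'Dokument abgelaufen',
);

for my $deu (sort keys %reverse) {
    my @enu = @{ $reverse{$deu} };
    next unless @enu > 1;
    # $" is the separator Perl places between array elements when an
    # array is interpolated into a string; it joins, it does not split.
    local $" = "' and '";
    print "'@enu' share the translation '$deu'\n";
}
```

Each value of the original hash becomes a key of `%reverse`, and the array stored under it collects every original key that carried that value, so any array with more than one element marks a duplicate translation.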
Re: Find duplicate values in hash
by FunkyMonk (Chancellor) on Apr 10, 2009 at 17:23 UTC
    I'd build a "reverse hash", but since the values in the original hash will be repeated, I'd use arrays for the values in the new hash:
        use feature 'say';    # say is not enabled by default
        use Data::Dumper;

        my %hash = (
            a => 'f',
            b => 'g',
            c => 'f',
            d => 'h',
            e => 'f',
        );

        # Reverse the hash: each value maps to the list of keys that had it
        my %dups;
        while (my ($e, $g) = each %hash) {
            push @{$dups{$g}}, $e
        }

        while (my ($g, $e) = each %dups) {
            say "$g is duplicated in @$e" if @$e > 1;
        }

        print Dumper \%dups;

    You seemed unsure whether you'd produced a hash in your OP. Data::Dumper[1] is an excellent tool for inspecting arbitrary data structures. Used as above it shows:

        $VAR1 = {
                  'h' => [
                           'd'
                         ],
                  'g' => [
                           'b'
                         ],
                  'f' => [
                           'e',
                           'c',
                           'a'
                         ]
                };
    [1] Actually, I prefer Data::Dump but it isn't a core module.

    Unless I state otherwise, all my code runs with strict and warnings
Re: Find duplicate values in hash
by toolic (Bishop) on Apr 10, 2009 at 17:24 UTC
    JavaFan has offered a good alternative approach to solve your problem.

    I thought that I'd point out a simpler way to display the contents of your hash, using Data::Dumper:

        use Data::Dumper;
        print Dumper(\%hash);

    which prints:

        $VAR1 = {
                  'Remove a document from the knowledge base' => 'Dokument aus der Knowledge Base entfernen',
                  'Document Expired' => 'Dokument abgelaufen',
                  'Promote document retirement' => 'Dokument deaktivieren',
                  'Retire a document' => 'Dokument deaktivieren'
                };
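Because hash order is not fixed, the dump may list the entries in a different order on another run. A small sketch, assuming a two-entry sample hash, that uses Data::Dumper's `Sortkeys` and `Indent` settings to get a repeatable, more compact dump:

```perl
use strict;
use warnings;
use Data::Dumper;

# Sort hash keys in the dump so the output is the same on every run.
$Data::Dumper::Sortkeys = 1;
# Indent = 1 uses a fixed, more compact indentation style.
$Data::Dumper::Indent = 1;

# Sample data, assumed here for illustration.
my %hash = (
    'Retire a document'           => 'Dokument deaktivieren',
    'Promote document retirement' => 'Dokument deaktivieren',
);

my $dump = Dumper(\%hash);
print $dump;
```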
Re: Find duplicate values in hash
by Nkuvu (Priest) on Apr 10, 2009 at 17:22 UTC

    I'd be keeping track of the keys when you're assigning to the definition hash. That is, keep a separate hash with duplicate keys. There is probably a more efficient way to do this, but some random puttering around while I'm waiting for my work script to finish:

        #!/usr/bin/perl

        use strict;
        use warnings;

        my (%hash, %dup_hash);

        # Minor tweak to read from DATA rather than a file
        while (my $line = <DATA>) {
            chomp($line);
            my ($enu, $deu) = split /\t/, $line;
            $hash{$enu} = $deu;
            # Keep a list of all duplicate values
            push @{$dup_hash{$deu}}, $enu;
        }

        for my $key (keys %hash) {
            print "$key\n";
        }

        for my $value (values %hash) {
            print "$value\n";
        }

        print "\nDuplicate definitions:\n";
        for my $deu (keys %dup_hash) {
            if (scalar @{$dup_hash{$deu}} > 1) {
                for my $en (@{$dup_hash{$deu}}) {
                    print "$deu => $en\n";
                }
                print "\n";
            }
        }

        __DATA__
        Retire a document	Dokument deaktivieren
        Remove a document from the knowledge base	Dokument aus der Knowledge Base entfernen
        Promote document retirement	Dokument deaktivieren
        Document Expired	Dokument abgelaufen

    Gives the output:

        Remove a document from the knowledge base
        Document Expired
        Promote document retirement
        Retire a document
        Dokument aus der Knowledge Base entfernen
        Dokument abgelaufen
        Dokument deaktivieren
        Dokument deaktivieren

        Duplicate definitions:
        Dokument deaktivieren => Retire a document
        Dokument deaktivieren => Promote document retirement

    Edit: Renamed some of the variables to accurately reflect their contents.
    Second edit: Pretty much the same thing as what JavaFan has, just a different syntactical approach.

Re: Find duplicate values in hash
by whakka (Hermit) on Apr 10, 2009 at 21:25 UTC
        for my $key (keys %hash) {
            my $value_key = "@{[values %{$hash{$key}}]}";
        }
    gives you an error because the %{...} in %{$hash{$key}} is attempting to treat the value of $hash{$key} as a hash ref - but it's really a scalar, as you say you want. So replace %{$hash{$key}} with %hash to make it work. I see the other post had a double-nested hash, which is why you would need this particular syntax.

    It also uses "@{[...]}" to interpolate the list of values into a string, so everything ends up joined in one scalar - normally you wouldn't need this and could just say values %hash. Hope this helps at all...
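A short sketch of the error and of the flat-hash alternatives, assuming a two-entry sample hash. The `eval` wrapper is only there so the failure can be shown without ending the script.

```perl
use strict;
use warnings;

# Sample data, assumed here for illustration.
my %hash = (
    'Retire a document' => 'Dokument deaktivieren',
    'Document Expired'  => 'Dokument abgelaufen',
);

# $hash{$key} holds a plain string, so %{ $hash{$key} } asks Perl to use
# that string as a hash reference. Under "strict refs" this dies at run
# time; eval catches the error so the script can continue.
my ($key) = sort keys %hash;
my $ok    = eval { my @v = values %{ $hash{$key} }; 1 };
my $error = $ok ? '' : $@;
print "Caught: $error";

# With a flat hash, the values are available directly.
my @values = sort values %hash;

# "@{[ ... ]}" interpolates a list into a string, joined by $" (a single
# space by default). It is only needed when one string is wanted.
my $joined = "@{[ sort values %hash ]}";
print "$joined\n";
```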

      Thank you all very much for the quick responses, all of them really helpful and instructive. So far I have tried JavaFan's and Nkuvu's solutions, and they work just fine. (I will need some time to understand exactly how they work though, might come back here if I fail to do so...). Thanks also for the pointers to Data::Dumper and the explanation as to why I was getting that error.
        Hello again, I have finally used JavaFan's code, changing it slightly to get an output similar to Nkuvu's, and managed to package it with PAR to get a standalone program. I can say it works flawlessly even with very large files with many thousands of lines. Thanks again. However, I still do not understand it thoroughly. I think I understand everything except
        push @{$reverse{$value}}, $key;
        What exactly happens here for each pair of keys and values of my first hash? I can see that, after that, in those cases where the value is the same for two or more keys, those two or more keys are contained in one "key array" (?) (one line if I print it out), and subsequently split with local $". But I do not really understand the above line of code. If JavaFan or someone else could explain it with words I would very much appreciate it.

Node Type: perlquestion [id://756878]
Approved by toolic