Find duplicate values in hash

moff has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have a file which looks like this:

Retire a document    Dokument deaktivieren
Remove a document from the knowledge base    Dokument aus der Knowledge Base entfernen
Promote document retirement    Dokument deaktivieren
Document Expired    Dokument abgelaufen

English and translated strings are separated by a tab character. Working from an example I found, I managed to put together the following code to read the file into a hash:

use strict;
use warnings;

open (FILE, "extract_tab.txt") or die "$!\n";

my %hash;
while (my $line = <FILE>) {
    chomp($line);
    my ($enu, $deu) = split /\t/, $line;
    
    $hash{$enu} = $deu;
}

for my $key (keys %hash) {
    print "$key\n";
}
for my $value (values %hash) {
    print "$value\n";
}

As far as I can tell, the hash is created correctly (otherwise please tell me), and the keys are populated by the English strings and the values by the translated ones.

I now want to identify those keys that have identical values, and print them out. Something like "English strings 'Retire a document' and 'Promote document retirement' have the same translation ('Dokument deaktivieren')". I have been trying along the lines of the answers provided to this post, but I am getting stuck.

#my %seen;
for my $key (keys %hash) {
    my $value_key = "@{[values %{$hash{$key}}]}";
#print "$value_key\n";
}

is giving me this error: Can't use string ("Dokument aus der Knowledge Base ") as a HASH ref while "strict refs" in use at hashtest.pl line 18, <FILE> line 4. Could someone help me understand? Or any hints for a different approach if it makes things easier? Thanks.

Re: Find duplicate values in hash by JavaFan (Canon) on Apr 10, 2009 at 17:12 UTC
Something like (untested):

my %reverse;
while (my ($key, $value) = each %hash) {
    push @{$reverse{$value}}, $key;
}
while (my ($key, $value) = each %reverse) {
    next unless @$value > 1;
    local $" = "' and '";
    print "English strings '@$value' have the same translation ('$key')\n";
}
Re: Find duplicate values in hash by FunkyMonk (Chancellor) on Apr 10, 2009 at 17:23 UTC
I'd build a "reverse hash", but since the values in the original hash will be repeated, I'd use arrays for the values in the new hash:

use feature 'say';    # needed for say()
use Data::Dumper;

my %hash = (
    a => 'f',
    b => 'g',
    c => 'f',
    d => 'h',
    e => 'f',
);

my %dups;
while (my ($e, $g) = each %hash) {
    push @{$dups{$g}}, $e;
}
while (my ($g, $e) = each %dups) {
    say "$g is duplicated in @$e" if @$e > 1;
}
print Dumper \%dups;

You seemed unsure as to whether you'd produced a hash in your OP. Data::Dumper¹ is an excellent tool for inspecting arbitrary data structures. Used as above it shows:

$VAR1 = {
          'h' => [
                   'd'
                 ],
          'g' => [
                   'b'
                 ],
          'f' => [
                   'e',
                   'c',
                   'a'
                 ]
        };

[1] Actually, I prefer Data::Dump, but it isn't a core module.

Unless I state otherwise, all my code runs with strict and warnings
Re: Find duplicate values in hash by toolic (Bishop) on Apr 10, 2009 at 17:24 UTC
JavaFan has offered a good alternative approach to solve your problem. I thought that I'd point out a simpler way to display the contents of your hash, using Data::Dumper:

use Data::Dumper;
print Dumper(\%hash);

which prints:

$VAR1 = {
          'Remove a document from the knowledge base' => 'Dokument aus der Knowledge Base entfernen',
          'Document Expired' => 'Dokument abgelaufen',
          'Promote document retirement' => 'Dokument deaktivieren',
          'Retire a document' => 'Dokument deaktivieren'
        };
Re: Find duplicate values in hash by Nkuvu (Priest) on Apr 10, 2009 at 17:22 UTC
I'd be keeping track of the keys when you're assigning to the definition hash. That is, keep a separate hash that groups the keys by their value. There is probably a more efficient way to do this, but some random puttering around while I'm waiting for my work script to finish:

#!/usr/bin/perl
use strict;
use warnings;

my (%hash, %dup_hash);

# Minor tweak to read from DATA rather than a file
while (my $line = <DATA>) {
    chomp($line);
    my ($enu, $deu) = split /\t/, $line;

    $hash{$enu} = $deu;
    # Collect every English string under its translation
    push @{$dup_hash{$deu}}, $enu;
}

for my $key (keys %hash) {
    print "$key\n";
}
for my $value (values %hash) {
    print "$value\n";
}

print "\nDuplicate definitions:\n";
for my $deu (keys %dup_hash) {
    # Only translations shared by more than one English string
    if (scalar @{$dup_hash{$deu}} > 1) {
        for my $en (@{$dup_hash{$deu}}) {
            print "$deu => $en\n";
        }
        print "\n";
    }
}

__DATA__
Retire a document	Dokument deaktivieren
Remove a document from the knowledge base	Dokument aus der Knowledge Base entfernen
Promote document retirement	Dokument deaktivieren
Document Expired	Dokument abgelaufen

Gives the output:

Remove a document from the knowledge base
Document Expired
Promote document retirement
Retire a document
Dokument aus der Knowledge Base entfernen
Dokument abgelaufen
Dokument deaktivieren
Dokument deaktivieren

Duplicate definitions:
Dokument deaktivieren => Retire a document
Dokument deaktivieren => Promote document retirement

Edit: Renamed some of the variables to accurately reflect their contents.

Second edit: Pretty much the same thing as what JavaFan has, just a different syntactical approach.
Re: Find duplicate values in hash by whakka (Hermit) on Apr 10, 2009 at 21:25 UTC
`for my $key (keys %hash) { my $value_key = "@{[values %{$hash{$key}}]}"; }` gives you an error because the `%{...}` in `%{$hash{$key}}` is attempting to treat the value of `$hash{$key}` as a hash ref - but it's really a scalar, as you say you want. So replace `%{$hash{$key}}` with `%hash` to make it work. I see the other post had a double-nested hash, which is why you would need this particular syntax. It also uses `"@{[...]}"` to interpolate the list of values into a single space-separated string stored in a scalar - normally you wouldn't need this and could just say `values %hash`. Hope this helps at all...
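To make the scalar-versus-hash-ref distinction concrete, here is a minimal sketch. It assumes a flat hash shaped like the OP's; the `%nested` hash and its `enu`/`deu` field names are made up for contrast and do not come from the thread.

```perl
use strict;
use warnings;

# Flat hash, as in the OP: every value is a plain string.
my %hash = (
    'Retire a document' => 'Dokument deaktivieren',
    'Document Expired'  => 'Dokument abgelaufen',
);

# ref() returns '' for a plain string, so %{ $hash{$key} } would die
# under "strict refs" with the error quoted in the question.
for my $key (sort keys %hash) {
    my $kind = ref($hash{$key}) || 'plain scalar';
    print "$key => $hash{$key} ($kind)\n";
}

# Hypothetical nested hash: here each value IS a hash ref, so
# dereferencing it with %{ ... } is legal.
my %nested = (
    doc1 => { enu => 'Retire a document', deu => 'Dokument deaktivieren' },
);
for my $key (sort keys %nested) {
    my @inner = map { $nested{$key}{$_} } sort keys %{ $nested{$key} };
    print "$key => @inner\n";
}
```

Checking `ref($hash{$key})` before dereferencing is one quick way to see which of the two shapes you are holding.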
Re^2: Find duplicate values in hash by moff (Novice) on Apr 11, 2009 at 13:25 UTC
Thank you all very much for the quick responses, all of them really helpful and instructive. So far I have tried JavaFan's and Nkuvu's solutions, and they work just fine. (I will need some time to understand exactly how they work though, might come back here if I fail to do so...). Thanks also for the pointers to Data::Dumper and the explanation as to why I was getting that error.
Re^3: Find duplicate values in hash by moff (Novice) on Apr 17, 2009 at 19:17 UTC
Hello again, I have finally used JavaFan's code, changing it slightly to get an output similar to Nkuvu's, and managed to package it with PAR to get a standalone program. I can say it works flawlessly even with very large files with many thousands of lines. Thanks again.

However, I still do not understand it thoroughly. I think I understand everything except `push @{$reverse{$value}}, $key;` What exactly happens here for each pair of keys and values of my first hash? I can see that, after that, in those cases where the value is the same for two or more keys, those two or more keys are contained in one "key array" (?) (one line if I print it out), and subsequently split with `local $"`. But I do not really understand the above line of code. If JavaFan or someone else could explain it with words I would very much appreciate it.
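For anyone stuck on the same line, the following is a minimal sketch of what `push @{$reverse{$value}}, $key;` does, using three of the sample strings from the question. The sub name `invert_hash` is hypothetical and chosen only for illustration; it is not part of JavaFan's code.

```perl
use strict;
use warnings;

# Hypothetical helper: turn key => value pairs into
# value => [ every key that had that value ].
sub invert_hash {
    my (%hash) = @_;
    my %reverse;
    for my $key (sort keys %hash) {
        my $value = $hash{$key};
        # The first time a value is seen, $reverse{$value} does not exist.
        # Dereferencing it as an array with @{ ... } makes Perl create an
        # empty anonymous array there (autovivification); push then
        # appends $key to that array.
        push @{ $reverse{$value} }, $key;
    }
    return %reverse;
}

my %reverse = invert_hash(
    'Retire a document'           => 'Dokument deaktivieren',
    'Promote document retirement' => 'Dokument deaktivieren',
    'Document Expired'            => 'Dokument abgelaufen',
);

# Each value of %reverse is an array ref; its length says how many
# English strings share that translation.
for my $deu (sort keys %reverse) {
    my @enu = @{ $reverse{$deu} };
    print "$deu: ", scalar(@enu), " key(s): ", join(' | ', @enu), "\n";
}
```

As for `local $"`: it does not split anything. `$"` is the separator Perl puts between array elements when an array is interpolated into a double-quoted string, so setting it to `' and '` only changes how `'@$value'` is joined for printing.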

