Eliminate exact duplicates from array of hashes

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Eliminate exact duplicates from array of hashes by tybalt89 (Monsignor) on Oct 09, 2019 at 17:42 UTC
#!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11107249 use warnings; use List::Util qw(uniq); use Data::Dump qw(dd pp); my @test_data = ( { Tag1 => "1", Tag2 => "a" }, { Tag1 => "1", Tag2 => "a" }, { Tag1 => "1", Tag2 => "b" }, { Tag1 => "1", Tag2 => "c" }, { Tag1 => "1", Tag2 => "a" }, { Tag1 => "2", Tag2 => "a" }, { Tag1 => "2", Tag2 => "d" }, { Tag1 => "2", Tag2 => "a" }, { Tag1 => "3"}, { Tag1 => "sun", Tag2 => "a" }, { Tag1 => "sun", Tag2 => "a" }, ); my @unique = map eval, uniq map { pp $_ } @test_data; dd \@unique; [download] Outputs: `[ { Tag1 => 1, Tag2 => "a" }, { Tag1 => 1, Tag2 => "b" }, { Tag1 => 1, Tag2 => "c" }, { Tag1 => 2, Tag2 => "a" }, { Tag1 => 2, Tag2 => "d" }, { Tag1 => 3 }, { Tag1 => "sun", Tag2 => "a" }, ]` [download]	[reply] [d/l] [select]
Re^2: Eliminate exact duplicates from array of hashes by Anonymous Monk on Oct 09, 2019 at 18:00 UTC
This seems to do exactly what I was looking for. Thank you.	[reply]
Re: Eliminate exact duplicates from array of hashes by haukex (Archbishop) on Oct 09, 2019 at 17:44 UTC
It depends a bit on the data. Can the hashes contain any other nested data structures, or is it really just an array of plain hashes, where the only values are strings? If you could say with absolute certainty that the keys and values will never contain a certain string, such as `"\0"`, in that case you could use that as a string to join the key/value pairs of the hashes for a string comparison. Another option might be to stringify the hashes with Data::Dumper (a core module) and compare those strings (`$Data::Dumper::Sortkeys` needs to be on), although I personally don't think that's a very clean solution. Another (untested) idea might be to serialize the hashrefs with Storable for a more compact representation than what Data::Dumper produces (Fletch's JSON solution and tybalt89's Data::Dump solution are the same kind of idea). And if you want to do it with absolutely no modules at all (which I wouldn't really recommend), then you'll have to code it up in Perl, looping over the hashes to compare them, stepping deeper into the data structure if necessary. So please let us know of some more details of your data structure.	[reply] [d/l] [select]
Re: Eliminate exact duplicates from array of hashes by LanX (Saint) on Oct 09, 2019 at 19:09 UTC
I really like tybalt89's approach here: Re: Eliminate exact duplicates from array of hashes. Unfortunately it requires an eval step... Here a way to avoid it `use strict; use warnings; use Data::Dump qw/pp dd/; my @test_data = ( { Tag1 => "1", Tag2 => "a" }, { Tag1 => "1", Tag2 => "a" }, { Tag1 => "1", Tag2 => "b" }, { Tag1 => "1", Tag2 => "c" }, { Tag1 => "1", Tag2 => "a" }, { Tag1 => "2", Tag2 => "a" }, { Tag1 => "2", Tag2 => "d" }, { Tag1 => "2", Tag2 => "a" }, { Tag1 => "3"}, { Tag1 => "sun", Tag2 => "a" }, { Tag1 => "sun", Tag2 => "a" }, ); my %seen; my @unique = grep { not $seen{pp $_}++ } @test_data; #pp \%seen; pp \@unique;` [download] Sadly uniq doesn't offer to provide an optional block (analog to `sort` ) to emulate this behavior with code like `uniq { pp $_ } @test_data` Please be aware of possible side effects when having circular data. Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery FootballPerl is like chess, only without the dice}	[reply] [d/l] [select]
Re^2: Eliminate exact duplicates from array of hashes by LanX (Saint) on Oct 10, 2019 at 02:28 UTC
Here a version with Data::Dumper which is core. (though Data::Dump is always my first installation) use strict; use warnings; use Data::Dumper; use Test::More; sub uniq_nds{ # uniqe nested data-structures my %seen; local $Data::Dumper::Sortkeys = 1; local $Data::Dumper::Terse = 1; grep { not $seen{Dumper $_}++ } @_; } my @test_data = ( { Tag1 => 1, Tag2 => "a" }, { Tag1 => 1, Tag2 => "a" }, { Tag1 => 1, Tag2 => "b" }, { Tag1 => 1, Tag2 => "c" }, { Tag1 => 1, Tag2 => "a" }, { Tag1 => 2, Tag2 => "a" }, { Tag1 => 2, Tag2 => "d" }, { Tag1 => 2, Tag2 => "a" }, { Tag1 => 3 }, { Tag1 => "sun", Tag2 => "a" }, { Tag1 => "sun", Tag2 => "a" }, ); my $got = [ uniq_nds @test_data ]; my $expected = [ { Tag1 => 1, Tag2 => "a" }, { Tag1 => 1, Tag2 => "b" }, { Tag1 => 1, Tag2 => "c" }, { Tag1 => 2, Tag2 => "a" }, { Tag1 => 2, Tag2 => "d" }, { Tag1 => 3 }, { Tag1 => "sun", Tag2 => "a" }, ]; is_deeply( $got, $expected, 'uniq AoH' ) or diag Dumper $got; done_testing; [download] Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery FootballPerl is like chess, only without the dice}	[reply] [d/l]
Re: Eliminate exact duplicates from array of hashes by Fletch (Bishop) on Oct 09, 2019 at 17:38 UTC
It does work, it's just that you don't understand what it's comparing and why they look different to `uniq`. Each element in your `@test_data` is a separate hashref which will compare differently with another hashref which might have identical contents. For this particular usage with this (relatively small) amount of data you might could kludge something up by serializing items and then checking the serialized version. Update: Ooops, forgot to enable sorting of keys with `canonical(1)`. Updated output as well. I did say it was a kludge . . . Read more... (2 kB) The cake is a lie. The cake is a lie. The cake is a lie.	[reply] [d/l] [select]
Re^2: Eliminate exact duplicates from array of hashes by Anonymous Monk on Oct 09, 2019 at 18:03 UTC
Interesting approach, thank you. However, it does not seem to work. It eliminates only 1 duplicate, while there are much more exact duplicates in my array (I guess it has to do with the order of the couples key/values inside the hash).	[reply]
Re: Eliminate exact duplicates from array of hashes by NetWallah (Canon) on Oct 09, 2019 at 19:22 UTC
Using core modules only, and assuming a flat hashref (i.e. not nested), The following should be faster than other methods that do external serializing: `my %unique = map { my @k=sort(keys %$_); join("",@k, @$_{@k}) => $_ } @test_data; print Dumper [values %unique];` [download] "From there to here, from here to there, funny things are everywhere." -- Dr. Seuss	[reply] [d/l]
Re^2: Eliminate exact duplicates from array of hashes by LanX (Saint) on Oct 09, 2019 at 19:32 UTC
really? `use strict; use warnings; use Data::Dumper; my @test_data = ( { a=>1, b=>2 }, { ab => 12} ); my %unique = map { my @k=sort(keys %$_); join("",@k, @$_{@k}) => $_ } @test_data; print Dumper [values %unique];` [download] `$VAR1 = [ { 'ab' => 12 } ];` [download] updates replacing "" in `join($; ,@k, @$_{@k})` solves this issue * please note that other solutions kept the order *) in most cases Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery FootballPerl is like chess, only without the dice}	[reply] [d/l] [select]
Re^3: Eliminate exact duplicates from array of hashes by NetWallah (Canon) on Oct 09, 2019 at 21:36 UTC
OK - more robust version: use strict; use warnings; use Digest::MD5; use Data::Dumper; my @test_data = ( { Tag1 => "1", Tag2 => "a" }, { Tag1 => "1", Tag2 => "a" }, { Tag1 => "1", Tag2 => "b" }, { Tag1 => "1", Tag2 => "c" }, { Tag1 => "1", Tag2 => "a" }, { Tag1 => "2", Tag2 => "a" }, { Tag1 => "2", Tag2 => "d" }, { Tag1 => "2", Tag2 => "a" }, { Tag1 => "3"}, { Tag1 => "sun", Tag2 => "a" }, { Tag1 => "sun", Tag2 => "a" }, ); my @test2=( { a=>1, b=>2 }, { ab => 12} ); # for performance, MD5 is fastest. my %unique = map {my @k=sort(keys %$_); my $ctx=Digest::MD5->new() or die "Cannot make MD5 o +bj"; $ctx->add($_ . $;) for @k; $ctx->add("=>"); #separator for values $ctx->add($_ . "+") for @$_{@k}; $ctx->digest() => $_ } @test_data; print Dumper [values %unique]; %unique = map {my @k=sort(keys %$_); my $ctx=Digest::MD5->new() or die "Cannot make MD5 ob +j"; $ctx->add($_ . $;) for @k; $ctx->add("=>"); #separator for values $ctx->add($_ . "+") for @$_{@k}; $ctx->digest() => $_ } @test2; print Dumper [values %unique]; [download] "From there to here, from here to there, funny things are everywhere." -- Dr. Seuss	[reply] [d/l]
Re: Eliminate exact duplicates from array of hashes by johngg (Canon) on Oct 09, 2019 at 22:29 UTC
A serialization approach:- use 5.026; use warnings; use Data::Dumper; my @test_data = ( { Tag1 => q{1}, Tag2 => q{a} }, { Tag1 => q{1}, Tag2 => q{a} }, { Tag1 => q{1}, Tag2 => q{b} }, { Tag1 => q{1}, Tag2 => q{c} }, { Tag1 => q{1}, Tag2 => q{a} }, { Tag1 => q{2}, Tag2 => q{a} }, { Tag1 => q{2}, Tag2 => q{d} }, { Tag1 => q{2}, Tag2 => q{a} }, { Tag1 => q{3} }, { Tag1 => q{sun}, Tag2 => q{a} }, { Tag1 => q{sun}, Tag2 => q{a} }, ); my @unique = do { my %seen; map { $_->[ 1 ] } grep { ! $seen{ $_->[ 0 ] } ++ } map { my $rhItem = $_; [ ( join qq{\x00}, map { join qq{\x00}, $_, $rhItem->{ $_ } } sort keys %{ $rhItem } ), $rhItem ] } @test_data; }; print Data::Dumper ->new( [ \ @unique ], [ qw{ *unique } ] ) ->Sortkeys( 1 ) ->Dumpxs(); [download] The output:- `@unique = ( { 'Tag1' => '1', 'Tag2' => 'a' }, { 'Tag1' => '1', 'Tag2' => 'b' }, { 'Tag1' => '1', 'Tag2' => 'c' }, { 'Tag1' => '2', 'Tag2' => 'a' }, { 'Tag1' => '2', 'Tag2' => 'd' }, { 'Tag1' => '3' }, { 'Tag1' => 'sun', 'Tag2' => 'a' } );` [download] I hope this is of interest. Cheers, JohnGG	[reply] [d/l] [select]
Re^2: Eliminate exact duplicates from array of hashes by NetWallah (Canon) on Oct 10, 2019 at 01:31 UTC
This algorithm suffers from the same issue that LanX pointed out. The following data defeats the de-dup: `my $null = qq{\x00}; my @test_data = ( { "a${null}1${null}b"=>"2" }, { a => 1, b => 2} );` [download] My second program gives the correct results, although it too could be defeated by sufficiently crafted data. "From there to here, from here to there, funny things are everywhere." -- Dr. Seuss	[reply] [d/l]
Re: Eliminate exact duplicates from array of hashes by Anonymous Monk on Oct 09, 2019 at 17:59 UTC
Thank you for the feedback (and for confirming there is not a standard/build-in solution to do this). My `@test_data` should contain only hashes as in my example (and probably with a bit of cleaning in the pre-process) always with the two (identical) keys. The values of the hashes should any digit/literal (words and so on), possibly containing also characters such as ,\|'". (unicode strings). The size of `@test_data` can be big (100.000 hashes), but performance (time) is not an issue in this context.	[reply] [d/l] [select]
Re^2: Eliminate exact duplicates from array of hashes by huck (Prior) on Oct 09, 2019 at 18:44 UTC
Now knowing more about the problem i might try use strict; use warnings; use Data::Dumper; my @test_data = ( { Tag1 => "1", Tag2 => "a" }, { Tag1 => "1", Tag2 => "a" }, { Tag1 => "1", Tag2 => "b" }, { Tag1 => "1", Tag2 => "c" }, { Tag1 => "1", Tag2 => "a" }, { Tag1 => "2", Tag2 => "a" }, { Tag1 => "2", Tag2 => "d" }, { Tag1 => "2", Tag2 => "a" }, { Tag1 => "3"}, { Tag1 => "sun", Tag2 => "a" }, { Tag1 => "sun", Tag2 => "a" }, ); my %found; my @unique; for my $grp (@test_data) { my $t1=$grp->{Tag1}//''; my $t2=$grp->{Tag2}//''; next if ($found{$t1}{$t2}); push @unique,$grp; $found{$t1}{$t2}=1; } print Dumper \@unique; [download] as it saves the cost of serializing and the cost of repeated storage of the characters Tag1/Tag2 at the expense of only checking the two keys AND representing the undefined key value as a zero length character scalar. `$VAR1 = [ { 'Tag1' => '1', 'Tag2' => 'a' }, { 'Tag2' => 'b', 'Tag1' => '1' }, { 'Tag1' => '1', 'Tag2' => 'c' }, { 'Tag1' => '2', 'Tag2' => 'a' }, { 'Tag2' => 'd', 'Tag1' => '2' }, { 'Tag1' => '3' }, { 'Tag1' => 'sun', 'Tag2' => 'a' } ];` [download]	[reply] [d/l] [select]


good chemistry is complicated, and a little bit messy -LW
	PerlMonks

Eliminate exact duplicates from array of hashes

updates