Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
Hello
Is there a standard (build-in) way in Perl to eliminate exact duplicates from an array of hashes. The following does not work:
use Data::Dumper;
use List::MoreUtils qw(uniq);
my @test_data = (
{ Tag1 => "1", Tag2 => "a" },
{ Tag1 => "1", Tag2 => "a" },
{ Tag1 => "1", Tag2 => "b" },
{ Tag1 => "1", Tag2 => "c" },
{ Tag1 => "1", Tag2 => "a" },
{ Tag1 => "2", Tag2 => "a" },
{ Tag1 => "2", Tag2 => "d" },
{ Tag1 => "2", Tag2 => "a" },
{ Tag1 => "3"},
{ Tag1 => "sun", Tag2 => "a" },
{ Tag1 => "sun", Tag2 => "a" },
);
my @unique = uniq @test_data;
print Dumper \@unique;
Re: Eliminate exact duplicates from array of hashes
by tybalt89 (Monsignor) on Oct 09, 2019 at 17:42 UTC
|
#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11107249
use warnings;
use List::Util qw(uniq);
use Data::Dump qw(dd pp);
my @test_data = (
{ Tag1 => "1", Tag2 => "a" },
{ Tag1 => "1", Tag2 => "a" },
{ Tag1 => "1", Tag2 => "b" },
{ Tag1 => "1", Tag2 => "c" },
{ Tag1 => "1", Tag2 => "a" },
{ Tag1 => "2", Tag2 => "a" },
{ Tag1 => "2", Tag2 => "d" },
{ Tag1 => "2", Tag2 => "a" },
{ Tag1 => "3"},
{ Tag1 => "sun", Tag2 => "a" },
{ Tag1 => "sun", Tag2 => "a" },
);
my @unique = map eval, uniq map { pp $_ } @test_data;
dd \@unique;
Outputs:
[
{ Tag1 => 1, Tag2 => "a" },
{ Tag1 => 1, Tag2 => "b" },
{ Tag1 => 1, Tag2 => "c" },
{ Tag1 => 2, Tag2 => "a" },
{ Tag1 => 2, Tag2 => "d" },
{ Tag1 => 3 },
{ Tag1 => "sun", Tag2 => "a" },
]
| [reply] [d/l] [select] |
|
| [reply] |
Re: Eliminate exact duplicates from array of hashes
by haukex (Archbishop) on Oct 09, 2019 at 17:44 UTC
|
It depends a bit on the data. Can the hashes contain any other nested data structures, or is it really just an array of plain hashes, where the only values are strings? If you could say with absolute certainty that the keys and values will never contain a certain string, such as "\0", in that case you could use that as a string to join the key/value pairs of the hashes for a string comparison. Another option might be to stringify the hashes with Data::Dumper (a core module) and compare those strings ($Data::Dumper::Sortkeys needs to be on), although I personally don't think that's a very clean solution. Another (untested) idea might be to serialize the hashrefs with Storable for a more compact representation than what Data::Dumper produces (Fletch's JSON solution and tybalt89's Data::Dump solution are the same kind of idea). And if you want to do it with absolutely no modules at all (which I wouldn't really recommend), then you'll have to code it up in Perl, looping over the hashes to compare them, stepping deeper into the data structure if necessary.
So please let us know of some more details of your data structure.
| [reply] [d/l] [select] |
Re: Eliminate exact duplicates from array of hashes
by LanX (Saint) on Oct 09, 2019 at 19:09 UTC
|
use strict;
use warnings;
use Data::Dump qw/pp dd/;
my @test_data = (
{ Tag1 => "1", Tag2 => "a" },
{ Tag1 => "1", Tag2 => "a" },
{ Tag1 => "1", Tag2 => "b" },
{ Tag1 => "1", Tag2 => "c" },
{ Tag1 => "1", Tag2 => "a" },
{ Tag1 => "2", Tag2 => "a" },
{ Tag1 => "2", Tag2 => "d" },
{ Tag1 => "2", Tag2 => "a" },
{ Tag1 => "3"},
{ Tag1 => "sun", Tag2 => "a" },
{ Tag1 => "sun", Tag2 => "a" },
);
my %seen;
my @unique = grep { not $seen{pp $_}++ } @test_data;
#pp \%seen;
pp \@unique;
Sadly uniq doesn't offer to provide an optional block (analog to sort ) to emulate this behavior with code like
uniq { pp $_ } @test_data
Please be aware of possible side effects when having circular data.
| [reply] [d/l] [select] |
|
use strict;
use warnings;
use Data::Dumper;
use Test::More;
sub uniq_nds{
# uniqe nested data-structures
my %seen;
local $Data::Dumper::Sortkeys = 1;
local $Data::Dumper::Terse = 1;
grep { not $seen{Dumper $_}++ } @_;
}
my @test_data = (
{ Tag1 => 1, Tag2 => "a" },
{ Tag1 => 1, Tag2 => "a" },
{ Tag1 => 1, Tag2 => "b" },
{ Tag1 => 1, Tag2 => "c" },
{ Tag1 => 1, Tag2 => "a" },
{ Tag1 => 2, Tag2 => "a" },
{ Tag1 => 2, Tag2 => "d" },
{ Tag1 => 2, Tag2 => "a" },
{ Tag1 => 3 },
{ Tag1 => "sun", Tag2 => "a" },
{ Tag1 => "sun", Tag2 => "a" },
);
my $got = [ uniq_nds @test_data ];
my $expected = [
{ Tag1 => 1, Tag2 => "a" },
{ Tag1 => 1, Tag2 => "b" },
{ Tag1 => 1, Tag2 => "c" },
{ Tag1 => 2, Tag2 => "a" },
{ Tag1 => 2, Tag2 => "d" },
{ Tag1 => 3 },
{ Tag1 => "sun", Tag2 => "a" },
];
is_deeply(
$got,
$expected,
'uniq AoH'
) or diag Dumper $got;
done_testing;
| [reply] [d/l] |
Re: Eliminate exact duplicates from array of hashes
by Fletch (Bishop) on Oct 09, 2019 at 17:38 UTC
|
It does work, it's just that you don't understand what it's comparing and why they look different to uniq. Each element in your @test_data is a separate hashref which will compare differently with another hashref which might have identical contents.
For this particular usage with this (relatively small) amount of data you might could kludge something up by serializing items and then checking the serialized version.
Update: Ooops, forgot to enable sorting of keys with canonical(1). Updated output as well. I did say it was a kludge . . .
The cake is a lie.
The cake is a lie.
The cake is a lie.
| [reply] [d/l] [select] |
|
Interesting approach, thank you. However, it does not seem to work. It eliminates only 1 duplicate, while there are much more exact duplicates in my array (I guess it has to do with the order of the couples key/values inside the hash).
| [reply] |
Re: Eliminate exact duplicates from array of hashes
by NetWallah (Canon) on Oct 09, 2019 at 19:22 UTC
|
Using core modules only, and assuming a flat hashref (i.e. not nested),
The following should be faster than other methods that do external serializing:
my %unique = map { my @k=sort(keys %$_);
join("",@k, @$_{@k})
=> $_ }
@test_data;
print Dumper [values %unique];
"From there to here, from here to there, funny things are everywhere." -- Dr. Seuss
| [reply] [d/l] |
|
use strict;
use warnings;
use Data::Dumper;
my @test_data = (
{ a=>1, b=>2 },
{ ab => 12}
);
my %unique = map { my @k=sort(keys %$_);
join("",@k, @$_{@k})
=> $_ }
@test_data;
print Dumper [values %unique];
$VAR1 = [
{
'ab' => 12
}
];
updates
- replacing "" in join($; ,@k, @$_{@k}) solves this issue *
- please note that other solutions kept the order
*) in most cases
| [reply] [d/l] [select] |
|
OK - more robust version:
use strict;
use warnings;
use Digest::MD5;
use Data::Dumper;
my @test_data = (
{ Tag1 => "1", Tag2 => "a" },
{ Tag1 => "1", Tag2 => "a" },
{ Tag1 => "1", Tag2 => "b" },
{ Tag1 => "1", Tag2 => "c" },
{ Tag1 => "1", Tag2 => "a" },
{ Tag1 => "2", Tag2 => "a" },
{ Tag1 => "2", Tag2 => "d" },
{ Tag1 => "2", Tag2 => "a" },
{ Tag1 => "3"},
{ Tag1 => "sun", Tag2 => "a" },
{ Tag1 => "sun", Tag2 => "a" },
);
my @test2=(
{ a=>1, b=>2 },
{ ab => 12}
);
# for performance, MD5 is fastest.
my %unique = map {my @k=sort(keys %$_);
my $ctx=Digest::MD5->new() or die "Cannot make MD5 o
+bj";
$ctx->add($_ . $;) for @k;
$ctx->add("=>"); #separator for values
$ctx->add($_ . "+") for @$_{@k};
$ctx->digest() => $_ }
@test_data;
print Dumper [values %unique];
%unique = map {my @k=sort(keys %$_);
my $ctx=Digest::MD5->new() or die "Cannot make MD5 ob
+j";
$ctx->add($_ . $;) for @k;
$ctx->add("=>"); #separator for values
$ctx->add($_ . "+") for @$_{@k};
$ctx->digest() => $_ }
@test2;
print Dumper [values %unique];
"From there to here, from here to there, funny things are everywhere." -- Dr. Seuss
| [reply] [d/l] |
Re: Eliminate exact duplicates from array of hashes
by johngg (Canon) on Oct 09, 2019 at 22:29 UTC
|
use 5.026;
use warnings;
use Data::Dumper;
my @test_data = (
{ Tag1 => q{1}, Tag2 => q{a} },
{ Tag1 => q{1}, Tag2 => q{a} },
{ Tag1 => q{1}, Tag2 => q{b} },
{ Tag1 => q{1}, Tag2 => q{c} },
{ Tag1 => q{1}, Tag2 => q{a} },
{ Tag1 => q{2}, Tag2 => q{a} },
{ Tag1 => q{2}, Tag2 => q{d} },
{ Tag1 => q{2}, Tag2 => q{a} },
{ Tag1 => q{3} },
{ Tag1 => q{sun}, Tag2 => q{a} },
{ Tag1 => q{sun}, Tag2 => q{a} },
);
my @unique = do {
my %seen;
map { $_->[ 1 ] }
grep { ! $seen{ $_->[ 0 ] } ++ }
map {
my $rhItem = $_;
[
(
join qq{\x00},
map { join qq{\x00}, $_, $rhItem->{ $_ } }
sort keys %{ $rhItem }
),
$rhItem
]
}
@test_data;
};
print Data::Dumper
->new( [ \ @unique ], [ qw{ *unique } ] )
->Sortkeys( 1 )
->Dumpxs();
The output:-
@unique = (
{
'Tag1' => '1',
'Tag2' => 'a'
},
{
'Tag1' => '1',
'Tag2' => 'b'
},
{
'Tag1' => '1',
'Tag2' => 'c'
},
{
'Tag1' => '2',
'Tag2' => 'a'
},
{
'Tag1' => '2',
'Tag2' => 'd'
},
{
'Tag1' => '3'
},
{
'Tag1' => 'sun',
'Tag2' => 'a'
}
);
I hope this is of interest.
| [reply] [d/l] [select] |
|
my $null = qq{\x00};
my @test_data = (
{ "a${null}1${null}b"=>"2" },
{ a => 1, b => 2}
);
My second program gives the correct results, although it too could be defeated by sufficiently crafted data.
"From there to here, from here to there, funny things are everywhere." -- Dr. Seuss
| [reply] [d/l] |
Re: Eliminate exact duplicates from array of hashes
by Anonymous Monk on Oct 09, 2019 at 17:59 UTC
|
Thank you for the feedback (and for confirming there is not a standard/build-in solution to do this). My @test_data should contain only hashes as in my example (and probably with a bit of cleaning in the pre-process) always with the two (identical) keys. The values of the hashes should any digit/literal (words and so on), possibly containing also characters such as ,|'". (unicode strings). The size of @test_data can be big (100.000 hashes), but performance (time) is not an issue in this context.
| [reply] [d/l] [select] |
|
use strict; use warnings;
use Data::Dumper;
my @test_data = (
{ Tag1 => "1", Tag2 => "a" },
{ Tag1 => "1", Tag2 => "a" },
{ Tag1 => "1", Tag2 => "b" },
{ Tag1 => "1", Tag2 => "c" },
{ Tag1 => "1", Tag2 => "a" },
{ Tag1 => "2", Tag2 => "a" },
{ Tag1 => "2", Tag2 => "d" },
{ Tag1 => "2", Tag2 => "a" },
{ Tag1 => "3"},
{ Tag1 => "sun", Tag2 => "a" },
{ Tag1 => "sun", Tag2 => "a" },
);
my %found;
my @unique;
for my $grp (@test_data) {
my $t1=$grp->{Tag1}//'';
my $t2=$grp->{Tag2}//'';
next if ($found{$t1}{$t2});
push @unique,$grp;
$found{$t1}{$t2}=1;
}
print Dumper \@unique;
as it saves the cost of serializing and the cost of repeated storage of the characters Tag1/Tag2 at the expense of only checking the two keys AND representing the undefined key value as a zero length character scalar.
$VAR1 = [
{
'Tag1' => '1',
'Tag2' => 'a'
},
{
'Tag2' => 'b',
'Tag1' => '1'
},
{
'Tag1' => '1',
'Tag2' => 'c'
},
{
'Tag1' => '2',
'Tag2' => 'a'
},
{
'Tag2' => 'd',
'Tag1' => '2'
},
{
'Tag1' => '3'
},
{
'Tag1' => 'sun',
'Tag2' => 'a'
}
];
| [reply] [d/l] [select] |
|
|