If you step back from the structure a little and squint at the problem, you have two sets, and you want to derive their union (all elements in one, or the other, or both). In set theory, "duplicates" merge down to a single entity; bags are different in this regard. To put it another way, no matter how many duplicates of a value appear in your data, you want exactly one copy of it in the result.
You want to find the union, and an efficient way to do that is with a hash, which provides O(1) amortized inserts, and O(1) lookups.
my %seen;
my @union = grep {! $seen{$_}++} @rows, @data;
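For example (values chosen purely for illustration), if @rows held (1, 2, 3) and @data held (2, 3, 4), @union would end up as (1, 2, 3, 4): grep lets each value through only the first time %seen encounters it, so duplicates are dropped while the order of first appearance is preserved.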
Because you're pushing most of the work of the looping into Perl's internals, this next approach may be a little faster:
my %seen;
@seen{@data, @rows} = ();
@data = keys %seen;
...the optimization being that you don't have to loop in Perl code over the elements of @data. There are still implicit loops in the formation and flattening of the hash slice, but they're handled inside Perl's internals, and while still linear, they may be faster.
A still faster approach may be to use List::Util's uniq function, or the function of the same name from List::MoreUtils. In either case it would look like this:
@data = uniq @data, @rows;
Since both the uniq function and the hash-slice approach do most of their work in C land, they will probably be fairly evenly matched. You would have to benchmark them if you care enough to know which one wins.
Here are some benchmarks:
use strict;
use warnings;
use Test::More;
use Benchmark qw(cmpthese);
use List::Util qw(uniq);
my @first = (1 .. 1000);
my @second = (300 .. 700);
sub slice {
    my %seen;
    @seen{@{$_[0]}, @{$_[1]}} = ();
    return [keys %seen];
}

sub listutil {
    return [uniq(@{$_[0]}, @{$_[1]})];
}

sub seengrep {
    my %seen;
    return [grep {! $seen{$_}++} @{$_[0]}, @{$_[1]}];
}
is_deeply [sort {$a <=> $b} @{slice(\@first, \@second)}],
          [sort {$a <=> $b} @{listutil(\@first, \@second)}],
          "They matched.";

is_deeply [sort {$a <=> $b} @{listutil(\@first, \@second)}],
          [sort {$a <=> $b} @{seengrep(\@first, \@second)}],
          "They matched again.";
done_testing();
cmpthese(-5, {
    slice    => sub { my $f = slice(\@first, \@second) },
    listutil => sub { my $f = listutil(\@first, \@second) },
    seengrep => sub { my $f = seengrep(\@first, \@second) },
});
And the output produced is:
ok 1 - They matched.
ok 2 - They matched again.
1..2
              Rate seengrep    slice listutil
seengrep    3555/s       --     -16%     -17%
slice       4233/s      19%       --      -1%
listutil    4297/s      21%       2%       --
So although the List::Util approach seems to always win on my systems, its margin vs the slice approach is very narrow. The grep/seen approach is about 17% slower.
But all of these approaches have linear order of growth, whereas your approach is quadratic. So as the size of your data grows, your original solution will slow down at a much faster rate than the hash-based solutions and the uniq solution (which uses a hash internally as well).
To explain what I mean by order of growth a little better, consider the following two snippets of code:
my @list_A = (1..1000);
my @list_B = (1..1000);
# Exhibit A:
foreach my $element (@list_A, @list_B) {
    do_something($element);
}

# Exhibit B:
foreach my $element_A (@list_A) {
    foreach my $element_B (@list_B) {
        do_something($element_A, $element_B);
    }
}
Exhibit A will loop 2000 times. Exhibit B will loop 1000 * 1000 times, or 1,000,000 times. Smartmatch, when used as $target ~~ @array, relies on an internal loop that iterates over the elements of @array looking for one that matches $target. So your code, which does this:
foreach my $row (@rows) {
    push @data, $row unless $row ~~ @data;
}
...is roughly equivalent to this:
foreach my $row (@rows) {
    my $found = 0;
    foreach my $element (@data) {
        if ($row eq $element) {
            $found = 1;
            last;
        }
    }
    push @data, $row unless $found;
}
As you can see, for each element in @rows you may have to iterate through every element of @data before you know whether it belongs in the result. And each time an element does get added, @data grows by one, so the next pass over it takes a little longer.
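To put rough numbers on that (a back-of-the-envelope illustration, not a measurement): suppose @data starts empty and @rows holds 1,000 distinct values. The first row is compared against 0 elements, the second against 1, the third against 2, and so on, for a total of 0 + 1 + ... + 999 = 499,500 comparisons. A hash-based union of the same data performs on the order of 1,000 hash inserts and lookups.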
On the other hand, @seen{@data,@rows} = () is the same as:
foreach my $element (@data, @rows) {
    $seen{$element} = undef;
}
There's only a single loop. As you add elements to @data or @rows, the amount of work being done grows linearly. With your smartmatch approach, by contrast, the amount of work grows with the square of the input size, i.e. quadratically. A basic nested-loop approach (such as the one implemented by the OP) is about 90% slower, or worse, when dealing with two lists of 1,000 elements.
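If you want to check that claim on your own machine, a quick benchmark along these lines should show the gap. This is a minimal sketch rather than part of the original benchmark above: the nestedloop sub is a hypothetical stand-in for the smartmatch version, written with a plain inner loop so it runs cleanly on Perls where ~~ warns or is unavailable.

use strict;
use warnings;
use Benchmark qw(cmpthese);

my @first  = (1 .. 1000);
my @second = (300 .. 700);

# Stand-in for the smartmatch version: a plain nested loop.
sub nestedloop {
    my ($aref, $bref) = @_;
    my @union = @$aref;
    foreach my $row (@$bref) {
        my $found = 0;
        foreach my $element (@union) {
            if ($row eq $element) {
                $found = 1;
                last;
            }
        }
        push @union, $row unless $found;
    }
    return \@union;
}

# The hash-slice union from earlier in this answer.
sub slice {
    my %seen;
    @seen{@{$_[0]}, @{$_[1]}} = ();
    return [keys %seen];
}

cmpthese(-5, {
    nestedloop => sub { my $u = nestedloop(\@first, \@second) },
    slice      => sub { my $u = slice(\@first, \@second) },
});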