PerlMonks

Fastest way to merge (and de-dup) two large arrays?

by technojosh (Priest)
on Aug 11, 2016 at 18:27 UTC

technojosh has asked for the wisdom of the Perl Monks concerning the following question:

Been a while since I posted a question, and I've run through Super Search a few ways today. Looking for advice on the following question:

What is the fastest way to merge (and de-dup) two large arrays?

I can already merge my two arrays and de-dup them, with the code below...

IMPORTANT NOTE: This code is already using a bunch of forks (not shown), on account of all the trips through the loop doing a smartmatch. I'm already pushing all CPU cores pretty hard, so threading/forking should not be part of your recommendation unless it would be a very different approach.

# Depending on your version, you may need:
no warnings 'experimental::smartmatch';

# I have two arrays, @rows and @data
# @data is a pre-loaded array of strings (~60k elements)
# @rows is a much larger array of strings (~600k elements)

foreach my $rawData (@rows) {
    if( $rawData ~~ @data ) {
        # don't do anything, already in array
    }
    else {
        push(@data, $rawData);
    }
}

My first thought is: am I doing this backwards? If @rows is much larger, should I just send the loop through @data instead (which would avoid 500k+ iterations)?

My second thought is: this is the kind of thing Perl does very well - there must be something faster out there besides iterating over every element individually. Maybe using hashes instead (nothing needs to be "in order" here)? As it is, this runs for hours using up 8 cores.

Replies are listed 'Best First'.
Re: Fastest way to merge (and de-dup) two large arrays?
by Corion (Patriarch) on Aug 11, 2016 at 18:41 UTC

    The likely much faster way, instead of using the smartmatch operator, is to tell Perl that you'll be looking things up in the smaller array very often by converting that array to a hash:

    my %seen;
    $seen{ $_ } = 1 for @data;

    foreach my $rawData (@rows) {
        if( $seen{ $rawData } ) {
            # don't do anything, already in array
        }
        else {
            push(@data, $rawData);
            $seen{ $rawData } = 1;
        }
    }

      This was where I was going as well, but I was going to suggest hashing the larger array instead of the smaller one. I guess I assumed it would be faster to do fewer loop iterations, checking for the existence of a key in a larger hash, than the other way around. Is this something that would have to be benchmarked on a real data set, or is it a sure thing that the way you show is faster?

      I ask because I've done this exact thing before and I'm curious for an explanation on the efficiency difference.

      I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
      I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious

        Every array element for the lookup will need to be hashed anyway, either for "storage" as a hash key or for lookup as a hash key, so that is a wash.

        I usually choose the smaller array for the hash because the smaller array is more likely to fit in memory.
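
        A minimal sketch of the alternative discussed above (hashing the larger @rows and looping over the smaller @data; hypothetical, assuming the same @rows/@data as in the question, with the union ending up in @rows rather than @data):

        my %seen;
        $seen{ $_ } = 1 for @rows;               # hash the larger array instead

        foreach my $d (@data) {
            push @rows, $d unless $seen{ $d }++;
        }
        # @rows now holds the merged, de-duplicated list

        Either way, every element of both arrays gets hashed exactly once, which is why the total cost is roughly the same.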

      Isn't the $seen{ $rawData } = 1; line only needed if it's necessary to check for duplicates within each individual array, as opposed to only checking for duplicates between the two arrays? If checking for such duplicates within each array isn't necessary then I think that could be a small efficiency gain too because the lookup hash wouldn't keep growing as the loop iterated.

      I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
      I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious
        ... push(@data,$rawData);

        @data gets modified whenever a new item is found, so we need to update our cache of @data as well.
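
        A tiny illustration (hypothetical data) of why the update also matters when @rows itself contains repeats:

        my @data = ();
        my @rows = ('a', 'a');

        my %seen;
        $seen{ $_ } = 1 for @data;

        foreach my $rawData (@rows) {
            next if $seen{ $rawData };
            push @data, $rawData;
            $seen{ $rawData } = 1;   # without this, the second 'a' would be pushed too
        }
        # @data is now ('a'), not ('a', 'a')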

      Hashes did the trick.

      The function now finishes in minutes instead of hours. I think initially I was trying to keep things ordered when there was no real requirement to do so.

      Way faster.

Re: Fastest way to merge (and de-dup) two large arrays?
by davido (Cardinal) on Aug 12, 2016 at 01:24 UTC

    If you step back from the structure a little to squint at the problem, you have two sets, and you want to derive the union (all elements in one, or the other, or both). In set theory "duplicates" merge down to a single entity. Bags are different in this regard. To put it another way: for each distinct value in your data set you want exactly one element in the result, no matter how many duplicates of it appear.

    You want to find the union, and an efficient way to do that is with a hash, which provides O(1) amortized inserts, and O(1) lookups.

    my %seen;
    my @union = grep { ! $seen{$_}++ } @rows, @data;

    Because you're pushing most of the work of the looping into Perl's internals, this next approach may be a little faster:

    my %seen;
    @seen{@data, @rows} = ();
    @data = keys %seen;

    ...the optimization being that you don't have to loop in Perl code over the elements in @data. There are still implicit loops in the formation and flattening of the hash slice, but they're handled inside of Perl, and while still linear, may be faster.

    A still faster approach may be to use List::Util's uniq function, or the function of the same name from List::MoreUtils. In either case it would look like:

    @data = uniq @data, @rows;

    Since both the uniq function and the hash-slice method do most of their work in C land, they would probably be fairly evenly matched. You would have to benchmark if you care enough to know which one wins.

    Here are some benchmarks:

    use strict;
    use warnings;
    use Test::More;
    use Benchmark qw(cmpthese);
    use List::Util qw(uniq);

    my @first  = (1 .. 1000);
    my @second = (300 .. 700);

    sub slice {
        my %seen;
        @seen{ @{$_[0]}, @{$_[1]} } = ();
        return [keys %seen];
    }

    sub listutil {
        return [ uniq(@{$_[0]}, @{$_[1]}) ];
    }

    sub seengrep {
        my %seen;
        return [ grep { ! $seen{$_}++ } @{$_[0]}, @{$_[1]} ];
    }

    is_deeply [sort {$a <=> $b} @{slice(   \@first, \@second)}],
              [sort {$a <=> $b} @{listutil(\@first, \@second)}],
              "They matched.";

    is_deeply [sort {$a <=> $b} @{listutil(\@first, \@second)}],
              [sort {$a <=> $b} @{seengrep(\@first, \@second)}],
              "They matched again.";

    done_testing();

    cmpthese(-5, {
        slice    => sub { my $f = slice(   \@first, \@second) },
        listutil => sub { my $f = listutil(\@first, \@second) },
        seengrep => sub { my $f = seengrep(\@first, \@second) },
    });

    And the output produced is:

    ok 1 - They matched.
    ok 2 - They matched again.
    1..2
                 Rate seengrep    slice listutil
    seengrep   3555/s       --     -16%     -17%
    slice      4233/s      19%       --      -1%
    listutil   4297/s      21%       2%       --

    So although the List::Util approach seems to always win on my systems, its margin vs the slice approach is very narrow. The grep/seen approach is about 17% slower.

    But all of these approaches are linear in their order of growth, whereas your approach is quadratic. So as the size of your data grows, your original solution will become slower at a much faster rate than the hash-based solutions and the 'uniq' solution (which internally uses a hash as well).

    To explain what I mean by order of growth a little better, consider the following two snippets of code:

    my @list_A = (1..1000);
    my @list_B = (1..1000);

    # Exhibit A:
    foreach my $element (@list_A, @list_B) {
        do_something($element);
    }

    # Exhibit B:
    foreach my $element_A (@list_A) {
        foreach my $element_B (@list_B) {
            do_something($element_A, $element_B);
        }
    }

    Exhibit A will loop 2000 times. Exhibit B will loop 1000*1000 times, or 1,000,000 times. Smartmatch, when used like this: $target ~~ @array, relies on an internal loop that iterates over all elements of @array to find the one(s) matching $target. So your code, which does this:

    foreach my $row (@rows) {
        push @data, $row unless $row ~~ @data;
    }

    ...is similar to this:

    foreach my $row (@rows) {
        my $found = 0;
        foreach my $element (@data) {
            if ($row =~ m/\Q$element/) {
                $found = 1;
                last;
            }
        }
        push @data, $row unless $found;   # only push rows that were not found
    }

    As you can see, for each element in @rows you may have to iterate through every element of @data before it can be accepted or rejected. And each time an element is accepted, @data grows by one, so the scan for the next element of @rows gets even longer.

    On the other hand, @seen{@data,@rows} = () is the same as:

    foreach my $element (@data, @rows) {
        $seen{$element} = undef;   # same effect as the empty-list slice assignment
    }

    There's only a single loop. As you add elements to @data or @rows the amount of work being done grows linearly. Whereas with your smartmatch approach, the total amount of work grows quadratically (on the order of n^2). A basic nested-loop approach (such as was implemented by the OP) is about 90% slower, or worse, when dealing with two lists of 1000 elements.


    Dave

Re: Fastest way to merge (and de-dup) two large arrays?
by BrowserUk (Patriarch) on Aug 11, 2016 at 19:39 UTC

    What does the data look like? I.e. numeric or strings? If numeric, what range? If strings, what length and alphabet?

    It is often possible to do this task more quickly or more compactly (or both) than with a hash, if the data is, or can be transformed into, something amenable to other methods.
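
    For example (a hypothetical sketch, assuming the data were non-negative integers below a known maximum, which turns out not to be the case here), a bit vector via vec() can de-dup with far less memory than a hash (~125 KB for values up to a million):

    use strict;
    use warnings;

    # Hypothetical: de-dup integers 0 .. $max using one bit per possible value.
    my @data = (7, 100);
    my @rows = (3, 7, 3, 42, 7);
    my $max  = 1_000_000;                    # assumed upper bound on the values

    my $bits = "\0" x ($max / 8 + 1);
    my @union;
    for my $n (@data, @rows) {
        next if vec($bits, $n, 1);           # already seen
        vec($bits, $n, 1) = 1;
        push @union, $n;
    }
    print "@union\n";                        # 7 100 3 42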


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
      both are arrays of strings

      English alphabet, and nothing incredibly long... length($string) could be > 100, but nothing past 400, and could be much smaller (i.e. just a few characters)

        "arrays of strings ... English alphabet ... could be > 100"

        Then a hash is your best bet.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Fastest way to merge (and de-dup) two large arrays?
by Anonymous Monk on Aug 11, 2016 at 20:08 UTC

    If you have a choice of how to keep your data, then the smart way is to keep it sorted. Merging the two collections is then a simple matter of a streaming merge. Typically, this can be done as fast as you can do the I/O.
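
    A minimal sketch of such a streaming merge (hypothetical data; assumes both arrays are already sorted):

    use strict;
    use warnings;

    my @data = sort qw(apple cherry pear);
    my @rows = sort qw(banana cherry fig pear);

    my @merged;
    my ($i, $j) = (0, 0);
    while ($i < @data or $j < @rows) {
        my $next;
        if    ($j >= @rows)            { $next = $data[$i++] }   # @rows exhausted
        elsif ($i >= @data)            { $next = $rows[$j++] }   # @data exhausted
        elsif ($data[$i] le $rows[$j]) { $next = $data[$i++] }
        else                           { $next = $rows[$j++] }
        push @merged, $next unless @merged and $merged[-1] eq $next;   # skip dups
    }
    print "@merged\n";   # apple banana cherry fig pear

    Each element is visited once, so the merge is linear and needs no auxiliary hash.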

Re: Fastest way to merge (and de-dup) two large arrays?
by Anonymous Monk on Aug 11, 2016 at 22:22 UTC

    Does something like this fit in memory with your data?

    #!/usr/bin/perl
    # http://perlmonks.org/?node_id=1169599
    use strict;
    use warnings;

    my @data = 1..6e4;            # fake strings
    my @rows = 1+3e4..6e5+3e4;    # fake strings

    my %combined;
    @combined{@data, @rows} = ();
    @data = keys %combined;
    print "size = " . @data, "\n";
Thank You
by technojosh (Priest) on Aug 12, 2016 at 02:11 UTC
    I'm going to start with the hash approach; I kind of had a hunch that's what the advice would be. I am interested in the modules though, and will be looking at them when I get a chance.
Re: Fastest way to merge (and de-dup) two large arrays?
by Anonymous Monk on Aug 11, 2016 at 18:47 UTC

    Is preserving order important or irrelevant?

        Then, as Corion suggested, hashing instead of smart matching is probably a great way to go for run-time efficiency.

        I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
        I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious

        On a *nix system, I would try writing both arrays out to the same file and run

        sort -u

        on the file. Might be faster, might not...
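
        A rough sketch of that idea driven from Perl (hypothetical file names; relies on the external sort(1) utility and assumes the strings contain no embedded newlines):

        use strict;
        use warnings;

        my @data = qw(apple cherry pear);     # stand-ins for the real arrays
        my @rows = qw(banana cherry fig);

        open my $fh, '>', 'combined.txt' or die "open: $!";
        print {$fh} "$_\n" for @data, @rows;
        close $fh or die "close: $!";

        # -u de-dups while sorting; -o writes the result to a file
        system('sort', '-u', '-o', 'deduped.txt', 'combined.txt') == 0
            or die "sort failed: $?";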
