PerlMonks

Fastest way to merge (and de-dup) two large arrays?

by technojosh (Priest)
on Aug 11, 2016 at 18:27 UTC

technojosh has asked for the wisdom of the Perl Monks concerning the following question:

Been a while since I posted a question, and I've run through Super Search a few ways today. Looking for advice on the following question:

What is the fastest way to merge (and de-dup) two large arrays?

I can already merge my two arrays and de-dup them, with the code below...

IMPORTANT NOTE: This code is already using a bunch of forks (not shown), on account of all the trips through the loop doing a smartmatch. I'm already pushing all CPU cores pretty hard, so threading/forking should not be part of your recommendation unless it would be a very different approach.

# Depending on your version, you may need:
no warnings 'experimental::smartmatch';

# I have two arrays, @rows and @data
# @data is a pre-loaded array of strings (~60k elements)
# @rows is a much larger array of strings (~600k elements)

foreach my $rawData (@rows) {
    if( $rawData ~~ @data ) {
        # don't do anything, already in array
    }
    else {
        push(@data, $rawData);
    }
}

My first thought is: am I doing this backwards? If @rows is much larger, should I just send the loop through @data instead (which would avoid 500k+ iterations)?

My second thought is: this is the kind of thing Perl does very well - there must be something faster out there besides iterating over every element individually. Maybe using hashes instead (nothing needs to be "in order" here)? As it is, this runs for hours using up 8 cores.

Replies are listed 'Best First'.
Re: Fastest way to merge (and de-dup) two large arrays?
by Corion (Patriarch) on Aug 11, 2016 at 18:41 UTC

    The likely much faster way, instead of using the smartmatch operator, is to tell Perl that you'll be looking things up in the smaller array very often by converting that array to a hash:

    my %seen;
    $seen{ $_ } = 1 for @data;

    foreach my $rawData (@rows) {
        if( $seen{ $rawData } ) {
            # don't do anything, already in array
        }
        else {
            push(@data, $rawData);
            $seen{ $rawData } = 1;
        }
    }

      This was where I was going as well, but I was going to suggest hashing the larger array instead of the smaller one. I guess I assumed it would be faster to do fewer loop iterations, checking for the existence of a key in a larger hash, than the other way around. Is this something that would have to be benchmarked on a real data set, or is it a sure thing that the way you show is faster?

      I ask because I've done this exact thing before and I'm curious for an explanation on the efficiency difference.

      I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
      I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious

        Every array element for the lookup will need to be hashed anyway, either for "storage" as a hash key or for lookup as a hash key, so that is a wash.

        I usually choose the smaller array for the hash because the smaller array is more likely to fit in memory.
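
        A minimal sketch of the alternative discussed above (hashing the larger @rows and looping over the smaller @data; hypothetical, assuming the same @rows/@data as in the question, with the union ending up in @rows rather than @data):

        my %seen;
        $seen{ $_ } = 1 for @rows;               # hash the larger array instead

        foreach my $d (@data) {
            push @rows, $d unless $seen{ $d }++;
        }
        # @rows now holds the merged, de-duplicated list

        Either way, every element of both arrays gets hashed exactly once, which is why the total cost is roughly the same.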

      Isn't the $seen{ $rawData } = 1; line only needed if it's necessary to check for duplicates within each individual array, as opposed to only checking for duplicates between the two arrays? If checking for such duplicates within each array isn't necessary then I think that could be a small efficiency gain too because the lookup hash wouldn't keep growing as the loop iterated.

      I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
      I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious
        ... push(@data,$rawData);

        @data gets modified whenever a new item is found, so we need to update our cache of @data as well.
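
        A tiny illustration (hypothetical data) of why the update also matters when @rows itself contains repeats:

        my @data = ();
        my @rows = ('a', 'a');

        my %seen;
        $seen{ $_ } = 1 for @data;

        foreach my $rawData (@rows) {
            next if $seen{ $rawData };
            push @data, $rawData;
            $seen{ $rawData } = 1;   # without this, the second 'a' would be pushed too
        }
        # @data is now ('a'), not ('a', 'a')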

      Hashes did the trick.

      The function now finishes in minutes instead of hours. I think initially I was trying to keep things ordered when there was no real requirement to do so.

      Way faster.

Re: Fastest way to merge (and de-dup) two large arrays?
by davido (Cardinal) on Aug 12, 2016 at 01:24 UTC

    If you step back from the structure a little to squint at the problem, you have two sets, and you want to derive the union (all elements in one, or the other, or both). In set theory "duplicates" merge down to a single entity. Bags are different in this regard. To put it another way: for each distinct value in your data set you want exactly one element in the result, no matter how many duplicates of it appear.

    You want to find the union, and an efficient way to do that is with a hash, which provides O(1) amortized inserts, and O(1) lookups.

    my %seen;
    my @union = grep { ! $seen{$_}++ } @rows, @data;

    Because you're pushing most of the work of the looping into Perl's internals, this next approach may be a little faster:

    my %seen;
    @seen{@data, @rows} = ();
    @data = keys %seen;

    ...the optimization being that you don't have to loop in Perl code over the elements in @data. There are still implicit loops in the formation and flattening of the hash slice, but they're handled inside of Perl, and while still linear, may be faster.

    A still faster approach may be to use List::Util's uniq function, or the function of the same name from List::MoreUtils. In either case it would look like:

    @data = uniq @data, @rows;

    Since both the uniq function and the hash-slice method do most of their work in C land, they would probably be fairly evenly matched. You would have to benchmark if you care enough to know which one wins.

    Here are some benchmarks:

    use strict;
    use warnings;
    use Test::More;
    use Benchmark qw(cmpthese);
    use List::Util qw(uniq);

    my @first  = (1 .. 1000);
    my @second = (300 .. 700);

    sub slice {
        my %seen;
        @seen{ @{$_[0]}, @{$_[1]} } = ();
        return [keys %seen];
    }

    sub listutil {
        return [ uniq(@{$_[0]}, @{$_[1]}) ];
    }

    sub seengrep {
        my %seen;
        return [ grep { ! $seen{$_}++ } @{$_[0]}, @{$_[1]} ];
    }

    is_deeply [sort {$a <=> $b} @{slice(   \@first, \@second)}],
              [sort {$a <=> $b} @{listutil(\@first, \@second)}],
              "They matched.";

    is_deeply [sort {$a <=> $b} @{listutil(\@first, \@second)}],
              [sort {$a <=> $b} @{seengrep(\@first, \@second)}],
              "They matched again.";

    done_testing();

    cmpthese(-5, {
        slice    => sub { my $f = slice(   \@first, \@second) },
        listutil => sub { my $f = listutil(\@first, \@second) },
        seengrep => sub { my $f = seengrep(\@first, \@second) },
    });

    And the output produced is:

    ok 1 - They matched.
    ok 2 - They matched again.
    1..2
                 Rate seengrep    slice listutil
    seengrep   3555/s       --     -16%     -17%
    slice      4233/s      19%       --      -1%
    listutil   4297/s      21%       2%       --

    So although the List::Util approach seems to always win on my systems, its margin vs the slice approach is very narrow. The grep/seen approach is about 17% slower.

    But all of these approaches are linear in their order of growth, whereas your approach is quadratic. So as the size of your data grows, your original solution will become slower at a much faster rate than the hash-based solutions and the 'uniq' solution (which internally uses a hash as well).

    To explain what I mean by order of growth a little better, consider the following two snippets of code:

    my @list_A = (1..1000);
    my @list_B = (1..1000);

    # Exhibit A:
    foreach my $element (@list_A, @list_B) {
        do_something($element);
    }

    # Exhibit B:
    foreach my $element_A (@list_A) {
        foreach my $element_B (@list_B) {
            do_something($element_A, $element_B);
        }
    }

    Exhibit A will loop 2000 times. Exhibit B will loop 1000*1000 times, or 1,000,000 times. Smartmatch, when used like this: $target ~~ @array, relies on an internal loop that iterates over all elements of @array to find the one(s) matching $target. So your code, which does this:

    foreach my $row (@rows) {
        push @data, $row unless $row ~~ @data;
    }

    ...is similar to this:

    foreach my $row (@rows) {
        my $found = 0;
        foreach my $element (@data) {
            if ($row =~ m/\Q$element/) {
                $found = 1;
                last;
            }
        }
        push @data, $row unless $found;   # only push rows that were not found
    }

    As you can see, for each element in @rows you may have to iterate through every element of @data before it can be accepted or rejected. And each time an element is accepted, @data grows by one, so the scan for the next element of @rows gets even longer.

    On the other hand, @seen{@data,@rows} = () is the same as:

    foreach my $element (@data, @rows) {
        $seen{$element} = undef;   # same effect as the empty-list slice assignment
    }

    There's only a single loop. As you add elements to @data or @rows the amount of work being done grows linearly. Whereas with your smartmatch approach, the total amount of work grows quadratically (on the order of n^2). A basic nested-loop approach (such as was implemented by the OP) is about 90% slower, or worse, when dealing with two lists of 1000 elements.


    Dave

Re: Fastest way to merge (and de-dup) two large arrays?
by BrowserUk (Patriarch) on Aug 11, 2016 at 19:39 UTC

    What does the data look like? I.e. numeric or strings? If numeric, what range? If strings, what length and alphabet?

    It is often possible to do this task more quickly or more compactly (or both) than with a hash, if the data is, or can be transformed into, something amenable to other methods.
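
    For example (a hypothetical sketch, assuming the data were non-negative integers below a known maximum, which turns out not to be the case here), a bit vector via vec() can de-dup with far less memory than a hash (~125 KB for values up to a million):

    use strict;
    use warnings;

    # Hypothetical: de-dup integers 0 .. $max using one bit per possible value.
    my @data = (7, 100);
    my @rows = (3, 7, 3, 42, 7);
    my $max  = 1_000_000;                    # assumed upper bound on the values

    my $bits = "\0" x ($max / 8 + 1);
    my @union;
    for my $n (@data, @rows) {
        next if vec($bits, $n, 1);           # already seen
        vec($bits, $n, 1) = 1;
        push @union, $n;
    }
    print "@union\n";                        # 7 100 3 42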


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
      both are arrays of strings

      English alphabet, and nothing incredibly long... length($string) could be > 100, but nothing past 400, and could be much smaller (i.e. just a few characters)

        "arrays of strings ... English alphabet ... could be > 100"

        Then a hash is your best bet.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Fastest way to merge (and de-dup) two large arrays?
by Anonymous Monk on Aug 11, 2016 at 20:08 UTC

    If you have a choice of how to keep your data, then the smart way is to keep it sorted. Merging the two collections is then a simple matter of a streaming merge. Typically, this can be done as fast as you can do the I/O.
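
    A minimal sketch of such a streaming merge (hypothetical data; assumes both arrays are already sorted):

    use strict;
    use warnings;

    my @data = sort qw(apple cherry pear);
    my @rows = sort qw(banana cherry fig pear);

    my @merged;
    my ($i, $j) = (0, 0);
    while ($i < @data or $j < @rows) {
        my $next;
        if    ($j >= @rows)            { $next = $data[$i++] }   # @rows exhausted
        elsif ($i >= @data)            { $next = $rows[$j++] }   # @data exhausted
        elsif ($data[$i] le $rows[$j]) { $next = $data[$i++] }
        else                           { $next = $rows[$j++] }
        push @merged, $next unless @merged and $merged[-1] eq $next;   # skip dups
    }
    print "@merged\n";   # apple banana cherry fig pear

    Each element is visited once, so the merge is linear and needs no auxiliary hash.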

Re: Fastest way to merge (and de-dup) two large arrays?
by Anonymous Monk on Aug 11, 2016 at 22:22 UTC

    Does something like this fit in memory with your data?

    #!/usr/bin/perl
    # http://perlmonks.org/?node_id=1169599
    use strict;
    use warnings;

    my @data = 1..6e4;            # fake strings
    my @rows = 1+3e4..6e5+3e4;    # fake strings

    my %combined;
    @combined{@data, @rows} = ();
    @data = keys %combined;
    print "size = " . @data, "\n";
Thank You
by technojosh (Priest) on Aug 12, 2016 at 02:11 UTC
    I'm going to start with the hash approach; I kind of had a hunch that's what the advice would be. I am interested in the modules though, and will be looking at them when I get a chance.
Re: Fastest way to merge (and de-dup) two large arrays?
by Anonymous Monk on Aug 11, 2016 at 18:47 UTC

    Is preserving order important or irrelevant?

        Then, as Corion suggested, hashing instead of smart matching is probably a great way to go for run-time efficiency.

        I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
        I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious

        On a *nix system, I would try writing both arrays out to the same file and run

        sort -u

        on the file. Might be faster, might not...
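
        A rough sketch of that idea driven from Perl (hypothetical file names; relies on the external sort(1) utility and assumes the strings contain no embedded newlines):

        use strict;
        use warnings;

        my @data = qw(apple cherry pear);     # stand-ins for the real arrays
        my @rows = qw(banana cherry fig);

        open my $fh, '>', 'combined.txt' or die "open: $!";
        print {$fh} "$_\n" for @data, @rows;
        close $fh or die "close: $!";

        # -u de-dups while sorting; -o writes the result to a file
        system('sort', '-u', '-o', 'deduped.txt', 'combined.txt') == 0
            or die "sort failed: $?";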
