Re: Best method to diff very large array efficiently

in reply to Best method to diff very large array efficiently

Given the OP's two arrays of approximately 8k unique integer elements each, benchmarking (Perl v5.14.2 64-bit) suggests that using a hash and grep to find the elements in @arr_1 that are not in @arr_2 is a good choice:

use strict;
use warnings;
use Set::Scalar;
use List::Compare;
use Benchmark qw/cmpthese/;

my @arr_1 = 0 .. 8e3;
my @arr_2 = 2e3 .. 1e4;

sub setScalar {
    my $s1   = Set::Scalar->new(@arr_1);
    my $s2   = Set::Scalar->new(@arr_2);
    my $diff = $s1->difference($s2);
}

sub listCompare {
    my $lc = List::Compare->new( \@arr_1, \@arr_2 );
    my @diff = $lc->get_Lonly;
}

sub OPdiff {
    my %diff3;
    @diff3{@arr_1} = @arr_1;
    delete @diff3{@arr_2};
    my @diff = ( keys %diff3 );
}

sub OPdiffModified {
    my %diff3;
    @diff3{@arr_1} = ();
    delete @diff3{@arr_2};
    my @diff = ( keys %diff3 );
}

sub OPdiff_undef {
    my %diff3;
    undef @diff3{@arr_1};
    delete @diff3{@arr_2};
    my @diff = ( keys %diff3 );
}

sub using_vec {
    my $vec = '';
    vec( $vec, $_, 1 ) = 1 for @arr_2;
    my @diff = grep !vec( $vec, $_, 1 ), @arr_1;
}

sub hash_grep {
    my %arr_2_hash;
    undef @arr_2_hash{@arr_2};
    my @diff = grep !exists $arr_2_hash{$_}, @arr_1;
}

cmpthese(
    -5,
    {
        setScalar      => sub { setScalar() },
        listCompare    => sub { listCompare() },
        OPdiff         => sub { OPdiff() },
        OPdiffModified => sub { OPdiffModified() },
        OPdiff_undef   => sub { OPdiff_undef() },
        using_vec      => sub { using_vec() },
        hash_grep      => sub { hash_grep() }
    }
);

Output:

                 Rate setScalar listCompare using_vec OPdiff OPdiffModified hash_grep OPdiff_undef
setScalar      7.94/s        --        -72%      -98%   -98%           -98%      -98%         -99%
listCompare    28.1/s      254%          --      -92%   -93%           -94%      -94%         -96%
using_vec       349/s     4289%       1139%        --    -7%           -27%      -32%         -47%
OPdiff          375/s     4623%       1233%        8%     --           -22%      -26%         -42%
OPdiffModified  478/s     5919%       1599%       37%    27%             --       -6%         -27%
hash_grep       510/s     6317%       1712%       46%    36%             7%        --         -22%
OPdiff_undef    652/s     8105%       2217%       87%    74%            36%       28%           --

Edit I: My thanks to LanX for redirecting my attention back to the suggested @diff3{@arr_1} = (), as it makes a significant difference in performance, as shown in the OPdiffModified() benchmark results. Thus, based upon benchmarking for this task, OPdiffModified() is the best of this group of diff solutions for the OP.

Edit II: Thanks again to LanX (OK, I'll set up a new node for thanking LanX :). I substituted undef @diff3{@arr_1} for @diff3{@arr_1} = () in OPdiff_undef(), and it's now the fastest.
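
For readers comparing the three initializations benchmarked above, here is a minimal sketch of what each one leaves in the hash. The variable names are made up for illustration and the key list is tiny; the point is only that all three create the same keys, so any of them works when only `keys` or `exists` is used afterwards.

```perl
use strict;
use warnings;

my @keys = ( 3, 1, 2 );
my ( %with_values, %empty_list, %via_undef );

@with_values{@keys} = @keys;    # stores a copy of each key as its value
@empty_list{@keys}  = ();       # creates the keys; every value is undef
undef @via_undef{@keys};        # also creates the keys, with undef values

# All three hashes end up with the same key set.
my @summaries = map { join ',', sort { $a <=> $b } keys %$_ }
    ( \%with_values, \%empty_list, \%via_undef );
print "$_\n" for @summaries;
```

Only the first form pays for copying 8k values that are never read, which is consistent with OPdiff being the slowest of the three in the table above.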

Re^2: Best method to diff very large array efficiently by LanX (Saint) on Nov 25, 2013 at 21:20 UTC
As already explained, if keys are sufficient then setting values doesn't make sense (well, the OP was updated w/o mention...). Changing this `@diff3{@arr_1} = @arr_1;` to `@diff3{@arr_1} = ()` makes some difference. Cheers Rolf ( addicted to the Perl Programming Language)
Re^3: Best method to diff very large array efficiently by Kenosis (Priest) on Nov 25, 2013 at 22:43 UTC
I used `undef @arr_2_hash{@arr_2};` in `sub hash_grep()`, noting it was faster than the OP's original.

"Changing this `@diff3{@arr_1} = @arr_1;` to `@diff3{@arr_1} = ()` makes some difference."

No--it makes a huge difference and it, by far, blows everything else away. Will make that change in a new sub and re-benchmark. Glad you mentioned it!
Re^4: Best method to diff very large array efficiently by LanX (Saint) on Nov 25, 2013 at 22:51 UTC
Well, I think it depends on the test case. I tried random numbers in an interval 1..1e6 like BrowserUk did. See my benchmark here: RFC extending Benchmark.pm to facilitate CODEHASHREF. Maybe I did something wrong ... but I'm not too keen to continue; IMHO all approaches are already fast enough. Cheers Rolf

update: oops, `undef @hash{@arr}` is significantly faster than `@hash{@arr}=()`
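
A rough sketch of the kind of sparse test case described here, for anyone who wants to rerun the comparison on random input instead of compact ranges. The helper `random_unique` is hypothetical (it is not from the linked benchmark), and the sizes are assumptions chosen to match the OP's 8k elements and the 1..1e6 interval. It also checks that the hash and `vec` approaches return the same difference before any timing is done.

```perl
use strict;
use warnings;

# Hypothetical helper: $n unique random integers drawn from 1 .. $max.
sub random_unique {
    my ( $n, $max ) = @_;
    my %seen;
    $seen{ 1 + int rand $max } = 1 while keys(%seen) < $n;
    return keys %seen;
}

my @arr_1 = random_unique( 8_000, 1e6 );
my @arr_2 = random_unique( 8_000, 1e6 );

# Difference via hash lookup, as in hash_grep().
my %in_2;
undef @in_2{@arr_2};
my @by_hash = sort { $a <=> $b } grep { !exists $in_2{$_} } @arr_1;

# Difference via bit vector, as in using_vec().
my $vec = '';
vec( $vec, $_, 1 ) = 1 for @arr_2;
my @by_vec = sort { $a <=> $b } grep { !vec( $vec, $_, 1 ) } @arr_1;

print "hash and vec agree on ", scalar(@by_hash), " elements\n"
    if "@by_hash" eq "@by_vec";
```

Checking agreement first is cheap and guards against benchmarking a sub that is fast only because it is wrong.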
Re^5: Best method to diff very large array efficiently by Kenosis (Priest) on Nov 25, 2013 at 23:00 UTC
Re^2: Best method to diff very large array efficiently by BrowserUk (Patriarch) on Nov 26, 2013 at 08:55 UTC
Interesting. Here are the results of your benchmark (-Set::Scalar) run using my default perl (5.10.1 64-bit):

C:\test>1064178-b.pl
                 Rate listCompare OPdiff hash_grep OPdiffModified OPdiff_undef using_vec
listCompare    12.9/s          --   -86%      -93%           -94%         -94%      -95%
OPdiff         95.1/s        639%     --      -48%           -52%         -53%      -65%
hash_grep       185/s       1334%    94%        --            -8%          -9%      -32%
OPdiffModified  200/s       1452%   110%        8%             --          -2%      -27%
OPdiff_undef    203/s       1478%   114%       10%             2%           --      -26%
using_vec       273/s       2019%   187%       48%            37%          34%        --

And this using 5.18 64-bit (also minus List::Compare):

C:\test>\perl5.18\bin\perl 1064178-b.pl
                Rate OPdiff using_vec hash_grep OPdiffModified OPdiff_undef
OPdiff         126/s     --      -22%      -31%           -43%         -44%
using_vec      162/s    28%        --      -12%           -26%         -28%
hash_grep      183/s    45%       13%        --           -17%         -19%
OPdiffModified 220/s    74%       36%       20%             --          -2%
OPdiff_undef   225/s    79%       39%       23%             2%           --

They've really screwed up vec. ( Along with substr and a bunch of others :( )

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.
Re^3: Best method to diff very large array efficiently by Kenosis (Priest) on Nov 26, 2013 at 18:15 UTC
I wouldn't have predicted such a dramatic performance disparity across Perl versions. Have updated my benchmarking post with "(Perl v5.14.2 64-bit)" - which I now think should always be included in benchmarks. Greatly appreciate your informative reply!
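
One way to make sure the interpreter version always travels with the numbers is to have the benchmark script print it itself. This is only a sketch: the banner format and the `$banner` variable are assumptions, while `$^V` and the core `Config` module are standard Perl.

```perl
use strict;
use warnings;
use Config;

# Build a one-line banner such as "Perl v5.14.2 (<archname>)".
my $banner = sprintf "Perl v%vd (%s)", $^V, $Config{archname};
print "$banner\n";

# ... the cmpthese( ... ) call would follow here ...
```

Printing the banner right before the `cmpthese` output means a pasted result table can be traced back to the perl that produced it.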
Re^4: Best method to diff very large array efficiently by LanX (Saint) on Nov 26, 2013 at 18:53 UTC
My benchmark was run on `v5.10.0 built for i486-linux-gnu-thread-multi` and `vec` was by far the slowest. 3 differences:

1. I avoided allocating useless arrays for the result
2. Only tested without non-core modules (too lazy to install)
3. LBNL: I tested with random numbers out of 1..1e6 and you took compact intervals!

Obviously `vec` scales badly the sparser the distribution of values becomes... IMO not very surprising.

update: found bug in benchmark, will correct later. vec still among slowest...

update: Thanks to BrowserUk for vividly commenting twice that the benchmark is broken, after I already mentioned that the benchmark is buggy. Cheers Rolf
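
To illustrate the sparseness point with something concrete: with one bit per member, the size of a `vec` string is set by the largest value stored, not by how many values are stored. The sketch below uses made-up variable names, and the ranges are assumptions picked to mirror the two test cases discussed in this thread.

```perl
use strict;
use warnings;

# Compact interval, as in the original benchmark's @arr_2.
my $dense = '';
vec( $dense, $_, 1 ) = 1 for 2_000 .. 10_000;

# Just two members, but one of them is near the top of 1..1e6.
my $sparse = '';
vec( $sparse, $_, 1 ) = 1 for 5, 1_000_000;

my $dense_bytes  = length $dense;
my $sparse_bytes = length $sparse;

printf "dense:  8001 members, %d bytes\n", $dense_bytes;
printf "sparse:    2 members, %d bytes\n", $sparse_bytes;
```

The string has to reach the byte holding the highest bit, so two widely spaced members cost about a hundred times more memory than 8001 tightly packed ones, while a hash grows only with the number of keys.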
Re^5: Best method to diff very large array efficiently by LanX (Saint) on Nov 27, 2013 at 01:21 UTC
Re^5: Best method to diff very large array efficiently by BrowserUk (Patriarch) on Nov 26, 2013 at 20:00 UTC
Re^5: Best method to diff very large array efficiently by BrowserUk (Patriarch) on Nov 26, 2013 at 20:34 UTC
Re^4: Best method to diff very large array efficiently by BrowserUk (Patriarch) on Nov 26, 2013 at 19:49 UTC
"I wouldn't have predicted such a dramatic performance disparity across Perl versions."

It took me by surprise also. I've done, and posted, this hashes versus vec benchmark many times over the years; yours was the first challenge to what I took to be simple fact. It's yet another plank in my rapidly growing conclusion that 5.10.1 was 'peak Perl'.

"Have updated my benchmarking post with "(Perl v5.14.2 64-bit)" - which I now think should always be included in benchmarks."

I wholeheartedly concur and will endeavour to do the same in future.

"Greatly appreciate your informative reply!"

We both learned something. That's win-win. You can't ask for more :)
