http://www.perlmonks.org?node_id=1064279


in reply to Best method to diff very large array efficiently

Given the OP's two arrays containing approximately 8k unique integer elements, it's evident from benchmarking (Perl v5.14.2 64-bit) that using a hash and grep to find the elements in @arr_1 that are not in @arr_2 may be a good choice:

use strict; use warnings; use Set::Scalar; use List::Compare; use Benchmark qw/cmpthese/; my @arr_1 = 0 .. 8e3; my @arr_2 = 2e3 .. 1e4; sub setScalar { my $s1 = Set::Scalar->new(@arr_1); my $s2 = Set::Scalar->new(@arr_2); my $diff = $s1->difference($s2); } sub listCompare { my $lc = List::Compare->new( \@arr_1, \@arr_2 ); my @diff = $lc->get_Lonly; } sub OPdiff { my %diff3; @diff3{@arr_1} = @arr_1; delete @diff3{@arr_2}; my @diff = ( keys %diff3 ); } sub OPdiffModified { my %diff3; @diff3{@arr_1} = (); delete @diff3{@arr_2}; my @diff = ( keys %diff3 ); } sub OPdiff_undef { my %diff3; undef @diff3{@arr_1}; delete @diff3{@arr_2}; my @diff = ( keys %diff3 ); } sub using_vec { my $vec = ''; vec( $vec, $_, 1 ) = 1 for @arr_2; my @diff = grep !vec( $vec, $_, 1 ), @arr_1; } sub hash_grep { my %arr_2_hash; undef @arr_2_hash{@arr_2}; my @diff = grep !exists $arr_2_hash{$_}, @arr_1; } cmpthese( -5, { setScalar => sub { setScalar() }, listCompare => sub { listCompare() }, OPdiff => sub { OPdiff() }, OPdiffModified => sub { OPdiffModified() }, OPdiff_undef => sub { OPdiff_undef() }, using_vec => sub { using_vec() }, hash_grep => sub { hash_grep() } } );

Output:

Rate setScalar listCompare using_vec OPdiff OPdiffMod +ified hash_grep OPdiff_undef setScalar 7.94/s -- -72% -98% -98% + -98% -98% -99% listCompare 28.1/s 254% -- -92% -93% + -94% -94% -96% using_vec 349/s 4289% 1139% -- -7% + -27% -32% -47% OPdiff 375/s 4623% 1233% 8% -- + -22% -26% -42% OPdiffModified 478/s 5919% 1599% 37% 27% + -- -6% -27% hash_grep 510/s 6317% 1712% 46% 36% + 7% -- -22% OPdiff_undef 652/s 8105% 2217% 87% 74% + 36% 28% -- --

Edit I: My thanks to LanX for redirecting my attention back to the suggested @diff3{@arr_1} = (), as it makes a significant difference in performance, as shown in the OPdiffModified() benchmark results. Thus, based upon benchmarking for this task, OPdiffModified() is the best of this group of diff solutions for the OP.

Edit II: Thanks again to LanX (OK, I'll set up a new node for thanking LanX :), substituted @diff3{@arr_1} = () with undef @diff3{@arr_1} in OPdiff_undef(), and it's now the fastest.

Replies are listed 'Best First'.
Re^2: Best method to diff very large array efficiently
by LanX (Saint) on Nov 25, 2013 at 21:20 UTC
    Like already explained, if keys are sufficient then setting values doesn't make sense (well the OP was updated w/o mention...)

    Changing this @diff3{@arr_1} = @arr_1; to  @diff3{@arr_1} = () makes some difference.

    Cheers Rolf

    ( addicted to the Perl Programming Language)

      I undef @arr_2_hash{@arr_2}; in sub hash_grep(), noting it was faster than the OP's original.

      Changing this @diff3{@arr_1} = @arr_1; to @diff3{@arr_1} = () makes some difference.

      No--it makes a huge difference and it, by far, blows everything else away. Will make that change in a new sub and re-benchmark. Glad you mentioned it!

        Well I think it depends on the testcase, I tried random numbers in an intervall 1..1e6 like BUK did.

        See my benchmark here RFC extending Benchmark.pm to facilitate CODEHASHREF

        Maybe I did something wrong ...

        ... but I'm not to keen to continue, IMHO all approaches are already fast enough.

        Cheers Rolf

        ( addicted to the Perl Programming Language)

        update

        oops undef @hash{@arr} is significantly faster than @hash{@arr}=()

Re^2: Best method to diff very large array efficiently
by BrowserUk (Patriarch) on Nov 26, 2013 at 08:55 UTC

    Interesting. Here are the results of your benchmark (-Set::Scalar) run using my default perl (5.10.1 64-bit):

    C:\test>1064178-b.pl Rate listCompare OPdiff hash_grep OPdiffModified OPdi +ff_undef using_vec listCompare 12.9/s -- -86% -93% -94% + -94% -95% OPdiff 95.1/s 639% -- -48% -52% + -53% -65% hash_grep 185/s 1334% 94% -- -8% + -9% -32% OPdiffModified 200/s 1452% 110% 8% -- + -2% -27% OPdiff_undef 203/s 1478% 114% 10% 2% + -- -26% using_vec 273/s 2019% 187% 48% 37% + 34% --

    And this using 5.18 64-bit (also minus List::Compare):

    C:\test>\perl5.18\bin\perl 1064178-b.pl Rate OPdiff using_vec hash_grep OPdiffModified OP +diff_undef OPdiff 126/s -- -22% -31% -43% + -44% using_vec 162/s 28% -- -12% -26% + -28% hash_grep 183/s 45% 13% -- -17% + -19% OPdiffModified 220/s 74% 36% 20% -- + -2% OPdiff_undef 225/s 79% 39% 23% 2% + --

    They've really screwed up vec. ( Along with substr and a bunch of others :( )


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      I wouldn't have predicted such a dramatic performance disparity across Perl versions. Have updated my benchmarking post with "(Perl v5.14.2 64-bit)" - which I now think should always be included in benchmarks. Greatly appreciate your informative reply!

        my benchmark was run on v5.10.0 built for i486-linux-gnu-thread-multi and vec was by far the slowest.

        3 differences:

        • I avoided allocating useless arrays for the result
        • Only tested with non-core modules (too lazy to install)
        • LBNL: I tested with random numbers out of 1..1e6 and you took compact intervals!

        Obviously vec scales badly the sparser the distribution of values become...

        IMO not very surprising.

        update

        found bug in benchmark, will correct later. vec still among slowest...

        update

        Thanks to BrowserUk for vividly commenting twice that the benchmark is broken, after I already mentioned that the benchmark is buggy.

        Cheers Rolf

        ( addicted to the Perl Programming Language)

        I wouldn't have predicted such a dramatic performance disparity across Perl versions.

        It took me by surprise also.

        I've done, and posted, this hashes versus vec benchmark many times over the years, yours was the first challenge to what I took to be simple fact.

        Its yet another plank in my rapidly growing conclusion that 5.10.1 was 'peak Perl'.

        Have updated my benchmarking post with "(Perl v5.14.2 64-bit)" - which I now think should always be included in benchmarks.

        I wholeheartedly concur and will endeavour to do the same in future.

        Greatly appreciate your informative reply!

        We both learned something. That's win-win. You can't ask for more :)


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.