http://www.perlmonks.org?node_id=1064178

newbieperlperson has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl experts,

Seeking advice here. I am a newbie with Perl and looking for input on the quickest way to perform a difference between two arrays.

The two arrays are going to be large: they could hold between 6,000 and 8,000 elements, and the elements will hold unique data. Due to the size of these arrays, the diff will need to be fast and not CPU-intensive.

Here is the code I have used, which correctly carries out its function: finding the items in @arr_1 that are not present in @arr_2.

The data in each of the arrays is unique and is an INT data type.

My question is whether there is a faster way that is less intensive on the CPU?

  

    my %diff3;
    @diff3{ @arr_1 } = @arr_1;
    delete @diff3{ @arr_2 };
    @diff = ( keys %diff3 );

Thank you in advance, once I get up to speed on Perl, I am looking forward to paying it back.

AJ

Replies are listed 'Best First'.
Re: Best method to diff large array
by LanX (Saint) on Nov 25, 2013 at 04:29 UTC
    Using hash-slices like you demonstrated is the fastest way I know. (But you didn't show us your data)

    But I'm confused about the sorts.

    a) why do you think you need them? Sorting is comparatively slow!

    b) do you really have numeric data? Otherwise <=> won't help!

    For completeness:

    If you only have scalars as data that stringify in a unique way (i.e. no references), you don't need to populate the values; just take the keys: @hash{@arr1} = ().

    And I think you also want to calculate the symmetric difference, i.e. @arr2 \ @arr1 is missing.
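
    For example, here is a minimal sketch of the keys-only slice, plus the other direction to get the symmetric difference (assuming plain integer scalars as in the OP; the variable names are just for illustration):

        # @arr_1 \ @arr_2 : keys-only slice, then delete everything also in @arr_2
        my %only_in_1;
        @only_in_1{@arr_1} = ();        # values stay undef, only the keys matter
        delete @only_in_1{@arr_2};

        # @arr_2 \ @arr_1 : the direction missing from the OP's code
        my %only_in_2;
        @only_in_2{@arr_2} = ();
        delete @only_in_2{@arr_1};

        my @symmetric_diff = ( keys %only_in_1, keys %only_in_2 );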

    Cheers Rolf

    ( addicted to the Perl Programming Language)

    PS: Maybe of interest: Using hashes for set operations...

      Hi Rolf,

      Thank you for responding.

      Good point on the sort, it is not required, I will remove that from my example.

      The goal for the code is to find what data is missing from @arr_1.

      AJ
Re: Best method to diff large array
by BrowserUk (Patriarch) on Nov 25, 2013 at 04:32 UTC

    Can there be duplicate values in either array?

    What information do you need as your result?

    1. the overlap between the arrays?
    2. What is left in the first array, once anything also found in the second is removed?
    3. Or vice versa?
    4. Or both?
    5. Or all three?

    BTW: In your example, you assign the keys that remain in the hash to @dropped, which doesn't, in isolation, make a lot of sense.

    Also, if you use a hash to determine this, there is no point in sorting the arrays first: it will make no difference to the result and will just cost time.
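
    If more than one of those results is needed, something like the following would work (not taken from the thread, just a rough sketch assuming unique values in both arrays):

        # lookup hashes, keys only
        my ( %in_1, %in_2 );
        @in_1{@arr_1} = ();
        @in_2{@arr_2} = ();

        my ( @overlap, @only_in_1 );
        for my $x (@arr_1) {
            if ( exists $in_2{$x} ) { push @overlap,   $x }    # in both arrays
            else                    { push @only_in_1, $x }    # only in @arr_1
        }
        my @only_in_2 = grep { !exists $in_1{$_} } @arr_2;     # only in @arr_2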


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thank you for taking the time to respond.

      I agree, the sort is not required and I will remove that.

      The information I need is the set of elements in @arr_1 that are not in @arr_2.

      I am not at work but will make the edits to the code tomorrow and check the results.

      I think @dropped is incorrect verbiage; I will change it to @diff.

        You didn't say whether the values in each of the two arrays are unique?

        Also, what are the values in the arrays? I.e. strings, numbers, integers, small(ish) integers, etc.?


Re: Best method to diff very large array efficiently
by hdb (Monsignor) on Nov 25, 2013 at 08:36 UTC

    Instead of @diff3{ @arr_1 } = @arr_1; you can also say undef @diff3{ @arr_1 }; which creates hash entries with an undef value and is pretty fast.
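
    Applied to the OP's snippet, that would look something like this (just a sketch of the substitution, keeping the rest of the code unchanged):

        my %diff3;
        undef @diff3{ @arr_1 };     # hash slice: keys from @arr_1, all values undef
        delete @diff3{ @arr_2 };    # drop any key also present in @arr_2
        my @diff = keys %diff3;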

Re: Best method to diff very large array efficiently
by LanX (Saint) on Nov 25, 2013 at 12:51 UTC
Re: Best method to diff very large array efficiently
by zentara (Archbishop) on Nov 25, 2013 at 15:40 UTC
Re: Best method to diff very large array efficiently
by Kenosis (Priest) on Nov 25, 2013 at 19:20 UTC

    Given the OP's two arrays containing approximately 8k unique integer elements, it's evident from benchmarking (Perl v5.14.2 64-bit) that using a hash and grep to find the elements in @arr_1 that are not in @arr_2 may be a good choice:

    use strict;
    use warnings;
    use Set::Scalar;
    use List::Compare;
    use Benchmark qw/cmpthese/;

    my @arr_1 = 0 .. 8e3;
    my @arr_2 = 2e3 .. 1e4;

    sub setScalar {
        my $s1   = Set::Scalar->new(@arr_1);
        my $s2   = Set::Scalar->new(@arr_2);
        my $diff = $s1->difference($s2);
    }

    sub listCompare {
        my $lc   = List::Compare->new( \@arr_1, \@arr_2 );
        my @diff = $lc->get_Lonly;
    }

    sub OPdiff {
        my %diff3;
        @diff3{@arr_1} = @arr_1;
        delete @diff3{@arr_2};
        my @diff = ( keys %diff3 );
    }

    sub OPdiffModified {
        my %diff3;
        @diff3{@arr_1} = ();
        delete @diff3{@arr_2};
        my @diff = ( keys %diff3 );
    }

    sub OPdiff_undef {
        my %diff3;
        undef @diff3{@arr_1};
        delete @diff3{@arr_2};
        my @diff = ( keys %diff3 );
    }

    sub using_vec {
        my $vec = '';
        vec( $vec, $_, 1 ) = 1 for @arr_2;
        my @diff = grep !vec( $vec, $_, 1 ), @arr_1;
    }

    sub hash_grep {
        my %arr_2_hash;
        undef @arr_2_hash{@arr_2};
        my @diff = grep !exists $arr_2_hash{$_}, @arr_1;
    }

    cmpthese(
        -5,
        {   setScalar      => sub { setScalar() },
            listCompare    => sub { listCompare() },
            OPdiff         => sub { OPdiff() },
            OPdiffModified => sub { OPdiffModified() },
            OPdiff_undef   => sub { OPdiff_undef() },
            using_vec      => sub { using_vec() },
            hash_grep      => sub { hash_grep() }
        }
    );

    Output:

                       Rate setScalar listCompare using_vec OPdiff OPdiffModified hash_grep OPdiff_undef
    setScalar        7.94/s        --        -72%      -98%   -98%           -98%      -98%         -99%
    listCompare      28.1/s      254%          --      -92%   -93%           -94%      -94%         -96%
    using_vec         349/s     4289%       1139%        --    -7%           -27%      -32%         -47%
    OPdiff            375/s     4623%       1233%        8%     --           -22%      -26%         -42%
    OPdiffModified    478/s     5919%       1599%       37%    27%             --       -6%         -27%
    hash_grep         510/s     6317%       1712%       46%    36%             7%        --         -22%
    OPdiff_undef      652/s     8105%       2217%       87%    74%            36%       28%           --

    Edit I: My thanks to LanX for redirecting my attention back to the suggested @diff3{@arr_1} = (), as it makes a significant difference in performance, as shown in the OPdiffModified() benchmark results. Thus, based upon benchmarking for this task, OPdiffModified() is the best of this group of diff solutions for the OP.

    Edit II: Thanks again to LanX (OK, I'll set up a new node for thanking LanX :). I replaced @diff3{@arr_1} = () with undef @diff3{@arr_1} in OPdiff_undef(), and it's now the fastest.

      As already explained, if the keys are sufficient then setting the values doesn't make sense (well, the OP was updated without mention...).

      Changing this @diff3{@arr_1} = @arr_1; to  @diff3{@arr_1} = () makes some difference.

      Cheers Rolf

      ( addicted to the Perl Programming Language)

        I used undef @arr_2_hash{@arr_2}; in sub hash_grep(), noting it was faster than the OP's original.

        Changing this @diff3{@arr_1} = @arr_1; to @diff3{@arr_1} = () makes some difference.

        No--it makes a huge difference and it, by far, blows everything else away. Will make that change in a new sub and re-benchmark. Glad you mentioned it!

      Interesting. Here are the results of your benchmark (minus Set::Scalar) run using my default perl (5.10.1 64-bit):

      C:\test>1064178-b.pl
                         Rate listCompare OPdiff hash_grep OPdiffModified OPdiff_undef using_vec
      listCompare      12.9/s          --   -86%      -93%           -94%         -94%      -95%
      OPdiff           95.1/s        639%     --      -48%           -52%         -53%      -65%
      hash_grep         185/s       1334%    94%        --            -8%          -9%      -32%
      OPdiffModified    200/s       1452%   110%        8%             --          -2%      -27%
      OPdiff_undef      203/s       1478%   114%       10%             2%           --      -26%
      using_vec         273/s       2019%   187%       48%            37%          34%        --

      And this using 5.18 64-bit (also minus List::Compare):

      C:\test>\perl5.18\bin\perl 1064178-b.pl
                         Rate OPdiff using_vec hash_grep OPdiffModified OPdiff_undef
      OPdiff            126/s     --      -22%      -31%           -43%         -44%
      using_vec         162/s    28%        --      -12%           -26%         -28%
      hash_grep         183/s    45%       13%        --           -17%         -19%
      OPdiffModified    220/s    74%       36%       20%             --          -2%
      OPdiff_undef      225/s    79%       39%       23%             2%           --

      They've really screwed up vec. ( Along with substr and a bunch of others :( )



        I wouldn't have predicted such a dramatic performance disparity across Perl versions. Have updated my benchmarking post with "(Perl v5.14.2 64-bit)", which I now think should always be included in benchmarks. Greatly appreciate your informative reply!