http://www.perlmonks.org?node_id=1064178

newbieperlperson has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl experts,

Seeking advice here. I am a newbie with Perl and looking for input on the quickest way to perform a difference between two arrays.

The two arrays are going to be large: they could hold between 6,000 and 8,000 elements, and the elements will hold unique data. Due to the size of these arrays, the diff will need to be fast and not CPU-intensive.

Here is the code I have used, which correctly carries out its function: finding the items in @arr_1 that are not present in @arr_2.

The data in each of the arrays is unique and is an INT data type.

My question is whether there is a faster way that is less intensive on the CPU?

  

    my %diff3;
    @diff3{ @arr_1 } = @arr_1;
    delete @diff3{ @arr_2 };
    @diff = ( keys %diff3 );

Thank you in advance, once I get up to speed on Perl, I am looking forward to paying it back.

AJ

Replies are listed 'Best First'.
Re: Best method to diff large array
by LanX (Saint) on Nov 25, 2013 at 04:29 UTC
    Using hash-slices like you demonstrated is the fastest way I know. (But you didn't show us your data)

    But I'm confused about the sorts.

    a) why do you think you need them? Sorting is comparatively slow!

    b) do you really have numeric data? Otherwise <=> won't help!

    For completeness:

    If you only have scalars as data that stringify in a unique way (i.e. no references), you don't need to populate the values; just take the keys: @hash{@arr1} = ().

    And I think you also want to calculate the symmetric difference, i.e. @arr2 \ @arr1 is missing.
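
    For example, here is a minimal sketch of the keys-only slice, plus the other direction to get the symmetric difference (assuming plain integer scalars as in the OP; the variable names are just for illustration):

        # @arr_1 \ @arr_2 : keys-only slice, then delete everything also in @arr_2
        my %only_in_1;
        @only_in_1{@arr_1} = ();        # values stay undef, only the keys matter
        delete @only_in_1{@arr_2};

        # @arr_2 \ @arr_1 : the direction missing from the OP's code
        my %only_in_2;
        @only_in_2{@arr_2} = ();
        delete @only_in_2{@arr_1};

        my @symmetric_diff = ( keys %only_in_1, keys %only_in_2 );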

    Cheers Rolf

    ( addicted to the Perl Programming Language)

    PS: Maybe of interest: Using hashes for set operations...

      Hi Rolf,

      Thank you for responding.

      Good point on the sort, it is not required, I will remove that from my example.

      The goal for the code is to find what data is missing from @arr_1.

      AJ
Re: Best method to diff large array
by BrowserUk (Patriarch) on Nov 25, 2013 at 04:32 UTC

    Can there be duplicate values in either array?

    What information do you need as your result?

    1. the overlap between the arrays?
    2. What is left in the first array, once anything also found in the second is removed?
    3. Or vice versa?
    4. Or both?
    5. Or all three?

    BTW: In your example, you assign the keys that remain in the hash to @dropped, which doesn't, in isolation, make a lot of sense.

    Also, if you use a hash to determine this, there is no point in sorting the arrays first: it will make no difference to the result and will just cost time.
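
    If more than one of those results is needed, something like the following would work (not taken from the thread, just a rough sketch assuming unique values in both arrays):

        # lookup hashes, keys only
        my ( %in_1, %in_2 );
        @in_1{@arr_1} = ();
        @in_2{@arr_2} = ();

        my ( @overlap, @only_in_1 );
        for my $x (@arr_1) {
            if ( exists $in_2{$x} ) { push @overlap,   $x }    # in both arrays
            else                    { push @only_in_1, $x }    # only in @arr_1
        }
        my @only_in_2 = grep { !exists $in_1{$_} } @arr_2;     # only in @arr_2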


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thank you for taking the time to respond.

      I agree, the sort is not required and I will remove that.

      The information I need is the set of elements in @arr_1 that are not in @arr_2.

      I am not at work but will make the edits to the code tomorrow and check the results.

      I think @dropped is incorrect verbiage; I will change it to @diff.

        You didn't say whether the values in each of the two arrays are unique?

        Also, what are the values in the arrays? I.e. strings, numbers, integers, small(ish) integers, etc.?


Re: Best method to diff very large array efficiently
by hdb (Monsignor) on Nov 25, 2013 at 08:36 UTC

    Instead of @diff3{ @arr_1 } = @arr_1; you can also say undef @diff3{ @arr_1 }; which creates hash entries with an undef value and is pretty fast.
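
    Applied to the OP's snippet, that would look something like this (just a sketch of the substitution, keeping the rest of the code unchanged):

        my %diff3;
        undef @diff3{ @arr_1 };     # hash slice: keys from @arr_1, all values undef
        delete @diff3{ @arr_2 };    # drop any key also present in @arr_2
        my @diff = keys %diff3;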

Re: Best method to diff very large array efficiently
by LanX (Saint) on Nov 25, 2013 at 12:51 UTC
Re: Best method to diff very large array efficiently
by zentara (Archbishop) on Nov 25, 2013 at 15:40 UTC
Re: Best method to diff very large array efficiently
by Kenosis (Priest) on Nov 25, 2013 at 19:20 UTC

    Given the OP's two arrays containing approximately 8k unique integer elements, it's evident from benchmarking (Perl v5.14.2 64-bit) that using a hash and grep to find the elements in @arr_1 that are not in @arr_2 may be a good choice:

    use strict;
    use warnings;
    use Set::Scalar;
    use List::Compare;
    use Benchmark qw/cmpthese/;

    my @arr_1 = 0 .. 8e3;
    my @arr_2 = 2e3 .. 1e4;

    sub setScalar {
        my $s1   = Set::Scalar->new(@arr_1);
        my $s2   = Set::Scalar->new(@arr_2);
        my $diff = $s1->difference($s2);
    }

    sub listCompare {
        my $lc   = List::Compare->new( \@arr_1, \@arr_2 );
        my @diff = $lc->get_Lonly;
    }

    sub OPdiff {
        my %diff3;
        @diff3{@arr_1} = @arr_1;
        delete @diff3{@arr_2};
        my @diff = ( keys %diff3 );
    }

    sub OPdiffModified {
        my %diff3;
        @diff3{@arr_1} = ();
        delete @diff3{@arr_2};
        my @diff = ( keys %diff3 );
    }

    sub OPdiff_undef {
        my %diff3;
        undef @diff3{@arr_1};
        delete @diff3{@arr_2};
        my @diff = ( keys %diff3 );
    }

    sub using_vec {
        my $vec = '';
        vec( $vec, $_, 1 ) = 1 for @arr_2;
        my @diff = grep !vec( $vec, $_, 1 ), @arr_1;
    }

    sub hash_grep {
        my %arr_2_hash;
        undef @arr_2_hash{@arr_2};
        my @diff = grep !exists $arr_2_hash{$_}, @arr_1;
    }

    cmpthese(
        -5,
        {   setScalar      => sub { setScalar() },
            listCompare    => sub { listCompare() },
            OPdiff         => sub { OPdiff() },
            OPdiffModified => sub { OPdiffModified() },
            OPdiff_undef   => sub { OPdiff_undef() },
            using_vec      => sub { using_vec() },
            hash_grep      => sub { hash_grep() }
        }
    );

    Output:

                       Rate setScalar listCompare using_vec OPdiff OPdiffModified hash_grep OPdiff_undef
    setScalar        7.94/s        --        -72%      -98%   -98%           -98%      -98%         -99%
    listCompare      28.1/s      254%          --      -92%   -93%           -94%      -94%         -96%
    using_vec         349/s     4289%       1139%        --    -7%           -27%      -32%         -47%
    OPdiff            375/s     4623%       1233%        8%     --           -22%      -26%         -42%
    OPdiffModified    478/s     5919%       1599%       37%    27%             --       -6%         -27%
    hash_grep         510/s     6317%       1712%       46%    36%             7%        --         -22%
    OPdiff_undef      652/s     8105%       2217%       87%    74%            36%       28%           --

    Edit I: My thanks to LanX for redirecting my attention back to the suggested @diff3{@arr_1} = (), as it makes a significant difference in performance, as shown in the OPdiffModified() benchmark results. Thus, based upon benchmarking for this task, OPdiffModified() is the best of this group of diff solutions for the OP.

    Edit II: Thanks again to LanX (OK, I'll set up a new node for thanking LanX :). I replaced @diff3{@arr_1} = () with undef @diff3{@arr_1} in OPdiff_undef(), and it's now the fastest.

      As already explained, if the keys are sufficient then setting the values doesn't make sense (well, the OP was updated without mention...).

      Changing this @diff3{@arr_1} = @arr_1; to  @diff3{@arr_1} = () makes some difference.

      Cheers Rolf

      ( addicted to the Perl Programming Language)

        I used undef @arr_2_hash{@arr_2}; in sub hash_grep(), noting it was faster than the OP's original.

        Changing this @diff3{@arr_1} = @arr_1; to @diff3{@arr_1} = () makes some difference.

        No--it makes a huge difference and it, by far, blows everything else away. Will make that change in a new sub and re-benchmark. Glad you mentioned it!

      Interesting. Here are the results of your benchmark (minus Set::Scalar) run using my default perl (5.10.1 64-bit):

      C:\test>1064178-b.pl
                         Rate listCompare OPdiff hash_grep OPdiffModified OPdiff_undef using_vec
      listCompare      12.9/s          --   -86%      -93%           -94%         -94%      -95%
      OPdiff           95.1/s        639%     --      -48%           -52%         -53%      -65%
      hash_grep         185/s       1334%    94%        --            -8%          -9%      -32%
      OPdiffModified    200/s       1452%   110%        8%             --          -2%      -27%
      OPdiff_undef      203/s       1478%   114%       10%             2%           --      -26%
      using_vec         273/s       2019%   187%       48%            37%          34%        --

      And this using 5.18 64-bit (also minus List::Compare):

      C:\test>\perl5.18\bin\perl 1064178-b.pl
                         Rate OPdiff using_vec hash_grep OPdiffModified OPdiff_undef
      OPdiff            126/s     --      -22%      -31%           -43%         -44%
      using_vec         162/s    28%        --      -12%           -26%         -28%
      hash_grep         183/s    45%       13%        --           -17%         -19%
      OPdiffModified    220/s    74%       36%       20%             --          -2%
      OPdiff_undef      225/s    79%       39%       23%             2%           --

      They've really screwed up vec. ( Along with substr and a bunch of others :( )



        I wouldn't have predicted such a dramatic performance disparity across Perl versions. Have updated my benchmarking post with "(Perl v5.14.2 64-bit)", which I now think should always be included in benchmarks. Greatly appreciate your informative reply!