in reply to Best method to diff very large array efficiently
Given the OP's two arrays containing approximately 8k unique integer elements, it's evident from benchmarking (Perl v5.14.2 64-bit) that using a hash and grep to find the elements in @arr_1 that are not in @arr_2 may be a good choice:
use strict; use warnings; use Set::Scalar; use List::Compare; use Benchmark qw/cmpthese/; my @arr_1 = 0 .. 8e3; my @arr_2 = 2e3 .. 1e4; sub setScalar { my $s1 = Set::Scalar->new(@arr_1); my $s2 = Set::Scalar->new(@arr_2); my $diff = $s1->difference($s2); } sub listCompare { my $lc = List::Compare->new( \@arr_1, \@arr_2 ); my @diff = $lc->get_Lonly; } sub OPdiff { my %diff3; @diff3{@arr_1} = @arr_1; delete @diff3{@arr_2}; my @diff = ( keys %diff3 ); } sub OPdiffModified { my %diff3; @diff3{@arr_1} = (); delete @diff3{@arr_2}; my @diff = ( keys %diff3 ); } sub OPdiff_undef { my %diff3; undef @diff3{@arr_1}; delete @diff3{@arr_2}; my @diff = ( keys %diff3 ); } sub using_vec { my $vec = ''; vec( $vec, $_, 1 ) = 1 for @arr_2; my @diff = grep !vec( $vec, $_, 1 ), @arr_1; } sub hash_grep { my %arr_2_hash; undef @arr_2_hash{@arr_2}; my @diff = grep !exists $arr_2_hash{$_}, @arr_1; } cmpthese( -5, { setScalar => sub { setScalar() }, listCompare => sub { listCompare() }, OPdiff => sub { OPdiff() }, OPdiffModified => sub { OPdiffModified() }, OPdiff_undef => sub { OPdiff_undef() }, using_vec => sub { using_vec() }, hash_grep => sub { hash_grep() } } );
Output:
Rate setScalar listCompare using_vec OPdiff OPdiffMod +ified hash_grep OPdiff_undef setScalar 7.94/s -- -72% -98% -98% + -98% -98% -99% listCompare 28.1/s 254% -- -92% -93% + -94% -94% -96% using_vec 349/s 4289% 1139% -- -7% + -27% -32% -47% OPdiff 375/s 4623% 1233% 8% -- + -22% -26% -42% OPdiffModified 478/s 5919% 1599% 37% 27% + -- -6% -27% hash_grep 510/s 6317% 1712% 46% 36% + 7% -- -22% OPdiff_undef 652/s 8105% 2217% 87% 74% + 36% 28% -- --
Edit I: My thanks to LanX for redirecting my attention back to the suggested @diff3{@arr_1} = (), as it makes a significant difference in performance, as shown in the OPdiffModified() benchmark results. Thus, based upon benchmarking for this task, OPdiffModified() is the best of this group of diff solutions for the OP.
Edit II: Thanks again to LanX (OK, I'll set up a new node for thanking LanX :), substituted @diff3{@arr_1} = () with undef @diff3{@arr_1} in OPdiff_undef(), and it's now the fastest.
|
---|