Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Re^5: Best method to diff very large array efficiently

by LanX (Canon)
on Nov 27, 2013 at 01:21 UTC ( #1064510=note: print w/ replies, xml ) Need Help??


in reply to Re^4: Best method to diff very large array efficiently
in thread Best method to diff very large array efficiently

> found bug in benchmark, will correct later. vec still among slowest

as promised:

Integrity of the results are tested in a way that excludes possible sideeffects of benchmark-routines.

Benchmarks are run with different datasets with shrinking densities ("vec"-approach breaks >1e10)

Result: "vec" doesn't scale well for larger values...

Perlversion v5.10.0 Setting values range: 1..1e4 ok 1 - hash_grep ~~ hash_key (3231 entries) ok 2 - hash_key ~~ hash_values (3231 entries) ok 3 - hash_values ~~ using_vec (3231 entries) ok 4 - using_vec ~~ hash_grep (3231 entries) ok 5 - hash_grep ~~ hash_key (3231 entries) ok 6 - hash_key ~~ hash_values (3231 entries) ok 7 - hash_values ~~ using_vec (3231 entries) 1..7 Setting values range: 1..1e4 Rate hash_values hash_key using_vec hash_grep hash_values 29.8/s -- -45% -48% -48% hash_key 53.8/s 81% -- -7% -7% using_vec 57.7/s 93% 7% -- -0% hash_grep 57.7/s 93% 7% 0% -- Setting values range: 1..1e6 Rate hash_values hash_key using_vec hash_grep hash_values 28.5/s -- -24% -24% -25% hash_key 37.3/s 31% -- -0% -2% using_vec 37.3/s 31% 0% -- -2% hash_grep 38.2/s 34% 2% 2% -- Setting values range: 1..1e8 Rate using_vec hash_values hash_grep hash_key using_vec 17.1/s -- -41% -53% -57% hash_values 29.1/s 70% -- -20% -26% hash_grep 36.2/s 112% 24% -- -8% hash_key 39.5/s 131% 36% 9% -- Setting values range: 1..1e9 Rate using_vec hash_values hash_grep hash_key using_vec 3.29/s -- -88% -91% -92% hash_values 28.2/s 759% -- -21% -27% hash_grep 36.0/s 993% 27% -- -7% hash_key 38.8/s 1081% 38% 8% -- Compilation finished at Wed Nov 27 02:08:15
code:
use strict; use warnings; use Benchmark qw/cmpthese/; use Data::Dump qw/pp/; use feature 'say'; use Test::More; say "Perlversion $^V"; my (@arr_1,@arr_2); sub init_data{ my $range_limit=shift; say "\n\n Setting values range: 1..$range_limit\n"; my %unique; $unique{int rand $range_limit}=undef while keys %unique <8000; @arr_1=keys %unique; %unique=(); $unique{int rand $range_limit}=undef while keys %unique <6000; @arr_2=keys %unique; } { package CMP; my @res= ("") x 8000; sub hash_values_diff { my %diff3; @diff3{@arr_1} = @arr_1; delete @diff3{@arr_2}; @res = values %diff3 ; } sub hash_key_diff { my %diff3; undef @diff3{@arr_1}; delete @diff3{@arr_2}; @res = keys %diff3 ; } sub using_vec_diff { my $vec = ''; vec( $vec, $_, 1 ) = 1 for @arr_2; @res = grep !vec( $vec, $_, 1 ), @arr_1; } sub hash_grep_diff { my %arr_2_hash; undef @arr_2_hash{@arr_2}; @res = grep !exists $arr_2_hash{$_}, @arr_1; } } #--- Test benchmarks for correctness init_data("1e4"); test_subs('_diff$',"CMP"); #--- Compare benchmarks foreach my $range (qw/1e4 1e6 1e8 1e9/) { init_data($range); cmpthese(-5, pckg_subs('_diff$',"CMP") ); } #====== Utility functions for smarter benchmarks sub pckg_subs { my $filter = shift // qr[.]; my $pckg_name = shift // "CMP"; my $stash = do { no strict 'refs'; \ %{ "${pckg_name}::" }; }; # filter all subs from package my $codehashref; while (my ($name,$glob)= each %$stash) { if ( my $cref = *{$glob}{CODE} and $name =~ s/$filter// ) { # print "$name:\t$glob\n"; $codehashref->{$name}=$cref; } } return $codehashref; } sub test_subs { my $h_subs=pckg_subs(@_); my ($last_name,@last_res,$name,$code,@res); # compair pairwise, loop twice to exclude sideeffects for (1..2) { while (($name,$code) =each %$h_subs ) { @res = sort $code->(); is_deeply(\@last_res,\@res,"$last_name ~~ $name (". scalar @res +. " entries)") if $last_name; @last_res = @res; $last_name = $name; } } done_testing( ); }

Cheers Rolf

( addicted to the Perl Programming Language)


Comment on Re^5: Best method to diff very large array efficiently
Select or Download Code

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1064510]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (15)
As of 2015-07-02 13:49 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (40 votes), past polls