http://www.perlmonks.org?node_id=1067227


in reply to Comparing two arrays

My end result is not to know, for each (x,y) pair of arrays, how many 1's they share, just to know which are the top 10 y arrays that share the most 1's with each x array.

Convert your arrays of 0s and 1s to bit-strings, then use bitwise-& and unpack '%32b*' to count the shared 1s, and you can do this 300+ times faster than comparing the arrays:

    #! perl -slw
    use strict;
    use Benchmark qw[ cmpthese ];
    use Data::Dump qw[ pp ];
    $Data::Dump::WIDTH = 500;

    our $I //= -1;
    our $N //= 1000;

    our @xArrays = map[ map int( rand 2 ), 1 .. 15_000 ], 1 .. $N;
    our @yArrays = map[ map int( rand 2 ), 1 .. 15_000 ], 1 .. $N;

    our @xStrings = map{ join '', @$_ } @xArrays;
    our @yStrings = map{ join '', @$_ } @yArrays;

    our @xBits = map{ pack 'b*', $_ } @xStrings;
    our @yBits = map{ pack 'b*', $_ } @yStrings;

    cmpthese $I, {
        array => q[
            my %top10s;
            for my $x ( 0 .. $#xArrays ) {
                for my $y ( 0 .. $#yArrays ) {
                    my $count = 0;
                    $xArrays[$x][$_] == 1 && $yArrays[$y][$_] == 1 and ++$count for 0 .. $#{ $xArrays[ 0 ] };
                    $top10s{"$x:$y"} = $count;
                    my $discard = ( sort{ $top10s{$a} <=> $top10s{$b} } keys %top10s )[ 0 ];
                    keys( %top10s ) > 10 and delete $top10s{$discard};
                }
            }
            $I == 1 and pp ' arrays: ', %top10s;
        ],
        strings => q[
            my %top10s;
            for my $x ( 0 .. $#xStrings ) {
                for my $y ( 0 .. $#yStrings ) {
                    my $count = ( $xStrings[$x] & $yStrings[$y] ) =~ tr[1][];
                    $top10s{"$x:$y"} = $count;
                    my $discard = ( sort{ $top10s{$a} <=> $top10s{$b} } keys %top10s )[ 0 ];
                    keys( %top10s ) > 10 and delete $top10s{$discard};
                }
            }
            $I == 1 and pp 'strings: ', %top10s;
        ],
        bits => q[
            my %top10s;
            for my $x ( 0 .. $#xBits ) {
                for my $y ( 0 .. $#yBits ) {
                    my $count = unpack '%32b*', ( $xBits[$x] & $yBits[$y] );
                    $top10s{"$x:$y"} = $count;
                    my $discard = ( sort{ $top10s{$a} <=> $top10s{$b} } keys %top10s )[ 0 ];
                    keys( %top10s ) > 10 and delete $top10s{$discard};
                }
            }
            $I == 1 and pp '   bits: ', %top10s;
        ],
    };
    __END__
    C:\test>1067218 -N=100
                     Rate   array strings    bits
    array   1.95e-002/s      --    -98%   -100%
    strings      1.08/s   5417%      --    -82%
    bits         5.97/s  30510%    455%      --
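For the per-x selection the question asks about, something along these lines might do. It is only a sketch on tiny 8-element arrays: the helper names to_bits and top_n are made up for illustration, and the full sort per x is just the simplest way to pick the best few, not necessarily the fastest.

```perl
use strict;
use warnings;

# Hypothetical helper: pack an array ref of 0/1 values into a bit-string.
sub to_bits { pack 'b*', join '', @{ $_[0] } }

# Hypothetical helper: for one x bit-string, return [ index, count ] pairs
# for the $n y bit-strings that share the most 1s with it, best first.
sub top_n {
    my( $xBits, $yBits, $n ) = @_;
    my @scored = map {
        [ $_, unpack '%32b*', $xBits & $yBits->[$_] ]
    } 0 .. $#$yBits;
    # Highest count first; ties broken by lowest index.
    my @sorted = sort { $b->[1] <=> $a->[1] || $a->[0] <=> $b->[0] } @scored;
    $n = @sorted if $n > @sorted;
    return @sorted[ 0 .. $n - 1 ];
}

my $x  = to_bits( [ 1, 1, 0, 1, 0, 0, 1, 0 ] );
my @ys = map to_bits( $_ ), (
    [ 1, 0, 0, 1, 0, 0, 0, 0 ],    # shares 2 ones with x
    [ 0, 0, 1, 0, 1, 1, 0, 1 ],    # shares 0 ones with x
    [ 1, 1, 0, 1, 0, 0, 1, 1 ],    # shares 4 ones with x
);

printf "y%d shares %d\n", @$_ for top_n( $x, \@ys, 2 );
```

Calling top_n once per x bit-string with $n = 10 gives a separate top 10 for each x.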

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: Comparing two arrays
by baxy77bax (Deacon) on Dec 15, 2013 at 12:58 UTC
    Thank you so much for the code and the benchmark; after seeing this I'll try to implement the strategy. However, what I'm wondering now is where the speed comes from. When I search for a certain bit in a bit-string, I remember reading somewhere that the bit is found by iterating through the memory block, whereas accessing an array element is constant time. Is it possible that these constants are so large that it is cheaper to scan linearly through memory blocks, or did I mix something up (which is probably the case)? Could you please educate me a "bit" :)

    Thank you

    baxy

      i'm wondering now is where does the speed come from.

      Perhaps the simplest way to demonstrate the difference is to look at the number of opcodes generated in order to compare and count two sets of 64 bits stored as: two arrays; two strings of ascii 1s and 0s; two bitstrings of 64 bits each. You don't need to understand the opcodes to see the reduction.

      Moving as much of the work (looping) as possible into the optimised, compiled-C opcodes just saves huge swaths of time and processor:

      1. Arrays:
        C:\test>perl -MO=Terse -E"@a=map{int rand 2}1..64;@b=map{int rand 2}1..64; for my$a(@a){ for my $b(@b){ $a==$b and ++$count }}"
        LISTOP (0x34e7c58) leave [1]
        OP (0x34eec40) enter
        COP (0x34e7c98) nextstate
        BINOP (0x34e7d00) aassign [9]
        UNOP (0x34e7d70) null [142]
        OP (0x34e7d40) pushmark
        LOGOP (0x34e7e90) mapwhile [8]
        LISTOP (0x34e7f00) mapstart
        OP (0x34e7ed0) pushmark
        UNOP (0x34e7e58) null
        UNOP (0x34e7f40) null
        LISTOP (0x34e80d0) scope
        OP (0x34e8110) null [177]
        UNOP (0x34e8178) int [4]
        UNOP (0x34e81b0) rand [3]
        SVOP (0x34e81e8) const [7] IV (0x33cca88) 2
        UNOP (0x34e7f78) rv2av
        SVOP (0x34e7e20) const [26] AV (0x33c7570)
        UNOP (0x34e7de0) null [142]
        OP (0x34e7db0) pushmark
        UNOP (0x34e8220) rv2av [2]
        PADOP (0x34e8258) gv GV (0xa76c8) *a
        COP (0x34e7660) nextstate
        BINOP (0x34e76c8) aassign [18]
        UNOP (0x34e7738) null [142]
        OP (0x34e7708) pushmark
        LOGOP (0x34e7858) mapwhile [17]
        LISTOP (0x34e78c8) mapstart
        OP (0x34e7898) pushmark
        UNOP (0x34e7820) null
        UNOP (0x34e7908) null
        LISTOP (0x34e7a98) scope
        OP (0x34e7ad8) null [177]
        UNOP (0x34e7b40) int [13]
        UNOP (0x34e7b78) rand [12]
        SVOP (0x34e7bb0) const [16] IV (0x33c6e30) 2
        UNOP (0x34e7940) rv2av
        SVOP (0x34e77e8) const [27] AV (0x33c6830)
        UNOP (0x34e77a8) null [142]
        OP (0x34e7778) pushmark
        UNOP (0x34e7be8) rv2av [11]
        PADOP (0x34e7c20) gv GV (0x33c6f40) *b
        COP (0x34eecb0) nextstate
        BINOP (0x34eed18) leaveloop
        LOOP (0x34eee30) enteriter [19]
        OP (0x34eee88) null [3]
        UNOP (0x34eef28) null [142]
        OP (0x34eeef8) pushmark
        UNOP (0x34ef568) rv2av [21]
        PADOP (0x34e75b8) gv GV (0xa76c8) *a
        UNOP (0x34eed58) null
        LOGOP (0x34eed90) and
        OP (0x34eee00) iter
        LISTOP (0x34eef68) lineseq
        COP (0x34eefa8) nextstate
        BINOP (0x34ef010) leaveloop
        LOOP (0x34ef128) enteriter [22]
        OP (0x34ef180) null [3]
        UNOP (0x34ef220) null [142]
        OP (0x34ef1f0) pushmark
        UNOP (0x34ef4c8) rv2av [24]
        PADOP (0x34ef500) gv GV (0x33c6f40) *b
        UNOP (0x34ef050) null
        LOGOP (0x34ef088) and
        OP (0x34ef0f8) iter
        LISTOP (0x34ef260) lineseq
        COP (0x34ef2a0) nextstate
        UNOP (0x34ef308) null
        LOGOP (0x34ef340) and
        BINOP (0x34ef428) eq
        OP (0x34ef498) padsv [19]
        OP (0x34ef468) padsv [22]
        UNOP (0x34ef380) preinc
        UNOP (0x34ef3b8) null [15]
        PADOP (0x34ef3f0) gvsv GV (0x33c5ed0) *count
        OP (0x34ef0c8) unstack
        OP (0x34eedd0) unstack
        -e syntax OK
      2. Strings:
        C:\test>perl -MO=Terse -E"$a=join'',map{int rand 2}1..64;@b=map{int rand 2}1..64; $count=($a&$b)=~tr[1][]"
        LISTOP (0x3447bc0) leave [1]
        OP (0x344f178) enter
        COP (0x3447c00) nextstate
        BINOP (0x3447c68) sassign
        LISTOP (0x3447cd8) join [8]
        OP (0x3447ca8) pushmark
        SVOP (0x3448118) const [22] PV (0x332ca20) ""
        LOGOP (0x3447d88) mapwhile [7]
        LISTOP (0x3447df8) mapstart
        OP (0x3447dc8) pushmark
        UNOP (0x3447d50) null
        UNOP (0x3447e38) null
        LISTOP (0x3447fc8) scope
        OP (0x3448008) null [177]
        UNOP (0x3448070) int [3]
        UNOP (0x34480a8) rand [2]
        SVOP (0x34480e0) const [6] IV (0x332cb58) 2
        UNOP (0x3447e70) rv2av
        SVOP (0x3447d18) const [23] AV (0x3327640)
        UNOP (0x3448150) null [15]
        PADOP (0x3448188) gvsv GV (0xa76a8) *a
        COP (0x34475c8) nextstate
        BINOP (0x3447630) aassign [17]
        UNOP (0x34476a0) null [142]
        OP (0x3447670) pushmark
        LOGOP (0x34477c0) mapwhile [16]
        LISTOP (0x3447830) mapstart
        OP (0x3447800) pushmark
        UNOP (0x3447788) null
        UNOP (0x3447870) null
        LISTOP (0x3447a00) scope
        OP (0x3447a40) null [177]
        UNOP (0x3447aa8) int [12]
        UNOP (0x3447ae0) rand [11]
        SVOP (0x3447b18) const [15] IV (0x3326f00) 2
        UNOP (0x34478a8) rv2av
        SVOP (0x3447750) const [24] AV (0x3326900)
        UNOP (0x3447710) null [142]
        OP (0x34476e0) pushmark
        UNOP (0x3447b50) rv2av [10]
        PADOP (0x3447b88) gv GV (0x3327010) *b
        COP (0x344f1e8) nextstate
        BINOP (0x344f250) sassign
        UNOP (0x344f290) null
        BINOP (0x344f3e8) bit_and [21]
        UNOP (0x344f498) null [15]
        PADOP (0x34474e0) gvsv GV (0xa76a8) *a
        UNOP (0x344f428) null [15]
        PADOP (0x344f460) gvsv GV (0x3327010) *b
        PVOP (0x344f3b0) trans
        UNOP (0x3447518) null [15]
        PADOP (0x3447550) gvsv GV (0x33262d0) *count
        -e syntax OK
      3. Bits:
        C:\test>perl -MO=Terse -E"$a=int rand 2**64;$b=int rand 2**64; $count = unpack '%32b*', $a & $b"
        LISTOP (0x33e7460) leave [1]
        OP (0x33e6e60) enter
        COP (0x33e74a0) nextstate
        BINOP (0x33e7508) sassign
        UNOP (0x33e7548) int [4]
        UNOP (0x33e7580) rand [3]
        SVOP (0x33e75b8) const [13] NV (0x32ca498) 1.84467440737096e+019
        UNOP (0x33e76a0) null [15]
        PADOP (0x33e76d8) gvsv GV (0x107668) *a
        COP (0x33e71f0) nextstate
        BINOP (0x33e7258) sassign
        UNOP (0x33e7298) int [8]
        UNOP (0x33e72d0) rand [7]
        SVOP (0x33e7308) const [14] NV (0x32ca5a0) 1.84467440737096e+019
        UNOP (0x33e73f0) null [15]
        PADOP (0x33e7428) gvsv GV (0x32ca510) *b
        COP (0x33e6ed0) nextstate
        BINOP (0x33e6f38) sassign
        LISTOP (0x33e6fa8) unpack
        OP (0x33e6f78) null [3]
        SVOP (0x33e7108) const [15] PV (0x32ca600) "%32b*"
        BINOP (0x33e6fe8) bit_and [12]
        UNOP (0x33e7098) null [15]
        PADOP (0x33e70d0) gvsv GV (0x107668) *a
        UNOP (0x33e7028) null [15]
        PADOP (0x33e7060) gvsv GV (0x32ca510) *b
        UNOP (0x33e7140) null [15]
        PADOP (0x33e7178) gvsv GV (0x32ca5d0) *count
        -e syntax OK
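If you'd rather see the same three representations as runnable code, here is a rough sketch that counts the shared 1s of two 64-element arrays each way and checks that the answers agree. The sub names are made up for illustration, and the join/pack conversions are done inside the subs only to keep the sketch short; in real use you would convert once and keep the strings or bit-strings around.

```perl
use strict;
use warnings;

my @a = map { int rand 2 } 1 .. 64;
my @b = map { int rand 2 } 1 .. 64;

# 1. Arrays: the loop runs at the Perl level, several opcodes per element.
sub count_arrays {
    my( $p, $q ) = @_;
    my $count = 0;
    $p->[$_] && $q->[$_] and ++$count for 0 .. $#$p;
    return $count;
}

# 2. Strings of '0'/'1': one string bit_and plus one tr; both loop in C.
sub count_strings {
    my( $p, $q ) = @_;
    my $both = join( '', @$p ) & join( '', @$q );
    return $both =~ tr/1//;    # empty replacement list: count only
}

# 3. Bit-strings: one bit_and plus one unpack over the packed bytes.
sub count_bits {
    my( $p, $q ) = @_;
    return unpack '%32b*',
        pack( 'b*', join '', @$p ) & pack( 'b*', join '', @$q );
}

printf "arrays: %d  strings: %d  bits: %d\n",
    count_arrays( \@a, \@b ),
    count_strings( \@a, \@b ),
    count_bits( \@a, \@b );
```

All three should print the same count; only the amount of work done per opcode differs.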

      Just be careful to create your data as bitstrings in the first place. If you create arrays and then turn them into bitstrings to do the comparison, then it is not that fast:

        use strict;
        use warnings;
        use Benchmark 'cmpthese';

        sub create {
            map {rand() < $_[1] ? 1 : 0} 1..$_[0]
        }

        sub compare2a { # first find 1s in x, then check in ys
            my $x = shift;
            my $n = shift;
            my @nxs = grep { $x->[$_] } 0..$n-1;
            return map { scalar grep {$_} @{$_}[@nxs] } @_;
        }

        sub compare4 { # bitstrings
            my $x = shift;
            $x = pack 'b*', join '', @$x;
            return map { unpack '%32b*', ( $x & pack 'b*', join'',@$_ ) } @_;
        }

        my $n = 15000;
        my $p = 0.005;
        my $ny = 10;
        my @x = create $n, $p;
        my @ys = map { [ create $n, $p ] } 1..$ny;

        my @r2a = compare2a \@x, $n, @ys;
        my @r4 = compare4 \@x, @ys;
        print "compare2a: @r2a\n";
        print "compare4: @r4\n";

        cmpthese( -5, {
            compare2a => sub{ compare2a \@x, $n, @ys },
            compare4 => sub{ compare4 \@x, @ys },
        } );
      Result:
                   Rate  compare4 compare2a
        compare4  246/s        --      -55%
        compare2a 543/s      120%        --
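One way to create the data as bitstrings in the first place is to set the bits with vec, which numbers bits in the same order as pack 'b*'. This is only a sketch: create_bits and compare_bits are made-up names, and the sizes just mirror the test above.

```perl
use strict;
use warnings;

# Hypothetical helper: build an $n-bit string in which each bit is set with
# probability $p, without creating an intermediate array of 0s and 1s.
sub create_bits {
    my( $n, $p ) = @_;
    my $bits = "\0" x int( ( $n + 7 ) / 8 );    # pre-sized, all bits clear
    rand() < $p and vec( $bits, $_, 1 ) = 1 for 0 .. $n - 1;
    return $bits;
}

# Hypothetical helper: count the 1s that $x shares with each other bit-string.
sub compare_bits {
    my $x = shift;
    return map { unpack '%32b*', $x & $_ } @_;
}

my( $n, $p, $ny ) = ( 15_000, 0.005, 10 );
my $x  = create_bits( $n, $p );
my @ys = map { create_bits( $n, $p ) } 1 .. $ny;

print join( ' ', compare_bits( $x, @ys ) ), "\n";
```

With the data held this way, each comparison is just the bitwise-& and the unpack, with no join or pack on every call.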
        If you create arrays and then turn them into bitstrings [ everytime ] to do the comparison, then it is not that fast:

        No shit Sherlock :)
