
### Re: Comparing two arrays

by BrowserUk (Pope) on Dec 15, 2013 at 12:16 UTC (#1067227)

in reply to Comparing two arrays

My end result is not to know, for each (x,y) array, how many 1's they share, just to know what are the top 10 y arrays that share the most 1's with each x array.

Convert your arrays of 0s and 1s to bit-strings, then use bitwise & and unpack '%32b*' to count the 1s they share, and you can do this 300+ times faster than comparing the arrays:

```perl
#! perl -slw
use strict;
use Benchmark qw[ cmpthese ];
use Data::Dump qw[ pp ]; $Data::Dump::WIDTH = 500;

# The -s switch lets -I=... and -N=... on the command line set these.
our $I //= -1;
our $N //= 1000;

our @xArrays = map[ map int( rand 2 ), 1 .. 15_000 ], 1 .. $N;
our @yArrays = map[ map int( rand 2 ), 1 .. 15_000 ], 1 .. $N;

our @xStrings = map{ join '', @$_  } @xArrays;
our @yStrings = map{ join '', @$_  } @yArrays;

our @xBits = map{ pack 'b*', $_ } @xStrings;
our @yBits = map{ pack 'b*', $_ } @yStrings;

cmpthese $I, {
    array => q[
        my %top10s;
        for my $x ( 0 .. $#xArrays ) {
            for my $y ( 0 .. $#yArrays ) {
                my $count = 0;
                $xArrays[$x][$_] == 1 && $yArrays[$y][$_] == 1 and ++$count
                    for 0 .. $#{ $xArrays[ 0 ] };
                $top10s{"$x:$y"} = $count;
                my $discard = ( sort{ $top10s{$a} <=> $top10s{$b} } keys %top10s )[ 0 ];
                keys( %top10s ) > 10 and delete $top10s{$discard};
            }
        }
        $I == 1 and pp ' arrays: ', %top10s;
    ],
    strings => q[
        my %top10s;
        for my $x ( 0 .. $#xStrings ) {
            for my $y ( 0 .. $#yStrings ) {
                my $count = ( $xStrings[$x] & $yStrings[$y] ) =~ tr[1][];
                $top10s{"$x:$y"} = $count;
                my $discard = ( sort{ $top10s{$a} <=> $top10s{$b} } keys %top10s  )[ 0 ];
                keys( %top10s ) > 10 and delete $top10s{$discard};
            }
        }
        $I == 1 and pp 'strings: ', %top10s;
    ],
    bits => q[
        my %top10s;
        for my $x ( 0 .. $#xBits ) {
            for my $y ( 0 .. $#yBits ) {
                my $count = unpack '%32b*', ( $xBits[$x] & $yBits[$y] );
                $top10s{"$x:$y"} = $count;
                my $discard = ( sort{ $top10s{$a} <=> $top10s{$b} } keys %top10s )[ 0 ];
                keys( %top10s ) > 10 and delete $top10s{$discard};
            }
        }
        $I == 1 and pp '   bits: ', %top10s;
    ],
};

__END__
C:\test>1067218 -N=100
               Rate   array strings    bits
array   1.95e-002/s      --    -98%   -100%
strings      1.08/s   5417%      --    -82%
bits         5.97/s  30510%    455%      --
```
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^2: Comparing two arrays
by baxy77bax (Chaplain) on Dec 15, 2013 at 12:58 UTC
Thank you so much for the code and the benchmark. After seeing this I'll try to implement the strategy. However, what I'm wondering now is where the speed comes from. I remember reading somewhere that a certain bit in a bit-string is found by iterating through the memory block, whereas accessing an array element is constant time. Is it possible that these constants are so large that it is cheaper to scan linearly through memory blocks, or did I mix something up (which is probably the case)? Could you please educate me a "bit" :)

Thank you

baxy

i'm wondering now is where does the speed come from.

Perhaps the simplest way to demonstrate the difference is to look at the number of opcodes generated in order to compare and count two sets of 64 bits stored as: two arrays; two strings of ASCII 1s and 0s; two bitstrings of 64 bits each. You don't need to understand the opcodes to see the reduction.

Moving as much of the work (the looping) as possible into the optimised, compiled-C opcodes saves huge swaths of time and processor cycles:

1. Arrays:
```
C:\test>perl -MO=Terse -E"@a=map{int rand 2}1..64;@b=map{int rand 2}1..64; for my$a(@a){ for my $b(@b){ $a==$b and ++$count }}"
LISTOP (0x34e7c58) leave [1]
OP (0x34eec40) enter
COP (0x34e7c98) nextstate
BINOP (0x34e7d00) aassign [9]
UNOP (0x34e7d70) null [142]
OP (0x34e7d40) pushmark
LOGOP (0x34e7e90) mapwhile [8]
LISTOP (0x34e7f00) mapstart
OP (0x34e7ed0) pushmark
UNOP (0x34e7e58) null
UNOP (0x34e7f40) null
LISTOP (0x34e80d0) scope
OP (0x34e8110) null [177]
UNOP (0x34e8178) int [4]
UNOP (0x34e81b0) rand [3]
SVOP (0x34e81e8) const [7] IV (0x33cca88) 2
UNOP (0x34e7f78) rv2av
SVOP (0x34e7e20) const [26] AV (0x33c7570)
UNOP (0x34e7de0) null [142]
OP (0x34e7db0) pushmark
UNOP (0x34e8220) rv2av [2]
PADOP (0x34e8258) gv  GV (0xa76c8) *a
COP (0x34e7660) nextstate
BINOP (0x34e76c8) aassign [18]
UNOP (0x34e7738) null [142]
OP (0x34e7708) pushmark
LOGOP (0x34e7858) mapwhile [17]
LISTOP (0x34e78c8) mapstart
OP (0x34e7898) pushmark
UNOP (0x34e7820) null
UNOP (0x34e7908) null
LISTOP (0x34e7a98) scope
UNOP (0x34e7b40) int [13]
UNOP (0x34e7b78) rand [12]
SVOP (0x34e7bb0) const [16] IV (0x33c6e30) 2
UNOP (0x34e7940) rv2av
SVOP (0x34e77e8) const [27] AV (0x33c6830)
UNOP (0x34e77a8) null [142]
OP (0x34e7778) pushmark
UNOP (0x34e7be8) rv2av [11]
PADOP (0x34e7c20) gv  GV (0x33c6f40) *b
COP (0x34eecb0) nextstate
BINOP (0x34eed18) leaveloop
LOOP (0x34eee30) enteriter [19]
OP (0x34eee88) null [3]
UNOP (0x34eef28) null [142]
OP (0x34eeef8) pushmark
UNOP (0x34ef568) rv2av [21]
PADOP (0x34e75b8) gv  GV (0xa76c8) *a
UNOP (0x34eed58) null
LOGOP (0x34eed90) and
OP (0x34eee00) iter
LISTOP (0x34eef68) lineseq
COP (0x34eefa8) nextstate
BINOP (0x34ef010) leaveloop
LOOP (0x34ef128) enteriter [22]
OP (0x34ef180) null [3]
UNOP (0x34ef220) null [142]
OP (0x34ef1f0) pushmark
UNOP (0x34ef4c8) rv2av [24]
+0) *b
UNOP (0x34ef050) null
LOGOP (0x34ef088) and
OP (0x34ef0f8) iter
LISTOP (0x34ef260) lineseq
COP (0x34ef2a0) nextstate
UNOP (0x34ef308) null
LOGOP (0x34ef340) and
BINOP (0x34ef428) eq
+19]
+22]
UNOP (0x34ef380) preinc
UNOP (0x34ef3b8) null
+[15]
+gvsv  GV (0x33c5ed0) *count
OP (0x34ef0c8) unstack
OP (0x34eedd0) unstack
-e syntax OK
```
2. Strings:
```
C:\test>perl -MO=Terse -E"$a=join'',map{int rand 2}1..64;@b=map{int rand 2}1..64; $count=($a&$b)=~tr[1][]"
LISTOP (0x3447bc0) leave [1]
OP (0x344f178) enter
COP (0x3447c00) nextstate
BINOP (0x3447c68) sassign
LISTOP (0x3447cd8) join [8]
OP (0x3447ca8) pushmark
SVOP (0x3448118) const [22] PV (0x332ca20) ""
LOGOP (0x3447d88) mapwhile [7]
LISTOP (0x3447df8) mapstart
OP (0x3447dc8) pushmark
UNOP (0x3447d50) null
UNOP (0x3447e38) null
LISTOP (0x3447fc8) scope
OP (0x3448008) null [177]
UNOP (0x3448070) int [3]
UNOP (0x34480a8) rand [2]
SVOP (0x34480e0) const [6] IV (0x332cb58) 2
UNOP (0x3447e70) rv2av
SVOP (0x3447d18) const [23] AV (0x3327640)
UNOP (0x3448150) null [15]
PADOP (0x3448188) gvsv  GV (0xa76a8) *a
COP (0x34475c8) nextstate
BINOP (0x3447630) aassign [17]
UNOP (0x34476a0) null [142]
OP (0x3447670) pushmark
LOGOP (0x34477c0) mapwhile [16]
LISTOP (0x3447830) mapstart
OP (0x3447800) pushmark
UNOP (0x3447788) null
UNOP (0x3447870) null
LISTOP (0x3447a00) scope
OP (0x3447a40) null [177]
UNOP (0x3447aa8) int [12]
UNOP (0x3447ae0) rand [11]
SVOP (0x3447b18) const [15] IV (0x3326f00) 2
UNOP (0x34478a8) rv2av
SVOP (0x3447750) const [24] AV (0x3326900)
UNOP (0x3447710) null [142]
OP (0x34476e0) pushmark
UNOP (0x3447b50) rv2av [10]
PADOP (0x3447b88) gv  GV (0x3327010) *b
COP (0x344f1e8) nextstate
BINOP (0x344f250) sassign
UNOP (0x344f290) null
BINOP (0x344f3e8) bit_and [21]
UNOP (0x344f498) null [15]
PADOP (0x34474e0) gvsv  GV (0xa76a8) *a
UNOP (0x344f428) null [15]
PADOP (0x344f460) gvsv  GV (0x3327010) *b
PVOP (0x344f3b0) trans
UNOP (0x3447518) null [15]
PADOP (0x3447550) gvsv  GV (0x33262d0) *count
-e syntax OK
```

3. Bits:
```
C:\test>perl -MO=Terse -E"$a=int rand 2**64;$b=int rand 2**64; $count= unpack '%32b*', $a & $b"
LISTOP (0x33e7460) leave [1]
OP (0x33e6e60) enter
COP (0x33e74a0) nextstate
BINOP (0x33e7508) sassign
UNOP (0x33e7548) int [4]
UNOP (0x33e7580) rand [3]
SVOP (0x33e75b8) const [13] NV (0x32ca498) 1.84467440737096e+019
UNOP (0x33e76a0) null [15]
PADOP (0x33e76d8) gvsv  GV (0x107668) *a
COP (0x33e71f0) nextstate
BINOP (0x33e7258) sassign
UNOP (0x33e7298) int [8]
UNOP (0x33e72d0) rand [7]
SVOP (0x33e7308) const [14] NV (0x32ca5a0) 1.84467440737096e+019
UNOP (0x33e73f0) null [15]
PADOP (0x33e7428) gvsv  GV (0x32ca510) *b
COP (0x33e6ed0) nextstate
BINOP (0x33e6f38) sassign
LISTOP (0x33e6fa8) unpack
OP (0x33e6f78) null [3]
SVOP (0x33e7108) const [15] PV (0x32ca600) "%32b*"
BINOP (0x33e6fe8) bit_and [12]
UNOP (0x33e7098) null [15]
PADOP (0x33e70d0) gvsv  GV (0x107668) *a
UNOP (0x33e7028) null [15]
PADOP (0x33e7060) gvsv  GV (0x32ca510) *b
UNOP (0x33e7140) null [15]
PADOP (0x33e7178) gvsv  GV (0x32ca5d0) *count
-e syntax OK
```
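To go with the opcode listings, here is a minimal sketch (illustrative data, not taken from the one-liners above) that counts element-wise shared 1s in all three representations and shows they agree. The array version runs its loop body once per element at the Perl level; the string and bit-string versions each do the whole job in a couple of ops whose looping happens in C:

```perl
use strict;
use warnings;

# Illustrative 64-element 0/1 arrays, generated deterministically.
my @p = map { $_ % 3 == 0 ? 1 : 0 } 0 .. 63;
my @q = map { $_ % 2 == 0 ? 1 : 0 } 0 .. 63;

# 1. Arrays: one Perl-level loop iteration per element.
my $byArray = 0;
$p[$_] && $q[$_] and ++$byArray for 0 .. $#p;

# 2. Strings of ASCII '0'/'1': '1' & '1' gives '1', any other pair gives '0';
#    tr then counts the surviving '1' characters in a single op.
my( $pStr, $qStr ) = ( join( '', @p ), join( '', @q ) );
my $byString = ( $pStr & $qStr ) =~ tr/1//;

# 3. Bit-strings: 8 bytes each; AND them and sum the set bits.
my( $pBits, $qBits ) = ( pack( 'b*', $pStr ), pack( 'b*', $qStr ) );
my $byBits = unpack '%32b*', $pBits & $qBits;

print "array=$byArray string=$byString bits=$byBits\n";
```

The bit-string form also touches an eighth of the memory of the ASCII string form, which is one plausible reason it comes out ahead of `strings` in the benchmark.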


Just be careful to create your data as bitstrings in the first place. If you create arrays and then turn them into bitstrings to do the comparison, then it is not that fast:

```perl
use strict;
use warnings;
use Benchmark 'cmpthese';

# Return a list of $_[0] flags, each set to 1 with probability $_[1].
sub create { map {rand() < $_[1] ? 1 : 0} 1..$_[0] }

sub compare2a { # first find 1s in x, then check in ys
    my $x = shift;
    my $n = shift;
    my @nxs = grep { $x->[$_] } 0..$n-1;
    return map { scalar grep {$_} @{$_}[@nxs] } @_;
}

sub compare4 { # bitstrings
    my $x = shift;
    $x = pack 'b*', join '', @$x;
    return map { unpack '%32b*', ( $x & pack 'b*', join'',@$_ ) } @_;
}

my $n  = 15000;
my $p  = 0.005;
my $ny = 10;
my @x = create $n, $p;
my @ys = map { [ create $n, $p ] } 1..$ny;

my @r2a = compare2a \@x, $n, @ys;
my @r4 = compare4 \@x, @ys;
print "compare2a: @r2a\n";
print "compare4:  @r4\n";

cmpthese( -5, {
    compare2a => sub{ compare2a \@x, $n, @ys },
    compare4 => sub{ compare4 \@x, @ys },
}
);
```
Result:
```
           Rate  compare4 compare2a
compare4  246/s        --      -55%
compare2a 543/s      120%        --
```
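One way to create the data as bitstrings in the first place is to set bits directly with `vec`, so no intermediate array or `join` is ever built. This is only a sketch under assumptions: `make_bits` is a hypothetical helper and the positions are made-up data, not the benchmark's.

```perl
use strict;
use warnings;

# make_bits() is a hypothetical helper: a bit-string of $n bits with the
# given positions set, built directly with vec() (no intermediate array).
sub make_bits {
    my( $n, @ones ) = @_;
    my $bits = "\0" x int( ( $n + 7 ) / 8 );    # $n zero bits, rounded up to whole bytes
    vec( $bits, $_, 1 ) = 1 for @ones;
    return $bits;
}

my $n  = 15_000;
my $x  = make_bits( $n, 3, 10, 200, 14_999 );
my @ys = (
    make_bits( $n, 3, 200, 9_000 ),         # shares positions 3 and 200
    make_bits( $n, 10, 14_999, 3, 200 ),    # shares all four
    make_bits( $n, 1, 2 ),                  # shares none
);

# vec()'s 1-bit numbering matches pack/unpack 'b*', so the same count works.
my @shared = map { unpack '%32b*', $x & $_ } @ys;
print "@shared\n";    # 2 4 0
```

With the packing done once at creation time, each comparison is just the `&` and the `unpack`, which is the case the first benchmark in this thread measures.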
If you create arrays and then turn them into bitstrings [ every time ] to do the comparison, then it is not that fast:

No shit Sherlock :)

