PerlMonks

### Re^2: Comparing two arrays

by roboticus (Chancellor) on Dec 16, 2013 at 13:43 UTC ( #1067330=note )

in reply to Re: Comparing two arrays

If you're looking for duplicate vectors, digesting can greatly reduce the number of comparisons, but it won't let you escape them altogether: http://en.wikipedia.org/wiki/Pigeonhole_principle.

Update: s/While/If you're looking for duplicate vectors/ and changed conjunction so the sentence still reads well.
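A minimal sketch of the point about digests, for the duplicate-vector case only. It assumes the vectors are packed bit-strings and uses `Digest::MD5` as an arbitrary choice of digest; the sub name `duplicate_pairs` is hypothetical. Digests cut the work down to comparing vectors that share a bucket, but each candidate pair is still compared in full, since equal digests do not guarantee equal vectors.

```perl
use strict;
use warnings;
use Digest::MD5 qw[ md5 ];

# Bucket packed bit-vectors by digest, then confirm each candidate pair
# with a full comparison. Returns [ i, j ] index pairs of exact duplicates.
sub duplicate_pairs {
    my @vectors = @_;

    my %bucket;
    push @{ $bucket{ md5( $vectors[$_] ) } }, $_ for 0 .. $#vectors;

    my @pairs;
    for my $ids ( grep { @$_ > 1 } values %bucket ) {
        for my $i ( 0 .. $#$ids - 1 ) {
            for my $j ( $i + 1 .. $#$ids ) {
                my( $p, $q ) = @{ $ids }[ $i, $j ];
                # Equal digests only make equality likely; compare to be sure.
                push @pairs, [ $p, $q ] if $vectors[$p] eq $vectors[$q];
            }
        }
    }
    return sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] } @pairs;
}

my @v = map { pack 'b*', $_ } qw[ 10110 01101 10110 11100 ];
print join( ' ', map { "$_->[0]:$_->[1]" } duplicate_pairs( @v ) ), "\n";   # prints 0:2
```

Only vectors 0 and 2 land in the same bucket, so that is the only pair compared in full.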

...roboticus

When your only tool is a hammer, all problems look like your thumb.

### Re^3: Comparing two arrays
by BrowserUk (Pope) on Dec 16, 2013 at 13:53 UTC
"While digesting can greatly reduce the number of comparisons"

That would only be true if the OP was looking for exact matches. He isn't.

He's looking for the best matches, where 'best' is defined in terms of the number of set bits in matching positions. No hashing, digesting, or sorting approach to this problem is possible.

Every X must be fully compared against every Y.
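To make the 'best match' metric concrete, here is a small sketch. The sub name `common_ones` and the sample vectors are made up for illustration. It scores a pair by the number of positions where both vectors have a 1, and shows that the best Y need not be identical to X, which is why bucketing by digest does not help here.

```perl
use strict;
use warnings;

# Number of positions where both '0'/'1' strings have a 1.
sub common_ones {
    my( $x, $y ) = @_;
    # pack 'b*' turns each string into a bit-vector, & ANDs them bitwise,
    # and unpack '%32b*' sums the set bits of the result.
    return unpack '%32b*', ( pack( 'b*', $x ) & pack( 'b*', $y ) );
}

my $x  = '1011010';
my @ys = qw[ 1110011 0100101 1111110 ];

# Score every Y against X; nothing short of looking at each pair tells us
# which one wins.
my @scores = map { common_ones( $x, $_ ) } @ys;
my( $best ) = sort { $scores[$b] <=> $scores[$a] } 0 .. $#ys;

print "best Y for X: $best ($scores[$best] bits in common)\n";
# prints: best Y for X: 2 (4 bits in common)
```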

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

That's certainly true for the original problem. I was intending to refute the statement "because one and only one vector will map to the same hash value", but I didn't take the rest of the thread into account. (I corrected the node accordingly.)

However, you needn't compare each X fully against each Y, either. Just like your Bloom filter project a while ago, there may be ways to transform the problem so we don't have to explicitly compare vectors against each other. I've been working a bit on one, but I haven't posted it because the performance is currently worse than direct comparisons, and most of the changes I make to it slow it down even further.

...roboticus


Feel free to add your method to my benchmark. We can ignore the timings.

If you set the iteration count to 1 (-I=1), then it will display the top 10 X:Y pairs along with the number of 1s they have in common. If you can match the results those three methods all concur on without doing the full X x Y x 15_000-bit comparison, you'll have proved your case:

```perl
#! perl -slw
use strict;
use Benchmark qw[ cmpthese ];
use Data::Dump qw[ pp ]; $Data::Dump::WIDTH = 500;

our $I //= -1;
our $N //= 1000;

our @xArrays = map[ map int( rand 2 ), 1 .. 15_000 ], 1 .. $N;
our @yArrays = map[ map int( rand 2 ), 1 .. 15_000 ], 1 .. $N;

our @xStrings = map{ join '', @$_  } @xArrays;
our @yStrings = map{ join '', @$_  } @yArrays;

our @xBits = map{ pack 'b*', $_ } @xStrings;
our @yBits = map{ pack 'b*', $_ } @yStrings;

cmpthese $I, {
    array => q[
        my %top10s;
        for my $x ( 0 .. $#xArrays ) {
            for my $y ( 0 .. $#yArrays ) {
                my $count = 0;
                $xArrays[$x][$_] == 1 && $yArrays[$y][$_] == 1 and ++$count for 0 .. $#{ $xArrays[ 0 ] };
                $top10s{"$x:$y"} = $count;
                my $discard = ( sort{ $top10s{$a} <=> $top10s{$b} } keys %top10s )[ 0 ];
                keys( %top10s ) > 10 and delete $top10s{$discard};
            }
        }
        $I == 1 and pp ' arrays: ', %top10s;
    ],
    strings => q[
        my %top10s;
        for my $x ( 0 .. $#xStrings ) {
            for my $y ( 0 .. $#yStrings ) {
                my $count = ( $xStrings[$x] & $yStrings[$y] ) =~ tr[1][];
                $top10s{"$x:$y"} = $count;
                my $discard = ( sort{ $top10s{$a} <=> $top10s{$b} } keys %top10s  )[ 0 ];
                keys( %top10s ) > 10 and delete $top10s{$discard};
            }
        }
        $I == 1 and pp 'strings: ', %top10s;
    ],
    bits => q[
        my %top10s;
        for my $x ( 0 .. $#xBits ) {
            for my $y ( 0 .. $#yBits ) {
                my $count = unpack '%32b*', ( $xBits[$x] & $yBits[$y] );
                $top10s{"$x:$y"} = $count;
                my $discard = ( sort{ $top10s{$a} <=> $top10s{$b} } keys %top10s )[ 0 ];
                keys( %top10s ) > 10 and delete $top10s{$discard};
            }
        }
        $I == 1 and pp '   bits: ', %top10s;
    ],
};

__END__
C:\test>1067218 -I=1 -N=100
(" arrays: ", "44:16", 3911, "23:58", 3913, "78:4", 3907, "54:24", 3909, "10:16", 3929, "78:24", 3928, "23:16", 3920, "23:24", 3922, "58:56", 3917, "54:58", 3914)

(warning: too few iterations for a reliable count)

("   bits: ", "44:16", 3911, "23:58", 3913, "78:4", 3907, "54:24", 3909, "10:16", 3929, "78:24", 3928, "23:16", 3920, "23:24", 3922, "58:56", 3917, "54:58", 3914)

(warning: too few iterations for a reliable count)

("strings: ", "44:16", 3911, "23:58", 3913, "78:4", 3907, "54:24", 3909, "10:16", 3929, "78:24", 3928, "23:16", 3920, "23:24", 3922, "58:56", 3917, "54:58", 3914)

(warning: too few iterations for a reliable count)

               Rate   array strings    bits
array   1.98e-002/s      --    -98%   -100%
strings      1.12/s   5574%      --    -82%
bits         6.41/s  32272%    471%      --
```

"However, you needn't compare each X fully against each Y, either. Just like your Bloom filter project a while ago, there may be ways to transform the problem so we don't have to explicitly compare vectors against each other."

I really don't believe that is possible. Please prove me wrong?

