
Comparing two arrays

by baxy77bax (Deacon)
on Dec 15, 2013 at 10:20 UTC

baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I realize this is a long shot, but maybe, just maybe, someone willing to share can help me. The problem I am trying to solve involves two integer arrays. One is static and one changes. Let x[] be the array whose content does not change and y[] the one that changes. Both arrays are 15,000 ints long and they only contain values of 0 and 1. Example:

```
x[0 0 0 0 1 1 0 0 0 0 1 0 ...]
y[1 0 0 0 1 0 1 0 1 0 0 0 ...]
```
There are about 5,000,000 y-arrays that I need to compare with 100,000 x-arrays. What I could do is start from 0 and pairwise compare each entry (x1<=>y1, x1<=>y2, ... x2<=>y1, ... xn<=>yn). Though fast, this is just not fast enough: by my calculations it could last several days on the cluster machine I have at my disposal, and I need to repeat this approximately 1000 times each month.

The other thing I could do is preprocess the y-arrays so that I only store the positions where the 1's are, then for each y just check whether x[y[i]] == 1. Since there are about 500 times fewer 1's than 0's, this runs much faster (and saves a lot of memory). And this is where I stopped. However, with this approach it still takes me 1 second to compare one x with all the y-arrays. Extrapolated to 100,000 x-arrays, that turns out to be a lot of time, especially considering that I need to repeat this 1000 times.

Additional information: my end result is not to know, for each (x,y) pair, how many 1's they share, just to know which are the top 10 y-arrays that share the most 1's with each x-array. The distribution of 1's in both the y- and x-arrays is random (by the looks of it, possibly uniform). My goal is to speed this comparison up at least 3 times, because then I can do something with it.

Obviously, to speed this up I need to do as few comparisons (checkups) as possible. I was thinking about computing a hash key (a checksum of some sort), but then I would be able to identify only identical (x,y) pairs. I was also thinking of converting the x-array to a bit array, but I am not sure how that would help me speed-wise. I did rewrite this in C and gained a lot of speed, but still not enough. As you can see, I am stuck. So if anyone has any input for me, please do not hesitate to post, no matter how crazy or slow the idea is. And I apologize if this is not, strictly speaking, a Perl (syntax-, code-wise) related issue.
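For what it's worth, the index-list preprocessing described above can be sketched on toy data like this (the arrays here are made up purely for illustration):

```perl
use strict;
use warnings;

# Toy versions of one x and one y array (made-up data).
my @x = ( 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0 );
my @y = ( 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0 );

# Preprocess y once: keep only the indices where it holds a 1.
my @y_ones = grep { $y[$_] } 0 .. $#y;

# Counting shared 1s is then one array lookup per stored index,
# instead of one comparison per element.
my $shared = grep { $x[$_] } @y_ones;
print "$shared\n";    # prints 2 (shared 1s at positions 4 and 10)
```

With 1's roughly 500 times rarer than 0's, this trades 15,000 comparisons per pair for about 30 lookups, which matches the speedup described, before hitting the next wall.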

Maybe there should be a subforum for this type of question/discussion, as Perl is being used more and more for theoretical/conceptual problem-solving tasks: you don't have to worry about technicalities, and since a lot of the people who do this type of work are coming from C, there is no syntax leap.

Anyway, thank you

baxy

Replies are listed 'Best First'.
Re: Comparing two arrays
by BrowserUk (Pope) on Dec 15, 2013 at 12:16 UTC
My end result is not to know, for each (x,y) array, how many 1's they share, just to know what are the top 10 y arrays that share the most 1's with each x array.

Convert your arrays of 0s and 1s to bit-strings, then use bitwise-& and unpack '%32b*' to count the equivalences, and you can do this 300+ times faster than comparing the arrays:

```
#! perl -slw
use strict;
use Benchmark qw[ cmpthese ];
use Data::Dump qw[ pp ]; $Data::Dump::WIDTH = 500;

our $I //= -1;
our $N //= 1000;

our @xArrays = map[ map int( rand 2 ), 1 .. 15_000 ], 1 .. $N;
our @yArrays = map[ map int( rand 2 ), 1 .. 15_000 ], 1 .. $N;

our @xStrings = map{ join '', @$_ } @xArrays;
our @yStrings = map{ join '', @$_ } @yArrays;

our @xBits = map{ pack 'b*', $_ } @xStrings;
our @yBits = map{ pack 'b*', $_ } @yStrings;

cmpthese $I, {
    array => q[
        my %top10s;
        for my $x ( 0 .. $#xArrays ) {
            for my $y ( 0 .. $#yArrays ) {
                my $count = 0;
                $xArrays[$x][$_] == 1 && $yArrays[$y][$_] == 1 and ++$count for 0 .. $#{ $xArrays[ 0 ] };
                $top10s{"$x:$y"} = $count;
                my $discard = ( sort{ $top10s{$a} <=> $top10s{$b} } keys %top10s )[ 0 ];
                keys( %top10s ) > 10 and delete $top10s{$discard};
            }
        }
        $I == 1 and pp ' arrays: ', %top10s;
    ],
    strings => q[
        my %top10s;
        for my $x ( 0 .. $#xStrings ) {
            for my $y ( 0 .. $#yStrings ) {
                my $count = ( $xStrings[$x] & $yStrings[$y] ) =~ tr[1][];
                $top10s{"$x:$y"} = $count;
                my $discard = ( sort{ $top10s{$a} <=> $top10s{$b} } keys %top10s )[ 0 ];
                keys( %top10s ) > 10 and delete $top10s{$discard};
            }
        }
        $I == 1 and pp 'strings: ', %top10s;
    ],
    bits => q[
        my %top10s;
        for my $x ( 0 .. $#xBits ) {
            for my $y ( 0 .. $#yBits ) {
                my $count = unpack '%32b*', ( $xBits[$x] & $yBits[$y] );
                $top10s{"$x:$y"} = $count;
                my $discard = ( sort{ $top10s{$a} <=> $top10s{$b} } keys %top10s )[ 0 ];
                keys( %top10s ) > 10 and delete $top10s{$discard};
            }
        }
        $I == 1 and pp '   bits: ', %top10s;
    ],
};

__END__
C:\test>1067218 -N=100
                Rate   array strings    bits
array   1.95e-002/s      --    -98%   -100%
strings      1.08/s   5417%      --    -82%
bits         5.97/s  30510%    455%      --
```

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Thank you so much for the code and the benchmark; after seeing this I'll try to implement the strategy. However, what I'm wondering now is where the speed comes from. I remember reading somewhere that when you search for a certain bit in a bit-string, the bit is found by iterating through the memory block, whereas accessing an array element is constant time. Is it possible that these constants are so large that it is cheaper to linearly scan through memory blocks, or did I mix something up (which is probably the case)? Could you please educate me a "bit" :)

Thank you

baxy

i'm wondering now where the speed comes from.

Perhaps the simplest way to demonstrate the difference is to look at the number of opcodes generated in order to compare and count two sets of 64 bits stored as: two arrays; two strings of ASCII 1s and 0s; two bitstrings of 64 bits each. You don't need to understand the opcodes to see the reduction.

Moving as much of the work (looping) as possible into the optimised, compiled-C opcodes saves huge swaths of time and processor:

1. Arrays:
```
C:\test>perl -MO=Terse -E"@a=map{int rand 2}1..64;@b=map{int rand 2}1..64; for my$a(@a){ for my $b(@b){ $a==$b and ++$count }}"
LISTOP (0x34e7c58) leave [1]
OP (0x34eec40) enter
COP (0x34e7c98) nextstate
BINOP (0x34e7d00) aassign [9]
UNOP (0x34e7d70) null [142]
OP (0x34e7d40) pushmark
LOGOP (0x34e7e90) mapwhile [8]
LISTOP (0x34e7f00) mapstart
OP (0x34e7ed0) pushmark
UNOP (0x34e7e58) null
UNOP (0x34e7f40) null
LISTOP (0x34e80d0) scope
OP (0x34e8110) null [177]
UNOP (0x34e8178) int [4]
UNOP (0x34e81b0) rand [3]
SVOP (0x34e81e8) const [7] IV (0x33cca88) 2
UNOP (0x34e7f78) rv2av
SVOP (0x34e7e20) const [26] AV (0x33c7570)
UNOP (0x34e7de0) null [142]
OP (0x34e7db0) pushmark
UNOP (0x34e8220) rv2av [2]
PADOP (0x34e8258) gv  GV (0xa76c8) *a
COP (0x34e7660) nextstate
BINOP (0x34e76c8) aassign [18]
UNOP (0x34e7738) null [142]
OP (0x34e7708) pushmark
LOGOP (0x34e7858) mapwhile [17]
LISTOP (0x34e78c8) mapstart
OP (0x34e7898) pushmark
UNOP (0x34e7820) null
UNOP (0x34e7908) null
LISTOP (0x34e7a98) scope
UNOP (0x34e7b40) int [13]
UNOP (0x34e7b78) rand [12]
SVOP (0x34e7bb0) const [16] IV (0x33c6e30) 2
UNOP (0x34e7940) rv2av
SVOP (0x34e77e8) const [27] AV (0x33c6830)
UNOP (0x34e77a8) null [142]
OP (0x34e7778) pushmark
UNOP (0x34e7be8) rv2av [11]
PADOP (0x34e7c20) gv  GV (0x33c6f40) *b
COP (0x34eecb0) nextstate
BINOP (0x34eed18) leaveloop
LOOP (0x34eee30) enteriter [19]
OP (0x34eee88) null [3]
UNOP (0x34eef28) null [142]
OP (0x34eeef8) pushmark
UNOP (0x34ef568) rv2av [21]
PADOP (0x34e75b8) gv  GV (0xa76c8) *a
UNOP (0x34eed58) null
LOGOP (0x34eed90) and
OP (0x34eee00) iter
LISTOP (0x34eef68) lineseq
COP (0x34eefa8) nextstate
BINOP (0x34ef010) leaveloop
LOOP (0x34ef128) enteriter [22]
OP (0x34ef180) null [3]
UNOP (0x34ef220) null [142]
OP (0x34ef1f0) pushmark
UNOP (0x34ef4c8) rv2av [24]
0) *b
UNOP (0x34ef050) null
LOGOP (0x34ef088) and
OP (0x34ef0f8) iter
LISTOP (0x34ef260) lineseq
COP (0x34ef2a0) nextstate
UNOP (0x34ef308) null
LOGOP (0x34ef340) and
BINOP (0x34ef428) eq
19]
22]
UNOP (0x34ef380) preinc
UNOP (0x34ef3b8) null [15]
gvsv  GV (0x33c5ed0) *count
OP (0x34ef0c8) unstack
OP (0x34eedd0) unstack
-e syntax OK
```
2. Strings:
```
C:\test>perl -MO=Terse -E"$a=join'',map{int rand 2}1..64;@b=map{int rand 2}1..64; $count=($a&$b)=~tr[1][]"
LISTOP (0x3447bc0) leave [1]
OP (0x344f178) enter
COP (0x3447c00) nextstate
BINOP (0x3447c68) sassign
LISTOP (0x3447cd8) join [8]
OP (0x3447ca8) pushmark
SVOP (0x3448118) const [22] PV (0x332ca20) ""
LOGOP (0x3447d88) mapwhile [7]
LISTOP (0x3447df8) mapstart
OP (0x3447dc8) pushmark
UNOP (0x3447d50) null
UNOP (0x3447e38) null
LISTOP (0x3447fc8) scope
OP (0x3448008) null [177]
UNOP (0x3448070) int [3]
UNOP (0x34480a8) rand [2]
SVOP (0x34480e0) const [6] IV (0x332cb58) 2
UNOP (0x3447e70) rv2av
SVOP (0x3447d18) const [23] AV (0x3327640)
UNOP (0x3448150) null [15]
PADOP (0x3448188) gvsv  GV (0xa76a8) *a
COP (0x34475c8) nextstate
BINOP (0x3447630) aassign [17]
UNOP (0x34476a0) null [142]
OP (0x3447670) pushmark
LOGOP (0x34477c0) mapwhile [16]
LISTOP (0x3447830) mapstart
OP (0x3447800) pushmark
UNOP (0x3447788) null
UNOP (0x3447870) null
LISTOP (0x3447a00) scope
OP (0x3447a40) null [177]
UNOP (0x3447aa8) int [12]
UNOP (0x3447ae0) rand [11]
SVOP (0x3447b18) const [15] IV (0x3326f00) 2
UNOP (0x34478a8) rv2av
SVOP (0x3447750) const [24] AV (0x3326900)
UNOP (0x3447710) null [142]
OP (0x34476e0) pushmark
UNOP (0x3447b50) rv2av [10]
PADOP (0x3447b88) gv  GV (0x3327010) *b
COP (0x344f1e8) nextstate
BINOP (0x344f250) sassign
UNOP (0x344f290) null
BINOP (0x344f3e8) bit_and [21]
UNOP (0x344f498) null [15]
PADOP (0x34474e0) gvsv  GV (0xa76a8) *a
UNOP (0x344f428) null [15]
PADOP (0x344f460) gvsv  GV (0x3327010) *b
PVOP (0x344f3b0) trans
UNOP (0x3447518) null [15]
PADOP (0x3447550) gvsv  GV (0x33262d0) *count
-e syntax OK
```
3. Bits:
```
C:\test>perl -MO=Terse -E"$a=int rand 2**64;$b=int rand 2**64; $count = unpack '%32b*', $a & $b"
LISTOP (0x33e7460) leave [1]
OP (0x33e6e60) enter
COP (0x33e74a0) nextstate
BINOP (0x33e7508) sassign
UNOP (0x33e7548) int [4]
UNOP (0x33e7580) rand [3]
SVOP (0x33e75b8) const [13] NV (0x32ca498) 1.84467440737096e+019
UNOP (0x33e76a0) null [15]
PADOP (0x33e76d8) gvsv  GV (0x107668) *a
COP (0x33e71f0) nextstate
BINOP (0x33e7258) sassign
UNOP (0x33e7298) int [8]
UNOP (0x33e72d0) rand [7]
SVOP (0x33e7308) const [14] NV (0x32ca5a0) 1.84467440737096e+019
UNOP (0x33e73f0) null [15]
PADOP (0x33e7428) gvsv  GV (0x32ca510) *b
COP (0x33e6ed0) nextstate
BINOP (0x33e6f38) sassign
LISTOP (0x33e6fa8) unpack
OP (0x33e6f78) null [3]
SVOP (0x33e7108) const [15] PV (0x32ca600) "%32b*"
BINOP (0x33e6fe8) bit_and [12]
UNOP (0x33e7098) null [15]
PADOP (0x33e70d0) gvsv  GV (0x107668) *a
UNOP (0x33e7028) null [15]
PADOP (0x33e7060) gvsv  GV (0x32ca510) *b
UNOP (0x33e7140) null [15]
PADOP (0x33e7178) gvsv  GV (0x32ca5d0) *count
-e syntax OK
```


Just be careful to create your data as bitstrings in the first place. If you create arrays and then turn them into bitstrings to do the comparison, it is not that fast:

```
use strict;
use warnings;
use Benchmark 'cmpthese';

sub create { map { rand() < $_[1] ? 1 : 0 } 1 .. $_[0] }

sub compare2a { # first find 1s in x, then check in ys
    my $x = shift;
    my $n = shift;
    my @nxs = grep { $x->[$_] } 0 .. $n - 1;
    return map { scalar grep { $_ } @{$_}[@nxs] } @_;
}

sub compare4 { # bitstrings
    my $x = shift;
    $x = pack 'b*', join '', @$x;
    return map { unpack '%32b*', ( $x & pack 'b*', join '', @$_ ) } @_;
}

my $n  = 15000;
my $p  = 0.005;
my $ny = 10;
my @x  = create $n, $p;
my @ys = map { [ create $n, $p ] } 1 .. $ny;

my @r2a = compare2a \@x, $n, @ys;
my @r4  = compare4 \@x, @ys;
print "compare2a: @r2a\n";
print "compare4:  @r4\n";

cmpthese( -5, {
    compare2a => sub { compare2a \@x, $n, @ys },
    compare4  => sub { compare4 \@x, @ys },
} );
```
Result:
```
           Rate  compare4 compare2a
compare4  246/s        --      -55%
compare2a 543/s      120%        --
```
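One way to act on that advice (a sketch, assuming vec() is an acceptable way to generate the data; create_bits is a hypothetical stand-in for however the real data is produced) is to build each bit-string directly, so the intermediate Perl array and the join/pack pass never exist:

```perl
use strict;
use warnings;

# Build a random n-bit string with 1-density $p directly via vec(),
# with no intermediate array of 0s and 1s.
sub create_bits {
    my ( $n, $p ) = @_;
    my $bits = "\0" x ( ( $n + 7 ) >> 3 );    # preallocate n bits, zeroed
    for my $i ( 0 .. $n - 1 ) {
        vec( $bits, $i, 1 ) = 1 if rand() < $p;
    }
    return $bits;
}

my $x = create_bits( 15_000, 0.005 );
my $y = create_bits( 15_000, 0.005 );

# Counting shared 1s then works exactly as in the benchmarks: & plus unpack.
my $shared = unpack '%32b*', $x & $y;
print "$shared shared bits\n";
```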
Re: Comparing two arrays
by hdb (Monsignor) on Dec 15, 2013 at 10:38 UTC

How about turning the arrays into strings and then using the bitwise & on them, with tr to count the 1s?

```
my @x = qw( 0 0 0 0 1 1 0 0 0 0 1 0 );
my @y = qw( 1 0 0 0 1 0 1 0 1 0 1 0 );

my $x = join '', @x;
my $y = join '', @y;

my $compare = $x & $y;
my $joint   = $compare =~ tr/1/1/;
print "$joint\n";
```

Can you throw this into your benchmark? @x obviously would need to be turned into a string only once.

Hi, thank you for making a suggestion. Correct me if I'm wrong, but if I turn an array into a string, what I am basically doing is creating a character array (1 byte per character). The bitwise & will then rewrite my two arrays into one consensus array, which means a number of computational steps equal to the size of the smaller array (best case). But shouldn't that be slower than doing a number of random-access array checkups equal to the total number of 1's in the y array (which is ca. 500 times smaller than the size of the smaller of the two arrays)? It would definitely reduce my memory, but I don't see how it would increase speed. Basically what you are suggesting is equivalent to converting my arrays into bit-arrays and then computing the intersect, and I truly don't see why that would be faster. Could someone maybe discuss this approach with me a bit further before I start creating bit/character arrays?

Thank you so much for engaging the conversation (++)

You are right. See the following benchmarks:

```
use strict;
use warnings;
use Benchmark 'cmpthese';

sub create { map { rand() < $_[1] ? 1 : 0 } 1 .. $_[0] }

sub compare1 { # naive
    my $x = shift;
    my $n = shift;
    return map { my $y = $_; scalar grep { $y->[$_] == 1 and $x->[$_] == 1 } 0 .. $n - 1 } @_;
}

sub compare2 { # first find 1s in x, then check in ys
    my $x = shift;
    my $n = shift;
    my @nxs = grep { $x->[$_] } 0 .. $n - 1;
    return map { my $y = $_; scalar grep { $y->[$_] == 1 } @nxs } @_;
}

sub compare3 { # stringify
    my $x = shift;
    $x = join '', @$x;
    return map { my $j = $x & join '', @$_; $j =~ tr/1/1/ } @_;
}

my $n  = 15000;
my $p  = 0.005;
my $ny = 10;
my @x  = create $n, $p;
my @ys = map { [ create $n, $p ] } 1 .. $ny;

my @r1 = compare1 \@x, $n, @ys;
my @r2 = compare2 \@x, $n, @ys;
my @r3 = compare3 \@x, @ys;
print "compare1: @r1\n";
print "compare2: @r2\n";
print "compare3: @r3\n";

cmpthese( -5, {
    compare1 => sub { compare1 \@x, $n, @ys },
    compare2 => sub { compare2 \@x, $n, @ys },
    compare3 => sub { compare3 \@x, @ys },
} );
```
Results:
```
           Rate compare1 compare3 compare2
compare1 48.8/s       --     -81%     -91%
compare3  253/s     420%       --     -53%
compare2  537/s    1002%     112%       --
```
Re: Comparing two arrays
by Lennotoecom (Pilgrim) on Dec 15, 2013 at 12:13 UTC
1. Please tell us what exactly you want to see at the end:
@a = 1 0 1
@b = 1 0 0
@c = ?
2. Does "15000 ints long" mean the length of each array, i.e. 15000 elements in each array?
Oh, I missed that part, sorry.
```
@a = 1 0 1
@b = 1 0 0
@c = 1 0 0 # unnecessary as array, just the final count of 1's in array @c
```
What I need is the total number of matched 1's, so in the example above I just need the number 1 as a result.

" 15000int's long means length of an each array?
15000 elements in each array? "

Yes, 15,000 integer values in any array that I'm dealing with, and yes, 15,000 values that are 1's or 0's: not 15,000 of just 1's, not 15,000 of just 0's, but #1's + #0's = 15,000. In other words, 15,000 elements over an alphabet of size 2: array (x or y) = { x | x in A } where A = {0,1}.

Re: Comparing two arrays
by oiskuu (Hermit) on Dec 15, 2013 at 18:57 UTC

Essentially then, you have a sparse matrix of booleans. With a density of 1/500, bitvectors will not produce the most compact form.

However, the most compact form is not of the essence here. When you search arbitrary data in multiple dimensions or by multiple keys, you'll need to index by each key; in effect you create a number of copies of the data. Here the index (address) of a bit is many times bigger than the one bit itself.

Now, here's the solution I'm thinking of. Make an AoA of the vectors X, bucketed by bit ID. That is, 15k buckets, each containing an estimated 200 IDs of vectors X. Make an array N[X] that counts hits per vector X. These structures you can keep in memory.

Update2. Reconsidering the above, I realise intermediate storage is unnecessary. However, if you populate the 15k buckets with all the Y vectors (this time), there will be approx 10k entries per bucket, totaling 4 * 10k * 15k = 600M bytes. Sort each bucket. Matching an X vector then involves scanning ~30 ways * 10k entries, in a manner very similar to a merge sort. This should take less than a second... (Inline C)

I wrote a crude implementation of merge(sort) to estimate the matching speed: 30 buckets, each with ~10k sorted, "l/l"-packed numbers (representing y-vector IDs). Merge and scan takes about 5 ms (unless I've botched it somewhere). Full search is ten minutes at that rate. Hm.

Update. SSE4.1 version doubled the performance.
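To make the bucketing concrete, here is a toy Perl sketch of the inverted index (not the Inline C / SSE4.1 implementation discussed above, and with made-up data): y-vector IDs are bucketed by bit position, and matching one x only touches the buckets for x's set bits.

```perl
use strict;
use warnings;

# Made-up sparse y vectors: ID => list of 1-bit positions.
my %y_ones = (
    0 => [ 0, 4, 7 ],
    1 => [ 4, 9 ],
    2 => [ 2, 4, 7, 9 ],
);

# Inverted index: for each bit position, which y IDs have that bit set?
my %bucket;
for my $id ( keys %y_ones ) {
    push @{ $bucket{$_} }, $id for @{ $y_ones{$id} };
}

# Matching one x (1s at positions 4, 7, 9) scans only those buckets,
# incrementing a per-y hit counter (the N[X]-style count above).
my @x_ones = ( 4, 7, 9 );
my %hits;
for my $pos (@x_ones) {
    $hits{$_}++ for @{ $bucket{$pos} // [] };
}

# y ID 2 comes out on top: it shares 1s with x at positions 4, 7 and 9.
print "y$_: $hits{$_} shared\n"
    for sort { $hits{$b} <=> $hits{$a} } keys %hits;
```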

Re: Comparing two arrays
by educated_foo (Vicar) on Dec 15, 2013 at 12:57 UTC
In addition to what others have said -- use packed bit-arrays -- you can sort the X arrays first, then do a binary search to find the closest X for each Y (~16 comparisons apiece).
you can sort the X arrays first, then do a binary search to find the closest X for each Y

Sort by what criteria? "closest" by what criteria?

Sort by what criteria? "closest" by what criteria?
That obviously depends upon what the OP is trying to do. Maybe he wants lexicographic sorting, or maybe some other mapping; it's not specified in the question, so I didn't guess.
Re: Comparing two arrays
by Anonymous Monk on Dec 15, 2013 at 21:43 UTC
I would try to use a hash to some advantage. For example, total up the sum of the entries and perhaps the index of the leftmost '1', then make that into a hash-key. The essential goal is to minimize the number of arrays that you actually search, no matter how you go about searching or representing them. The hash won't tell you which one (if any) matches, but only entries in that chosen bucket need be considered.

The essential goal is to minimize the number of arrays that you actually search, no matter how you go about searching or representing them.

This was my thought too; with that many items to cross-compare, various "cheats" to cut down on how many you actually need to search are profitable. Even something as simple as keeping a count of the number of 1's in each array gives you a really cheap way of asking "is it possible that these two arrays are the same?"

You'll get false positives (probably a lot of them), but you are guaranteed no false negatives, so that can chop orders of magnitude off the problem. With a 15,000-element array, you know there's some number of 1's between 0 and 15,000; if they're evenly distributed (which they won't be, even close, but it's a starting point for a guess) then you're immediately dropping from "test these 5 million" to "test these 333", which is a huge speedup. Even if the distribution is off by an order or two of magnitude, dropping to 3,000 or 30,000 comparisons is a giant gain over 5 million.
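A sketch of that popcount prefilter on made-up bit-strings (the threshold here stands in for the current 10th-best score): since x and y can share at most min(popcount(x), popcount(y)) set bits, any y whose stored popcount is below the threshold can be skipped without a full comparison.

```perl
use strict;
use warnings;

# Made-up packed bit-strings standing in for the y data set.
my @ys = map { pack 'b*', $_ } ( '1100', '1111', '0000', '1010' );

# One-time, cheap-to-store summary: number of 1s per y.
my @pop = map { unpack '%32b*', $_ } @ys;    # (2, 4, 0, 2)

my $x = pack 'b*', '1110';

# Shared 1s can never exceed popcount(y), so any y below the current
# top-10 cutoff is skipped before the comparatively expensive & step.
my $threshold  = 2;                          # e.g. the 10th-best score so far
my @candidates = grep { $pop[$_] >= $threshold } 0 .. $#ys;

# Full comparison only for survivors (the all-zero y was filtered out).
my @shared = map { unpack '%32b*', $x & $ys[$_] } @candidates;
print "candidates @candidates share @shared 1s with x\n";
```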

Then adding some additional hints that are cheap to store and compare (maybe not necessarily to calculate, but you may be able to do that at the same time as the arrays get built, for minimal added expense), like the leftmost-1-index or rightmost-1-index or longest-run-of-1's or some other thing meaningful to your data set, can chop that down even further. At what point the diminishing returns start hurting you is situation-specific; it would take knowing your data, and possibly some experimentation, to find it.

In essence, these sorts of universe-narrowing tricks can be seen as variations on the theme of Bloom filters. A quick glance at CPAN shows some Bloom filter implementations, but it's hard to say whether they're well suited to this particular type/scale of usage. They may on deeper investigation prove to be so...

Re: Comparing two arrays
by sundialsvc4 (Abbot) on Dec 16, 2013 at 13:05 UTC

Here’s another way to do it: scan the arrays to convert them to a vector using some form of “run-length encoding.” For instance, a list containing the position of the next '1' and the number of '1's that occur. Now for the magic: use a Digest algorithm of some sort ... could be CRC, SHA1, could be anything ... to convert that vector into a single value that can be used as a hash.

Actually, if you use a truly decent message-digest algorithm, like SHA1, you probably won’t have to do anything else, because one and only one vector will map to the same hash value. You have to preprocess each array one time to compute its message-digest hash, but from there on out you might not have to examine the array contents at all. A comparison of the digests ought to be sufficient. Aside from the overhead of passing through each array one time to calculate its digest, the overhead of laborious comparison ... disappears.

If you're looking for duplicate vectors, digesting can greatly reduce the number of comparisons, but it won't let you escape them altogether: http://en.wikipedia.org/wiki/Pigeonhole_principle.

Update: s/While/If you're looking for duplicate vectors/ and changed conjunction so the sentence still reads well.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

While digesting can greatly reduce the number of comparisons

That would only be true if the OP was looking for exact matches. He isn't.

He's looking for the best matches, where 'best' is defined in terms of the number of set bits in matching positions. No hashing, digesting, or sorting approach to this problem is possible.

Every X must be fully compared against every Y.

one and only one vector will map to the same hash value
Do you mean there are no hash collisions for SHA1?
لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
the overhead of laborious comparison ... disappears.

Total garbage.


This method was considered and discarded in the OP:

I was thinking about computing a hash key (a checksum of some sort) but then i would be able to identify only identical (x,y) pairs.

Node Type: perlquestion [id://1067218]
Approved by hdb
Front-paged by snoopy