PerlMonks  

Comparing two arrays

by baxy77bax (Chaplain)
on Dec 15, 2013 at 10:20 UTC ( #1067218=perlquestion )
baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I realize this is a long shot, but maybe, just maybe, someone willing to share can help me. The problem I am trying to solve involves two integer arrays: one is static and one changes. Let x[] be the array whose content does not change and y[] the one that changes. Both arrays are 15000 ints long and they contain only the values 0 and 1. Example:

x[0 0 0 0 1 1 0 0 0 0 1 0 ...]
y[1 0 0 0 1 0 1 0 1 0 0 0 ...]
There are about 5000000 y-arrays that I need to compare with 100000 x-arrays. What I could do is start from 0 and pairwise compare each entry (x1<=>y1, x1<=>y2, ... x2<=>y1, ... xn<=>yn). Though fast, this is just not fast enough: by my calculations it could take several days on the cluster machine I have at my disposal, and I need to repeat this approximately 1000 times each month.

The other thing I could do is preprocess the y-arrays so that I store only the positions where the 1's are. Then for each y I just check whether x[y[i]] == 1. Since there are about 500 times fewer 1's than 0's, this runs much faster (and saves a lot of memory). And this is where I stopped: even with this approach it takes me 1 sec to compare one x with all the y-arrays. Extrapolated to 100000 x-arrays this turns out to be a lot of time, especially considering that I need to repeat it 1000 times.

Additional information: my end result is not to know, for each (x,y) pair, how many 1's they share, but only which are the top 10 y-arrays that share the most 1's with each x-array. The distribution of 1's in both the y- and x-arrays is random (by the looks of it, they may be uniformly distributed).

My goal is to speed this comparison up at least 3 times, because then I can do something with it. Obviously, to speed it up I need to do as few comparisons (checkups) as possible. I was thinking about computing a hash key (a checksum of some sort), but then I would only be able to identify identical (x,y) pairs. I was also thinking of converting the x-array to a bit array, but I am not sure how this would help speed-wise. I did rewrite this in C and gained a lot of speed, but still not enough. As you can see, I am stuck. So if anyone has any input, please do not hesitate to post, no matter how crazy or slow the idea is. And I apologize if this is not, strictly speaking, a Perl (syntax-, code-wise) related issue.

Maybe there should be a subforum for this type of question/discussion, as Perl is being used more and more for theoretical/conceptual problem-solving tasks: you don't have to worry about technicalities, and since a lot of the people who do this type of work are coming from C, there is no syntax leap.

Anyway, thank you

baxy

Re: Comparing two arrays
by hdb (Parson) on Dec 15, 2013 at 10:38 UTC

    How about turning the arrays into strings, applying the bitwise string & to them, and using tr to count the 1s?

    my @x = qw( 0 0 0 0 1 1 0 0 0 0 1 0 );
    my @y = qw( 1 0 0 0 1 0 1 0 1 0 1 0 );
    my $x = join '', @x;
    my $y = join '', @y;
    my $compare = $x & $y;
    my $joint = $compare =~ tr/1/1/;
    print "$joint\n";

    Can you throw this into your benchmarking? @x obviously would need to be turned into a string only once.

      Hi, thank you for making a suggestion. Correct me if I'm wrong, but if I turn an array into a string, what I am basically doing is creating a character array (1 byte per character). Bitwise & will then rewrite my two arrays into one consensus array, which means a number of computational steps equal to the size of the smaller array (best case). But shouldn't that be slower than doing a number of random-access array checkups equal to the total number of 1's in the y array (which is ca. 500 times smaller than the size of the smaller of the two arrays)? It would definitely reduce my memory use, but I don't see how it would increase speed. Basically, what you are suggesting is equivalent to converting my arrays into bit-arrays and then computing the intersection, and I truly don't see why that would be faster. Could someone maybe discuss this approach with me a bit further before I start creating bit/character arrays?

      Thank you so much for engaging in the conversation (++)

        You are right. See the following benchmarks:

        use strict;
        use warnings;
        use Benchmark 'cmpthese';

        sub create { map { rand() < $_[1] ? 1 : 0 } 1..$_[0] }

        sub compare1 {    # naive
            my $x = shift;
            my $n = shift;
            return map {
                my $y = $_;
                scalar grep { $y->[$_] == 1 and $x->[$_] == 1 } 0..$n-1
            } @_;
        }

        sub compare2 {    # first find 1s in x, then check in ys
            my $x = shift;
            my $n = shift;
            my @nxs = grep { $x->[$_] } 0..$n-1;
            return map {
                my $y = $_;
                scalar grep { $y->[$_] == 1 } @nxs
            } @_;
        }

        sub compare3 {    # stringify
            my $x = shift;
            $x = join '', @$x;
            return map { my $j = $x & join '', @$_; $j =~ tr/1/1/ } @_;
        }

        my $n  = 15000;
        my $p  = 0.005;
        my $ny = 10;
        my @x  = create $n, $p;
        my @ys = map { [ create $n, $p ] } 1..$ny;

        my @r1 = compare1 \@x, $n, @ys;
        my @r2 = compare2 \@x, $n, @ys;
        my @r3 = compare3 \@x, @ys;
        print "compare1: @r1\n";
        print "compare2: @r2\n";
        print "compare3: @r3\n";

        cmpthese( -5, {
            compare1 => sub { compare1 \@x, $n, @ys },
            compare2 => sub { compare2 \@x, $n, @ys },
            compare3 => sub { compare3 \@x, @ys },
        } );
        Results:
                    Rate compare1 compare3 compare2
        compare1 48.8/s       --     -81%     -91%
        compare3  253/s     420%       --     -53%
        compare2  537/s    1002%     112%       --
Re: Comparing two arrays
by Lennotoecom (Monk) on Dec 15, 2013 at 12:13 UTC
    1.
    please tell what exactly you want to see at the end
    of your calculations?
    @a = 1 0 1
    @b = 1 0 0
    @c = ?
    2.
    15000 ints long means the length of each array?
    15000 elements in each array?
      Oh, I missed that part, sorry:
      @a = 1 0 1
      @b = 1 0 0
      @c = 1 0 0 # unnecessary as array, just the final count of 1's in array @c
      What I need is the total number of matched 1's, so in the example above I just need the number 1 as a result.

      " 15000int's long means length of an each array?
      15000 elements in each array? "

      Yes, 15000 integer values in every array that I'm dealing with, and yes, 15000 elements (1's or 0's): not 15000 of just 1's, not 15000 of just 0's, but #1's + #0's = 15000. That is, 15000 elements over an alphabet of size 2: array (x or y) = {x | x in A}, where A = {1,0}.

Re: Comparing two arrays
by BrowserUk (Pope) on Dec 15, 2013 at 12:16 UTC
    My end result is not to know, for each (x,y) pair, how many 1's they share, but only which are the top 10 y arrays that share the most 1's with each x array.

    Convert your arrays of 0s and 1s to bit-strings, then use bitwise-& and unpack '%32b*' to count the shared set bits, and you can do this 300+ times faster than comparing the arrays:

    #! perl -slw
    use strict;
    use Benchmark qw[ cmpthese ];
    use Data::Dump qw[ pp ];
    $Data::Dump::WIDTH = 500;

    our $I //= -1;
    our $N //= 1000;

    our @xArrays  = map[ map int( rand 2 ), 1 .. 15_000 ], 1 .. $N;
    our @yArrays  = map[ map int( rand 2 ), 1 .. 15_000 ], 1 .. $N;
    our @xStrings = map{ join '', @$_ } @xArrays;
    our @yStrings = map{ join '', @$_ } @yArrays;
    our @xBits    = map{ pack 'b*', $_ } @xStrings;
    our @yBits    = map{ pack 'b*', $_ } @yStrings;

    cmpthese $I, {
        array => q[
            my %top10s;
            for my $x ( 0 .. $#xArrays ) {
                for my $y ( 0 .. $#yArrays ) {
                    my $count = 0;
                    $xArrays[$x][$_] == 1 && $yArrays[$y][$_] == 1 and ++$count
                        for 0 .. $#{ $xArrays[ 0 ] };
                    $top10s{"$x:$y"} = $count;
                    my $discard = ( sort{ $top10s{$a} <=> $top10s{$b} } keys %top10s )[ 0 ];
                    keys( %top10s ) > 10 and delete $top10s{$discard};
                }
            }
            $I == 1 and pp ' arrays: ', %top10s;
        ],
        strings => q[
            my %top10s;
            for my $x ( 0 .. $#xStrings ) {
                for my $y ( 0 .. $#yStrings ) {
                    my $count = ( $xStrings[$x] & $yStrings[$y] ) =~ tr[1][];
                    $top10s{"$x:$y"} = $count;
                    my $discard = ( sort{ $top10s{$a} <=> $top10s{$b} } keys %top10s )[ 0 ];
                    keys( %top10s ) > 10 and delete $top10s{$discard};
                }
            }
            $I == 1 and pp 'strings: ', %top10s;
        ],
        bits => q[
            my %top10s;
            for my $x ( 0 .. $#xBits ) {
                for my $y ( 0 .. $#yBits ) {
                    my $count = unpack '%32b*', ( $xBits[$x] & $yBits[$y] );
                    $top10s{"$x:$y"} = $count;
                    my $discard = ( sort{ $top10s{$a} <=> $top10s{$b} } keys %top10s )[ 0 ];
                    keys( %top10s ) > 10 and delete $top10s{$discard};
                }
            }
            $I == 1 and pp ' bits: ', %top10s;
        ],
    };

    __END__
    C:\test>1067218 -N=100
                  Rate  array strings   bits
    array  1.95e-002/s     --    -98%  -100%
    strings     1.08/s  5417%      --   -82%
    bits        5.97/s 30510%    455%     --

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thank you so much for the code and the benchmark; after seeing this I'll try to implement the strategy. However, what I'm wondering now is where the speed comes from. When I search for a certain bit in a bit-string, I remember reading somewhere that the bit is found by iterating through the memory block, whereas accessing an array element is constant-time. Is it possible that those constants are so large that it is cheaper to linearly scan through memory blocks, or did I mix something up (which is probably the case)? Could you please educate me a "bit" :)

      Thank you

      baxy

        I'm wondering now where the speed comes from.

        Perhaps the simplest way to demonstrate the difference is to look at the number of opcodes generated in order to compare and count two sets of 64 bits stored as: two arrays; two strings of ASCII 1s and 0s; and two bitstrings of 64 bits each. You don't need to understand the opcodes to see the reduction.

        Moving as much of the work (the looping) into the optimised, compiled-C opcodes saves huge swaths of time and processor:

        1. Arrays:
          C:\test>perl -MO=Terse -E"@a=map{int rand 2}1..64;@b=map{int rand 2}1. +.64; for my$a(@a){ for my $b(@b){ $a==$b and ++$count }}" LISTOP (0x34e7c58) leave [1] OP (0x34eec40) enter COP (0x34e7c98) nextstate BINOP (0x34e7d00) aassign [9] UNOP (0x34e7d70) null [142] OP (0x34e7d40) pushmark LOGOP (0x34e7e90) mapwhile [8] LISTOP (0x34e7f00) mapstart OP (0x34e7ed0) pushmark UNOP (0x34e7e58) null UNOP (0x34e7f40) null LISTOP (0x34e80d0) scope OP (0x34e8110) null [177] UNOP (0x34e8178) int [4] UNOP (0x34e81b0) rand [3] SVOP (0x34e81e8) const [7] IV +(0x33cca88) 2 UNOP (0x34e7f78) rv2av SVOP (0x34e7e20) const [26] AV (0x33c7570) UNOP (0x34e7de0) null [142] OP (0x34e7db0) pushmark UNOP (0x34e8220) rv2av [2] PADOP (0x34e8258) gv GV (0xa76c8) *a COP (0x34e7660) nextstate BINOP (0x34e76c8) aassign [18] UNOP (0x34e7738) null [142] OP (0x34e7708) pushmark LOGOP (0x34e7858) mapwhile [17] LISTOP (0x34e78c8) mapstart OP (0x34e7898) pushmark UNOP (0x34e7820) null UNOP (0x34e7908) null LISTOP (0x34e7a98) scope OP (0x34e7ad8) null [177] UNOP (0x34e7b40) int [13] UNOP (0x34e7b78) rand [12] SVOP (0x34e7bb0) const [16] IV + (0x33c6e30) 2 UNOP (0x34e7940) rv2av SVOP (0x34e77e8) const [27] AV (0x33c6830) UNOP (0x34e77a8) null [142] OP (0x34e7778) pushmark UNOP (0x34e7be8) rv2av [11] PADOP (0x34e7c20) gv GV (0x33c6f40) *b COP (0x34eecb0) nextstate BINOP (0x34eed18) leaveloop LOOP (0x34eee30) enteriter [19] OP (0x34eee88) null [3] UNOP (0x34eef28) null [142] OP (0x34eeef8) pushmark UNOP (0x34ef568) rv2av [21] PADOP (0x34e75b8) gv GV (0xa76c8) *a UNOP (0x34eed58) null LOGOP (0x34eed90) and OP (0x34eee00) iter LISTOP (0x34eef68) lineseq COP (0x34eefa8) nextstate BINOP (0x34ef010) leaveloop LOOP (0x34ef128) enteriter [22] OP (0x34ef180) null [3] UNOP (0x34ef220) null [142] OP (0x34ef1f0) pushmark UNOP (0x34ef4c8) rv2av [24] PADOP (0x34ef500) gv GV (0x33c6f4 +0) *b UNOP (0x34ef050) null LOGOP (0x34ef088) and OP (0x34ef0f8) iter LISTOP (0x34ef260) lineseq COP (0x34ef2a0) nextstate UNOP 
(0x34ef308) null LOGOP (0x34ef340) and BINOP (0x34ef428) eq OP (0x34ef498) padsv [ +19] OP (0x34ef468) padsv [ +22] UNOP (0x34ef380) preinc UNOP (0x34ef3b8) null +[15] PADOP (0x34ef3f0) +gvsv GV (0x33c5ed0) *count OP (0x34ef0c8) unstack OP (0x34eedd0) unstack -e syntax OK
        2. Strings:
          C:\test>perl -MO=Terse -E"$a=join'',map{int rand 2}1..64;@b=map{int ra +nd 2}1..64; $count=($a&$b)=~tr[1][]" LISTOP (0x3447bc0) leave [1] OP (0x344f178) enter COP (0x3447c00) nextstate BINOP (0x3447c68) sassign LISTOP (0x3447cd8) join [8] OP (0x3447ca8) pushmark SVOP (0x3448118) const [22] PV (0x332ca20) "" LOGOP (0x3447d88) mapwhile [7] LISTOP (0x3447df8) mapstart OP (0x3447dc8) pushmark UNOP (0x3447d50) null UNOP (0x3447e38) null LISTOP (0x3447fc8) scope OP (0x3448008) null [177] UNOP (0x3448070) int [3] UNOP (0x34480a8) rand [2] SVOP (0x34480e0) const [6] IV +(0x332cb58) 2 UNOP (0x3447e70) rv2av SVOP (0x3447d18) const [23] AV (0x3327640) UNOP (0x3448150) null [15] PADOP (0x3448188) gvsv GV (0xa76a8) *a COP (0x34475c8) nextstate BINOP (0x3447630) aassign [17] UNOP (0x34476a0) null [142] OP (0x3447670) pushmark LOGOP (0x34477c0) mapwhile [16] LISTOP (0x3447830) mapstart OP (0x3447800) pushmark UNOP (0x3447788) null UNOP (0x3447870) null LISTOP (0x3447a00) scope OP (0x3447a40) null [177] UNOP (0x3447aa8) int [12] UNOP (0x3447ae0) rand [11] SVOP (0x3447b18) const [15] IV + (0x3326f00) 2 UNOP (0x34478a8) rv2av SVOP (0x3447750) const [24] AV (0x3326900) UNOP (0x3447710) null [142] OP (0x34476e0) pushmark UNOP (0x3447b50) rv2av [10] PADOP (0x3447b88) gv GV (0x3327010) *b COP (0x344f1e8) nextstate BINOP (0x344f250) sassign UNOP (0x344f290) null BINOP (0x344f3e8) bit_and [21] UNOP (0x344f498) null [15] PADOP (0x34474e0) gvsv GV (0xa76a8) *a UNOP (0x344f428) null [15] PADOP (0x344f460) gvsv GV (0x3327010) *b PVOP (0x344f3b0) trans UNOP (0x3447518) null [15] PADOP (0x3447550) gvsv GV (0x33262d0) *count -e syntax OK
        3. Bits:
          C:\test>perl -MO=Terse -E"$a=int rand 2**64;$b=int rand 2**64; $count += unpack '%32b*', $a & $b" LISTOP (0x33e7460) leave [1] OP (0x33e6e60) enter COP (0x33e74a0) nextstate BINOP (0x33e7508) sassign UNOP (0x33e7548) int [4] UNOP (0x33e7580) rand [3] SVOP (0x33e75b8) const [13] NV (0x32ca498) 1.844674407 +37096e+019 UNOP (0x33e76a0) null [15] PADOP (0x33e76d8) gvsv GV (0x107668) *a COP (0x33e71f0) nextstate BINOP (0x33e7258) sassign UNOP (0x33e7298) int [8] UNOP (0x33e72d0) rand [7] SVOP (0x33e7308) const [14] NV (0x32ca5a0) 1.844674407 +37096e+019 UNOP (0x33e73f0) null [15] PADOP (0x33e7428) gvsv GV (0x32ca510) *b COP (0x33e6ed0) nextstate BINOP (0x33e6f38) sassign LISTOP (0x33e6fa8) unpack OP (0x33e6f78) null [3] SVOP (0x33e7108) const [15] PV (0x32ca600) "%32b*" BINOP (0x33e6fe8) bit_and [12] UNOP (0x33e7098) null [15] PADOP (0x33e70d0) gvsv GV (0x107668) *a UNOP (0x33e7028) null [15] PADOP (0x33e7060) gvsv GV (0x32ca510) *b UNOP (0x33e7140) null [15] PADOP (0x33e7178) gvsv GV (0x32ca5d0) *count -e syntax OK


        Just be careful to create your data as bitstrings in the first place. If you create arrays and then turn them into bitstrings to do the comparison, it is not that fast:

        use strict;
        use warnings;
        use Benchmark 'cmpthese';

        sub create { map { rand() < $_[1] ? 1 : 0 } 1..$_[0] }

        sub compare2a {    # first find 1s in x, then check in ys
            my $x = shift;
            my $n = shift;
            my @nxs = grep { $x->[$_] } 0..$n-1;
            return map { scalar grep {$_} @{$_}[@nxs] } @_;
        }

        sub compare4 {     # bitstrings
            my $x = shift;
            $x = pack 'b*', join '', @$x;
            return map { unpack '%32b*', ( $x & pack 'b*', join '', @$_ ) } @_;
        }

        my $n  = 15000;
        my $p  = 0.005;
        my $ny = 10;
        my @x  = create $n, $p;
        my @ys = map { [ create $n, $p ] } 1..$ny;

        my @r2a = compare2a \@x, $n, @ys;
        my @r4  = compare4 \@x, @ys;
        print "compare2a: @r2a\n";
        print "compare4:  @r4\n";

        cmpthese( -5, {
            compare2a => sub { compare2a \@x, $n, @ys },
            compare4  => sub { compare4 \@x, @ys },
        } );
        Result:
                    Rate compare4 compare2a
        compare4   246/s       --      -55%
        compare2a  543/s     120%        --
Re: Comparing two arrays
by educated_foo (Vicar) on Dec 15, 2013 at 12:57 UTC
    In addition to what others have said -- use packed bit-arrays -- you can sort the X arrays first, then do a binary search to find the closest X for each Y (~16 comparisons apiece).
      you can sort the X arrays first, then do a binary search to find the closest X for each Y

      Sort by what criteria? "closest" by what criteria?


        Sort by what criteria? "closest" by what criteria?
        That obviously depends upon what the OP is trying to do. Maybe he wants lexicographic sorting, or maybe some other mapping; it's not specified in the question, so I didn't guess.
Re: Comparing two arrays
by oiskuu (Monk) on Dec 15, 2013 at 18:57 UTC

    Essentially then, you have a sparse matrix of booleans. With a density of 1/500, bitvectors will not produce the most compact form.

    However, the most compact form is not of the essence here. When you search arbitrary data in multiple dimensions or by multiple keys, you need to index by each key; in effect you create a number of copies of the data. Here the index (address) of a bit is many times bigger than that one bit itself.

    Now, here's the solution I'm thinking of. Make an AoA of the X vectors, bucketed by bit ID: that is, 15k buckets, each containing an estimated 200 IDs of X vectors. Make an array N[X] that counts hits per X vector. These structures you can keep in memory.

    Update 2. Reconsidering the above, I realise intermediate storage is unnecessary. However, if you populate the 15k buckets with all the Y vectors (this time), there will be approx. 10k entries per bucket, totaling 4*10k*15k = 600M bytes. Sort each bucket. Matching an X vector then involves scanning ~30 ways * 10k entries in a manner very similar to merge sort. This should take less than a second... (Inline C)

      I wrote a crude implementation of merge(sort) to estimate the matching speed: 30 buckets, each with ~10k sorted, "l/l"-packed numbers (representing y-vector IDs). Merge and scan takes about 5 ms (unless I've botched it somewhere). A full search is ten minutes at that rate. Hm.

      Update. SSE4.1 version doubled the performance.

Re: Comparing two arrays
by Anonymous Monk on Dec 15, 2013 at 21:43 UTC
    I would try to use a hash to some advantage. For example, total up the sum of the entries, and perhaps the index of the leftmost '1', then make that into a hash key. The essential goal is to minimize the number of arrays that you actually search, no matter how you go about searching or representing them. The hash won't tell you which entry (if any) matches, but only the entries in the chosen bucket need be considered.

      The essential goal is to minimize the number of arrays that you actually search, no matter how you go about searching or representing them.

      This was my thought too; with that many items to cross-compare, various "cheats" to cut down on how many you actually need to search are profitable. Even something as simple as keeping a count of the number of 1's in each array gives you a really cheap way of asking "is it possible that these two arrays are the same?"

      You'll get false positives (probably a lot of them), but you are guaranteed no false negatives, so that can chop orders of magnitude off the problem. With a 15,000-element array, you know there is some number of 1's between 0 and 15000; if the counts were evenly distributed (which they won't be, not even close, but it's a starting guess), you would immediately drop from "test these 5 million" to "test these 333", which is a huge speedup. Even if the distribution is off by an order or two of magnitude, dropping to 3,000 or 30,000 comparisons is a giant gain over 5 million.

      Then adding some additional cheap-to-store/compare hints (maybe not necessarily cheap to calculate, but you may be able to compute them while the arrays are built, for minimal added expense) like the leftmost-1 index, the rightmost-1 index, the longest run of 1's, or some other property meaningful to your data set, can chop that down even further. At what point the diminishing returns start hurting you is situation-specific; it would take knowing your data, and possibly some experimentation, to find it.

      In essence, these sorts of universe-narrowing tricks can be seen as variations on the theme of Bloom filters. A quick glance at CPAN shows some Bloom filter implementations, but it's hard to say whether they're well suited to this particular type/scale of usage. They may, on deeper investigation, prove to be so...

Re: Comparing two arrays
by sundialsvc4 (Monsignor) on Dec 16, 2013 at 13:05 UTC

    Here’s another way to do it: scan the arrays to convert them to a vector using some form of “run-length encoding” -- for instance, a list containing the position of the next '1' and the number of '1's that occur. Now for the magic: use a Digest algorithm of some sort ... could be CRC, could be SHA1, could be anything ... to convert that vector into a single value that can be used as a hash key.

    Actually, if you use a truly decent message-digest algorithm, like SHA1, you probably won’t have to do anything else, because one and only one vector will map to the same hash value. You have to preprocess each array one time to compute its message digest, but from there on out you might not have to examine the array contents at all; a comparison of the digests ought to be sufficient. Aside from the overhead of passing through each array one time to calculate its digest, the overhead of laborious comparison ... disappears.

      one and only one vector will map to the same hash value
      Do you mean there are no hash collisions for SHA1?
      لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      sundialsvc4:

      If you're looking for duplicate vectors, digesting can greatly reduce the number of comparisons, but it won't let you escape them altogether: http://en.wikipedia.org/wiki/Pigeonhole_principle.

      Update: s/While/If you're looking for duplicate vectors/ and changed conjunction so the sentence still reads well.

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.

        While digesting can greatly reduce the number of comparisons

        That would only be true if the OP was looking for exact matches. He isn't.

        He's looking for the best matches, where 'best' is defined in terms of the number of set bits in matching positions. No hashing, digesting nor sorting approach to this problem is possible.

        Every X must be fully compared against every Y.


      the overhead of laborious comparison ... disappears.

      Total garbage.



        It is, at best, very rude to say “total garbage” in response to a post ... it doesn’t make you look like a genius, merely a boor. (I searched the entire thread for the word “best” and did not find it. Perhaps your eyes are better than mine.)

        That being said, if the requirement is for “a matching entry,” then the technique that I described will work extremely well. Strictly speaking, yes, it is possible for (say) a hash collision to occur, but with a strong algorithm like SHA1 it basically isn’t going to happen.

        Any exploitable “predictable fact” about the actual data can be profitably used to reduce the search-space, and (IMHO) in a situation such as this, pragmatically must be.   Even something as basic as “the total number of 1’s” can be pressed into service.   If you can “reasonably predict” that the differences between the searched-for array and the “best fit” that will be found will consist, let’s say, of a change-of-state of no more than (some) n positions, then even a brute-force search could be limited to consider only the candidates which fall into that range, perhaps starting with any exact-matches and then working outward x for x in (1..n).   This does, of course, open up the possibility of a statistical Type-2 Error (concluding that no best-match exists when in fact one does), but this might be judged to be either acceptable or necessary.   (Or not ...)

        If necessity really must become the mother of invention, and once again if you know that there are exploitable characteristics of the data, it is also possible to apply hashing or ones-counting to slices of the total vector.   Instead of merely counting all the 1’s, count them in every (say) thousand bits.   Apply some useful heuristic to this vector of sums to decide whether you choose to examine the whole thing.

        In the end, the problem won’t be completely-abstract, nor will be its solution.   There must be something, in the real world, that stipulates what is and what is not “best,” or even “plausible.”   It’s my opinion that you must solve this problem, at least in significant part, by reducing the total number of vectors that you consider, and by selecting for consideration only those which are “most likely.”   Representational optimizations such as the use of bit-vectors may also be important, but even these can’t be applied in a brute-force way.   You’ll simply never get the work done.

      This method was considered and discarded in the OP:

      I was thinking about computing a hash key (a checksum of some sort), but then I would be able to identify only identical (x,y) pairs.

Node Type: perlquestion [id://1067218]
Approved by hdb
Front-paged by snoopy