|Don't ask to ask, just ask|
Re: Comparing two arraysby oiskuu (Hermit)
|on Dec 15, 2013 at 18:57 UTC||Need Help??|
Essentially then, you have a sparse matrix of booleans. With a density of 1/500, bitvectors will not produce the most compact form.
However, the most compact form is not of essence here. When you search arbitrary data in multiple dimensions or by multiple keys, you'll need to index by each key. In effect you create a number of copies of the data. Here the index (address) of a bit is many times bigger than that one bit itself.
Now, here's the solution I'm thinking of. Make an AoA of the vectors X, bucketed by bit ID. That is, 15k buckets each containing estimated 200 ID's of vectors X.
When processing, preallocate arrays and push the X's. Then, for each vector Y, store the hits: store($x, $N[$x]++, $y); With ca 30*200 hits per vector, total hits are 5M*6k == 30G. Say 3* the expected hits, 4 bytes per id, this comes to 3*4*30G == 360 GB. The store just writes integer y at offset 3.6e6*x + 4*N.
Finding best matches is a linear scan of that 360GB file. Each 3.6MB segment corresponds to all hits of vector X. Sort(?), count, keep the best. With a big cluster, you could partition the whole processing and run it all in memory. Anyway, use a mmap'ed file.
Update2. Reconsidering the above, I realise intermediate storage is unnecessary. However, if you populate 15k buckets with all Y vectors (this time), there will be approx 10k entries per bucket, totaling 4*10k*15k = 600M bytes. Sort each bucket. Matching an X vector then involves scanning ~30 ways *10k entries in a manner that is very similar to merge sort. This should take less than a second... (Inline C)