
Essentially then, you have a sparse matrix of booleans. With a density of 1/500, bitvectors will not produce the most compact form.

However, the most compact form is not of the essence here. When you search arbitrary data in multiple dimensions or by multiple keys, you need an index for each key. In effect you create a number of copies of the data. Here the index (address) of a bit is many times bigger than the one bit itself.

Now, here's the solution I'm thinking of. Make an AoA of the vectors X, bucketed by bit ID. That is, 15k buckets, each containing an estimated 200 IDs of vectors X. ~~Make an array N[X] that counts hits per vector X~~. These structures you can keep in memory.
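A minimal sketch of that bucketing, assuming each X vector is available as a list of its set bit IDs. The input format, the toy data and the name build_buckets are made up for illustration; they are not from the original question.

```perl
use strict;
use warnings;

# Hypothetical input: each X vector is given as a list of its set bit IDs.
my @x_vectors = (
    [0, 3, 7],    # vector X id 0
    [3, 4],       # vector X id 1
    [0, 7, 9],    # vector X id 2
);

# build_buckets(\@vectors) returns an array of arrays (AoA):
# $buckets->[$bit] is the list of vector ids that have $bit set.
sub build_buckets {
    my ($vectors) = @_;
    my @buckets;
    for my $id (0 .. $#$vectors) {
        push @{ $buckets[$_] }, $id for @{ $vectors->[$id] };
    }
    return \@buckets;
}

my $buckets = build_buckets(\@x_vectors);
for my $bit (0 .. $#$buckets) {
    my $ids = $buckets->[$bit] or next;   # skip bits that no vector sets
    print "bit $bit: @$ids\n";
}
```

Since vector ids are pushed in increasing order, each bucket comes out sorted without an extra sort step.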

When processing, preallocate arrays and push the X's. Then, for each vector Y, store the hits: store($x, $N[$x]++, $y); With ca. 30*200 hits per vector Y, total hits are 5M*6k == 30G. Allowing room for 3* the expected hits at 4 bytes per id, this comes to 3*4*30G == 360 GB. The store just writes integer y at byte offset 3.6e6*x + 4*N, where N is the number of hits already stored for x.
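The offset arithmetic of the store can be sketched as follows. This is only an illustration under assumptions: a plain in-memory string stands in for the big file, the segment size is scaled down to 8 slots (the figures above correspond to 900_000 slots of 4 bytes, i.e. 3.6e6 bytes per X), and the helper hits() is a made-up name for reading a segment back.

```perl
use strict;
use warnings;

# Assumed, scaled-down layout: SLOTS ids of 4 bytes each per X vector.
# The figures in the post would correspond to SLOTS = 900_000.
use constant SLOTS => 8;

my $buf = '';   # in-memory stand-in for the big file
my @N;          # $N[$x] = number of hits stored so far for vector $x

# store($x, $n, $y): write id $y into slot $n of the segment of vector $x.
# Counted in 32-bit elements the index is SLOTS*$x + $n, which is byte
# offset 4*SLOTS*$x + 4*$n.
sub store {
    my ($x, $n, $y) = @_;
    die "segment of vector $x is full\n" if $n >= SLOTS;
    vec($buf, SLOTS * $x + $n, 32) = $y;
}

# hits($x): read back the ids stored so far for vector $x.
sub hits {
    my ($x) = @_;
    return map { vec($buf, SLOTS * $x + $_, 32) } 0 .. ($N[$x] // 0) - 1;
}

store(2, $N[2]++, 41);
store(0, $N[0]++, 7);
store(2, $N[2]++, 99);
print "hits of X=2: @{[ hits(2) ]}\n";
```

The die in store() is there because the 3* headroom is only an estimate; a vector with unusually many hits would otherwise overflow into its neighbour's segment.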

Finding the best matches is a linear scan of that 360 GB file. Each 3.6 MB segment holds all hits of one vector X. Sort(?), count, keep the best. With a big cluster, you could partition the whole processing and run it all in memory. Anyway, use an mmap'ed file.
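The "sort, count, keep the best" step for one segment might look roughly like this. It assumes the segment has already been unpacked into a list of Y ids, one entry per shared bit; best_match is a hypothetical name, and breaking ties in favour of the smaller id is my choice, not part of the scheme.

```perl
use strict;
use warnings;

# best_match(@ids): given all Y ids recorded for one X vector (one id
# per shared bit), return the id seen most often and its count.
# On a tie the smallest id wins.
sub best_match {
    my @ids = sort { $a <=> $b } @_;
    my ($best, $best_n) = (undef, 0);
    my $i = 0;
    while ($i < @ids) {
        my $j = $i;
        $j++ while $j < @ids && $ids[$j] == $ids[$i];   # end of this run
        ($best, $best_n) = ($ids[$i], $j - $i) if $j - $i > $best_n;
        $i = $j;
    }
    return ($best, $best_n);
}

my ($y, $shared) = best_match(12, 5, 12, 40, 5, 12);
print "best Y = $y, shared bits = $shared\n";
```

A hash of counters would do the same job without sorting; the sorted variant is shown because it maps directly onto a flat array of integers.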

Update: Modern CPUs may have 512 DTLB entries or so. It is probably best to process no more than ~500 X-vectors at a time (per thread). This amounts to 200 passes over your Y-vectors. Hence, compacting the Y ought to be the first step. Again, a cluster of 200+ cores might come in handy. Actually, this won't be a bottleneck (hm?).

Update2: Reconsidering the above, I realise intermediate storage is unnecessary. Instead, if you populate the 15k buckets with all Y vectors (this time), there will be approx. 10k entries per bucket, totaling 4*10k*15k = 600M bytes. Sort each bucket. Matching an X vector then involves scanning ~30 ways * 10k entries in a manner that is very similar to merge sort. This should take less than a second... (Inline C)
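A rough sketch of that merge-style scan in plain Perl, just to show the logic. It assumes every bucket is sorted and holds each Y id at most once; merge_count is a made-up name, and a real run over ~30 buckets of 10k entries would want the Inline C version for speed.

```perl
use strict;
use warnings;

# merge_count(@lists): each list is a sorted array ref of Y ids (the
# bucket of one bit that the X vector has set). Walk all lists in step,
# as in the merge phase of merge sort, and return the id occurring in
# the most lists together with that count. Ties go to the smaller id.
sub merge_count {
    my @lists = grep { @$_ } @_;        # ignore empty buckets
    my @pos   = (0) x @lists;           # one cursor per list
    my ($best, $best_n) = (undef, 0);
    while (@lists) {
        # smallest id currently under any cursor
        my $min = $lists[0][ $pos[0] ];
        for my $k (1 .. $#lists) {
            my $v = $lists[$k][ $pos[$k] ];
            $min = $v if $v < $min;
        }
        # advance every cursor sitting on $min, counting them
        my $n = 0;
        for my $k (0 .. $#lists) {
            next unless $lists[$k][ $pos[$k] ] == $min;
            $n++;
            $pos[$k]++;
        }
        ($best, $best_n) = ($min, $n) if $n > $best_n;
        # drop exhausted lists together with their cursors
        my @keep = grep { $pos[$_] < @{ $lists[$_] } } 0 .. $#lists;
        @lists = @lists[@keep];
        @pos   = @pos[@keep];
    }
    return ($best, $best_n);
}

my ($y, $shared) = merge_count([3, 8, 15], [1, 8, 20], [8, 15, 30]);
print "best Y = $y, shared bits = $shared\n";
```

With only ~30 lists per X vector, a linear scan for the minimum is simple and probably adequate; a heap would only pay off for a much larger fan-in.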

In reply to Re: Comparing two arrays by oiskuu
in thread Comparing two arrays by baxy77bax
