Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask

Re: Comparing two arrays

by oiskuu (Hermit)
on Dec 15, 2013 at 18:57 UTC ( #1067249=note: print w/ replies, xml ) Need Help??

in reply to Comparing two arrays

Essentially then, you have a sparse matrix of booleans. With a density of 1/500, bitvectors will not produce the most compact form.

However, the most compact form is not of essence here. When you search arbitrary data in multiple dimensions or by multiple keys, you'll need to index by each key. In effect you create a number of copies of the data. Here the index (address) of a bit is many times bigger than that one bit itself.

Now, here's the solution I'm thinking of. Make an AoA of the vectors X, bucketed by bit ID. That is, 15k buckets each containing estimated 200 ID's of vectors X. Make an array N[X] that counts hits per vector X. These structures you can keep in memory.

When processing, preallocate arrays and push the X's. Then, for each vector Y, store the hits: store($x, $N[$x]++, $y); With ca 30*200 hits per vector, total hits are 5M*6k == 30G. Say 3* the expected hits, 4 bytes per id, this comes to 3*4*30G == 360 GB. The store just writes integer y at offset 3.6e6*x + 4*N.

Finding best matches is a linear scan of that 360GB file. Each 3.6MB segment corresponds to all hits of vector X. Sort(?), count, keep the best. With a big cluster, you could partition the whole processing and run it all in memory. Anyway, use a mmap'ed file.

Update: Modern CPU's may have 512 DTLB entries or so. It is probably best if you process no more than ~500 X-vectors at a time (per thread). This amounts to 200 passes over your Y-vectors. Hence, compacting the Y ought to be the first step. Again, a cluster of 200+ cores might come handy. Actually, this won't be a bottleneck (hm?).

Update2. Reconsidering the above, I realise intermediate storage is unnecessary. However, if you populate 15k buckets with all Y vectors (this time), there will be approx 10k entries per bucket, totaling 4*10k*15k = 600M bytes. Sort each bucket. Matching an X vector then involves scanning ~30 ways *10k entries in a manner that is very similar to merge sort. This should take less than a second... (Inline C)

Replies are listed 'Best First'.
Re^2: Comparing two arrays
by oiskuu (Hermit) on Dec 16, 2013 at 10:49 UTC
    I wrote a crude implementation of merge(sort) to estimate the matching speed. 30 buckets each with ~10k sorted, "l/l" packed numbers (representing y-vector ID's). Merge and scan takes about 5 ms (unless I've botched it somewhere). Full search is ten minutes at that rate. Hm.

    Update. SSE4.1 version doubled the performance.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1067249]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (9)
As of 2016-06-25 12:43 GMT
Find Nodes?
    Voting Booth?
    My preferred method of making French fries (chips) is in a ...

    Results (325 votes). Check out past polls.