Re: What is the best way to store and look-up a "similarity vector"?

Assuming you have a retrieval task where you need to know the "nearest neighbours"-set or drill down to find the "one exactly matching"-entry among a vast number of entries.

These are completely different problems. To find exactly matching things, you can just serialize and use hashing. To find "close" things, you need a measure of "closeness," and a much more complicated data structure.

Speaking of bit-vectors: I don't know enough about the maturity of bit-vector related modules up on CPAN to judge if they are any good for production tasks.

vec and the bitwise string operators work pretty well.

Comment on Re: What is the best way to store and look-up a "similarity vector"?

Replies are listed 'Best First'.
Re^2: What is the best way to store and look-up a "similarity vector"? by isync (Hermit) on Nov 14, 2013 at 16:50 UTC
I think, I'm applying a human perspective on information retrieval here. I didn't even thought about something as exact as hashing. Although I remember some digests, could've been TEA, actually behave like the data I was referring to. Hashes begin to differ more when the underlying data differs more. Like "Foo a" would hash to "a1b2c3d4", and "Foo b" would hash to "a1b2c3d5" - they are similar to a certain degree. A query for "a1b2c3d*" in a similar as the proposed system would return both. But the example requires the properties to adhere to a certain order. Back to the "human perspective": what I have in mind is something like "is this your car?", you begin to apply filters: right color, make, etc. But how sure can you be, ever? "Matching" in life is an approximatory process - question is: how do I implement that as an algorithm?	[reply]
Re^3: What is the best way to store and look-up a "similarity vector"? by educated_foo (Vicar) on Nov 14, 2013 at 20:15 UTC
I've used locality-sensitive hashing before; that might be something like what you remember. As for what you actually want... if I knew how to do it, I'd probably be charging a lot of money for it.	[reply]
Re^3: What is the best way to store and look-up a "similarity vector"? by oiskuu (Hermit) on Nov 15, 2013 at 09:42 UTC
Right, traditional hashing prefers to "avalanche" the bits, so that any small change will yield results far apart. The kind of hashing you need is locality-sensitive hashing, exactly as educated_foo pointed out. LSH does the opposite of normal hashing—it seeks to collide similar inputs. Located some dusty slides I remembered seeing (05-LSH). More keywords to research: Jaccard Similarity, MinHashing, Shingling, MinHash Signatures, etc. Anyway, this is a spooky topic. These techniques are useful for de-anonymizing big data.	[reply]
Re^4: What is the best way to store and look-up a "similarity vector"? by isync (Hermit) on Nov 15, 2013 at 12:47 UTC
The LSH hint brought me to even more related keywords (noted here for the accidental passer by): on Wikipedia: Nearest_neighbor_search, Locality-sensitive_hashing, Hierarchical_clustering on CPAN: Algorithm::Cluster, Algorithm::KMeans, Text::Bayon, Algorithm::LSH, Jubatus And yes, it is a spooky topic ;)	[reply]