I think, I'm applying a human perspective on information retrieval here. I didn't even thought about something as exact as hashing.
Although I remember some digests, could've been TEA, actually behave like the data I was referring to. Hashes begin to differ more when the underlying data differs more. Like
"Foo a" would hash to "a1b2c3d4", and
"Foo b" would hash to "a1b2c3d5"
- they are similar to a certain degree. A query for "a1b2c3d*" in a similar as the proposed system would return both. But the example requires the properties to adhere to a certain order.
Back to the "human perspective": what I have in mind is something like "is this your car?", you begin to apply filters: right color, make, etc. But how sure can you be, ever? "Matching" in life is an approximatory process - question is: how do I implement that as an algorithm? | [reply] |
I've used locality-sensitive hashing before; that might be something like what you remember. As for what you actually want... if I knew how to do it, I'd probably be charging a lot of money for it.
| [reply] |
Right, traditional hashing prefers to "avalanche" the bits, so that any small change will yield results far apart. The kind of hashing you need is locality-sensitive hashing, exactly as educated_foo pointed out. LSH does the opposite of normal hashing—it seeks to collide similar inputs.
Located some dusty slides I remembered seeing (05-LSH). More keywords to research: Jaccard Similarity, MinHashing, Shingling, MinHash Signatures, etc.
Anyway, this is a spooky topic. These techniques are useful for de-anonymizing big data.
| [reply] |
The LSH hint brought me to even more related keywords (noted here for the accidental passer by):
- on Wikipedia: Nearest_neighbor_search, Locality-sensitive_hashing, Hierarchical_clustering
- on CPAN: Algorithm::Cluster, Algorithm::KMeans, Text::Bayon, Algorithm::LSH, Jubatus
And yes, it is a spooky topic ;)
| [reply] |