Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Re: What is the best way to store and look-up a "similarity vector"?

by educated_foo (Vicar)
on Nov 14, 2013 at 16:31 UTC ( [id://1062625]=note: print w/replies, xml ) Need Help??


in reply to What is the best way to store and look-up a "similarity vector" (correlating, similar, high-dimensional vectors)?

Assuming you have a retrieval task where you need to know the "nearest neighbours"-set or drill down to find the "one exactly matching"-entry among a vast number of entries.
These are completely different problems. To find exactly matching things, you can just serialize and use hashing. To find "close" things, you need a measure of "closeness," and a much more complicated data structure.
Speaking of bit-vectors: I don't know enough about the maturity of bit-vector related modules up on CPAN to judge if they are any good for production tasks.
vec and the bitwise string operators work pretty well.
  • Comment on Re: What is the best way to store and look-up a "similarity vector"?

Replies are listed 'Best First'.
Re^2: What is the best way to store and look-up a "similarity vector"?
by isync (Hermit) on Nov 14, 2013 at 16:50 UTC
    I think, I'm applying a human perspective on information retrieval here. I didn't even thought about something as exact as hashing.
    Although I remember some digests, could've been TEA, actually behave like the data I was referring to. Hashes begin to differ more when the underlying data differs more. Like
    "Foo a" would hash to "a1b2c3d4", and
    "Foo b" would hash to "a1b2c3d5"
    
    - they are similar to a certain degree. A query for "a1b2c3d*" in a similar as the proposed system would return both. But the example requires the properties to adhere to a certain order.

    Back to the "human perspective": what I have in mind is something like "is this your car?", you begin to apply filters: right color, make, etc. But how sure can you be, ever? "Matching" in life is an approximatory process - question is: how do I implement that as an algorithm?
      I've used locality-sensitive hashing before; that might be something like what you remember. As for what you actually want... if I knew how to do it, I'd probably be charging a lot of money for it.
      Right, traditional hashing prefers to "avalanche" the bits, so that any small change will yield results far apart. The kind of hashing you need is locality-sensitive hashing, exactly as educated_foo pointed out. LSH does the opposite of normal hashing—it seeks to collide similar inputs.

      Located some dusty slides I remembered seeing (05-LSH). More keywords to research: Jaccard Similarity, MinHashing, Shingling, MinHash Signatures, etc.

      Anyway, this is a spooky topic. These techniques are useful for de-anonymizing big data.

        The LSH hint brought me to even more related keywords (noted here for the accidental passer by):
        • on Wikipedia: Nearest_neighbor_search, Locality-sensitive_hashing, Hierarchical_clustering
        • on CPAN: Algorithm::Cluster, Algorithm::KMeans, Text::Bayon, Algorithm::LSH, Jubatus
        And yes, it is a spooky topic ;)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1062625]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others exploiting the Monastery: (8)
As of 2024-04-18 06:49 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found