Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Re: What is the best way to store and look-up a "similarity vector"?

by BrowserUk (Pope)
on Nov 14, 2013 at 17:07 UTC ( #1062628=note: print w/ replies, xml ) Need Help??


in reply to What is the best way to store and look-up a "similarity vector" (correlating, similar, high-dimensional vectors)?

But is that how a project like Panopticlick calculates "similarity" across bit-vectors?

I've no idea about pana-pano, that thing, but ostensibly, it can be as simple as counting the number of matching bits (properties):

my $needle = getBits(); my $nBits = unpack '%32b*', $needle; for my $straw ( @haystack ) { my $similarity = unpack '%32b*', $straw & needle; print "Percentage similarity %f\n", $similarity / $nBits * 100; }

For parts/product catalogue type applications, bit-mapped data records can be very effective and efficient, because their attributes tend to have a fixed (and limited) number of values. Ie. Half a dozen colors; half a dozen sizes; 2 or 3 finishes; etc. That means each choice can be represented by a bit position in a vector. Selection can be extremely fast.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Comment on Re: What is the best way to store and look-up a "similarity vector"?
Download Code
Re^2: What is the best way to store and look-up a "similarity vector"?
by isync (Hermit) on Nov 14, 2013 at 17:38 UTC
    Yes, totally agree. For example, where would file-systems be without bit-maps?!

    My problem is that my data is more or less schema-less. I know about a certain amount of properties that might show up, but similar to the "Browser-fingerprint" problem, I might receive more and more properties I haven't planned into my schema/bitmap/bitvector layout. Knowing all varieties is then similar to an algorithm that requires me to have properties in a fixed order, to align them to each other.
      I might receive more and more properties I haven't planned into my schema/bitmap/bitvector layout.

      Use a two-step schema.

      1. Each attribute/value pair gets added to a single table with an auto-incrementing primary key when it is first seen.

        Using your example above, 'color:grey', 'material:wood', & 'surface:rough' are your attribute/value pairs (A/V pairs).

        The auto-incremented 'id' value becomes that A/V pair's bit position in the data records.

      2. The data records are variable length strings of bits.

        As the number of A/V pairs increase, so do the length of (new) records, but the bit positions of existing A/V pairs retain their original meanings, so the schema does not need to change.

      To select records, you build up a bit mask with bits set for the sought A/V pairs, and then AND it with each record. For similarity, count the resultant bits.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Just in order to wrap my head around what you propose:
        Properties:
        id|A/V-pairs
        -----------------
         0|colour:red
         1|colour:green
         2|material:metal
         3|colour:blue
         4|material:wood
         5|surface:rough
        
        And then our thing db:
        things:
        bitmap|thing
        012345|name
        -----------------
        101011|red-metal-wood-rough-Thing
        001101|metal-blue-rough-Thing
        01    |green-Thing
        

        Right?? So far so good, now fetching records:
        sub getBits { # lookup: colour:red -> is id/bitposition:0 # lookup: material:wood -> is id/bitposition:4 } my $bits = getBits('red-wood'); # $bits is 10001 my $nBits = unpack '%32b*', $bits; # http://docstore.mik.ua/orelly/per +l/prog/ch03_182.htm : "efficiently counts the number of set bits in a + bit vector" for my $straw ( @haystack ){ # loop over all records and compare my $similarity = unpack '%32b*', $straw & needle; # compute a delt +a print "Percentage similarity %f\n", $similarity / $nBits * 100; # +delta in relation to nbits benchmark ("distance") } # then, sort by similarity
        Questions:
        - Did I get that right?
        - Let's assume we've got millions of records, is that looping+comparing efficient?
        - Any suggestions for a storage backend to implement that? Might be, there's one that offers bitmap-comparisons

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1062628]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (6)
As of 2014-07-31 05:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (245 votes), past polls