Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Re^3: What is the best way to store and look-up a "similarity vector"?

by BrowserUk (Pope)
on Nov 14, 2013 at 18:07 UTC ( #1062634=note: print w/ replies, xml ) Need Help??


in reply to Re^2: What is the best way to store and look-up a "similarity vector"?
in thread What is the best way to store and look-up a "similarity vector" (correlating, similar, high-dimensional vectors)?

I might receive more and more properties I haven't planned into my schema/bitmap/bitvector layout.

Use a two-step schema.

  1. Each attribute/value pair gets added to a single table with an auto-incrementing primary key when it is first seen.

    Using your example above, 'color:grey', 'material:wood', & 'surface:rough' are your attribute/value pairs (A/V pairs).

    The auto-incremented 'id' value becomes that A/V pair's bit position in the data records.

  2. The data records are variable length strings of bits.

    As the number of A/V pairs increase, so do the length of (new) records, but the bit positions of existing A/V pairs retain their original meanings, so the schema does not need to change.

To select records, you build up a bit mask with bits set for the sought A/V pairs, and then AND it with each record. For similarity, count the resultant bits.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Comment on Re^3: What is the best way to store and look-up a "similarity vector"?
Re^4: What is the best way to store and look-up a "similarity vector"?
by isync (Hermit) on Nov 14, 2013 at 18:54 UTC
    Just in order to wrap my head around what you propose:
    Properties:
    id|A/V-pairs
    -----------------
     0|colour:red
     1|colour:green
     2|material:metal
     3|colour:blue
     4|material:wood
     5|surface:rough
    
    And then our thing db:
    things:
    bitmap|thing
    012345|name
    -----------------
    101011|red-metal-wood-rough-Thing
    001101|metal-blue-rough-Thing
    01    |green-Thing
    

    Right?? So far so good, now fetching records:
    sub getBits { # lookup: colour:red -> is id/bitposition:0 # lookup: material:wood -> is id/bitposition:4 } my $bits = getBits('red-wood'); # $bits is 10001 my $nBits = unpack '%32b*', $bits; # http://docstore.mik.ua/orelly/per +l/prog/ch03_182.htm : "efficiently counts the number of set bits in a + bit vector" for my $straw ( @haystack ){ # loop over all records and compare my $similarity = unpack '%32b*', $straw & needle; # compute a delt +a print "Percentage similarity %f\n", $similarity / $nBits * 100; # +delta in relation to nbits benchmark ("distance") } # then, sort by similarity
    Questions:
    - Did I get that right?
    - Let's assume we've got millions of records, is that looping+comparing efficient?
    - Any suggestions for a storage backend to implement that? Might be, there's one that offers bitmap-comparisons
      1. Did I get that right?

        Yes.

      2. Let's assume we've got millions of records, is that looping+comparing efficient?

        Yes. Very. By way of example, this code generates $RECS records with $BITS A/V pairs and compares a randomly generated selection set and counts those that have a greater than $TARGET 'similarity':

        #! perl -slw use strict; use Time::HiRes qw[ time ]; sub randNbits { pack 'C*', map rand(256), 1 .. ( shift() / 8 )} our $TARGET //= 0.75; # 75% match our $RECS //= 1e6; our $BITS //= 1024; my @recs; $recs[ $_ ] = randNbits( $BITS ) for 0 .. $RECS-1; my $select = randNbits( $BITS ); my $count = 0; my $start = time; for my $rec ( @recs ) { my $score = unpack( '%32b*', $select & $rec ) / $BITS; ++$count if $score >= $TARGET; } printf "Comparing $RECS records on $BITS attributes to discover $count + matches took: %.9f s/rec\n", ( time() - $start ) / $RECS;

        A few runs show that it scales extremely well:

        C:\test>1062617 -RECS=1e6 -BITS=64 -TARGET=0.33 Comparing 1e6 records on 64 attributes to discover 121722 matches took +: 0.000001229 s/rec C:\test>1062617 -RECS=1e6 -BITS=256 -TARGET=0.33 Comparing 1e6 records on 256 attributes to discover 1200 matches took: + 0.000000760 s/rec C:\test>1062617 -RECS=1e6 -BITS=1024 -TARGET=0.33 Comparing 1e6 records on 1024 attributes to discover 0 matches took: 0 +.000000936 s/rec

        No traditional SQL-based RDBS could come even close to achieving sub 1 microsecond per record fuzzy selection for 1000+ fields across 1 million records.

      3. Any suggestions for a storage backend to implement that? Might be, there's one that offers bitmap-comparisons

        It really depends upon the way this will be used. If the application is a standalone, one-run-at-a-time (ie. non-web; non client-server) affair I might opt for something like BerkeleyDB for the A/V pairs to bit-position mapping; and a simple, binary flat file for the records.

        However, if this is going to be used via a web-based or other client-server archietecture where data concurrency becomes an issue, then I would probably look to an RDBMS that supports bitstrings. Eg. Postresql bitstrings & bit varying datatype.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        (Shaking my head in disbelief of your exhaustive help and impressive algorithm savviness...)
        BrowserUk, once again top discussion and help!

        Thank you very much. Both thumbs up!.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1062634]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (4)
As of 2014-09-24 04:09 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (245 votes), past polls