Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked

Re^5: What is the best way to store and look-up a "similarity vector"?

by BrowserUk (Pope)
on Nov 14, 2013 at 19:50 UTC ( #1062648=note: print w/replies, xml ) Need Help??

in reply to Re^4: What is the best way to store and look-up a "similarity vector"?
in thread What is the best way to store and look-up a "similarity vector" (correlating, similar, high-dimensional vectors)?

  1. Did I get that right?


  2. Let's assume we've got millions of records, is that looping+comparing efficient?

    Yes. Very. By way of example, this code generates $RECS records with $BITS A/V pairs and compares a randomly generated selection set and counts those that have a greater than $TARGET 'similarity':

    #! perl -slw use strict; use Time::HiRes qw[ time ]; sub randNbits { pack 'C*', map rand(256), 1 .. ( shift() / 8 )} our $TARGET //= 0.75; # 75% match our $RECS //= 1e6; our $BITS //= 1024; my @recs; $recs[ $_ ] = randNbits( $BITS ) for 0 .. $RECS-1; my $select = randNbits( $BITS ); my $count = 0; my $start = time; for my $rec ( @recs ) { my $score = unpack( '%32b*', $select & $rec ) / $BITS; ++$count if $score >= $TARGET; } printf "Comparing $RECS records on $BITS attributes to discover $count + matches took: %.9f s/rec\n", ( time() - $start ) / $RECS;

    A few runs show that it scales extremely well:

    C:\test>1062617 -RECS=1e6 -BITS=64 -TARGET=0.33 Comparing 1e6 records on 64 attributes to discover 121722 matches took +: 0.000001229 s/rec C:\test>1062617 -RECS=1e6 -BITS=256 -TARGET=0.33 Comparing 1e6 records on 256 attributes to discover 1200 matches took: + 0.000000760 s/rec C:\test>1062617 -RECS=1e6 -BITS=1024 -TARGET=0.33 Comparing 1e6 records on 1024 attributes to discover 0 matches took: 0 +.000000936 s/rec

    No traditional SQL-based RDBS could come even close to achieving sub 1 microsecond per record fuzzy selection for 1000+ fields across 1 million records.

  3. Any suggestions for a storage backend to implement that? Might be, there's one that offers bitmap-comparisons

    It really depends upon the way this will be used. If the application is a standalone, one-run-at-a-time (ie. non-web; non client-server) affair I might opt for something like BerkeleyDB for the A/V pairs to bit-position mapping; and a simple, binary flat file for the records.

    However, if this is going to be used via a web-based or other client-server archietecture where data concurrency becomes an issue, then I would probably look to an RDBMS that supports bitstrings. Eg. Postresql bitstrings & bit varying datatype.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^6: What is the best way to store and look-up a "similarity vector"?
by isync (Hermit) on Nov 14, 2013 at 20:48 UTC
    (Shaking my head in disbelief of your exhaustive help and impressive algorithm savviness...)
    BrowserUk, once again top discussion and help!

    Thank you very much. Both thumbs up!.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1062648]
[marto]: yay, Depeche Mode at the Barrowlands on Sunday. The best venue in Glasgow for music, despite it's looks :) Also the smallest gig DM will have played in a long time

How do I use this? | Other CB clients
Other Users?
Others taking refuge in the Monastery: (12)
As of 2017-03-24 11:28 GMT
Find Nodes?
    Voting Booth?
    Should Pluto Get Its Planethood Back?

    Results (301 votes). Check out past polls.