Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Re^4: Comparing sets of phrases stored in a database?

by BUU (Prior)
on Sep 30, 2012 at 21:41 UTC ( #996545=note: print w/ replies, xml ) Need Help??


in reply to Re^3: Comparing sets of phrases stored in a database?
in thread Comparing sets of phrases stored in a database?

I've been working on a slightly similar method and it produces pretty decent results. My only problem is when do I do the comparisons? If I do it when I want a similarity I have to compare against every single set in my database every time, which seems a little suboptimal. But if I compute it when I add a new item then only the newly added items will be similar to the older items, the old items won't be similar to the new items.

My dataset isn't giant though, I'll probably have somewhere between 5k-15k sets and adding 100-200 a day. Maybe I'm over optimizing.


Comment on Re^4: Comparing sets of phrases stored in a database?
Re^5: Comparing sets of phrases stored in a database?
by BrowserUk (Pope) on Sep 30, 2012 at 23:34 UTC
    If I do it when I want a similarity I have to compare against every single set in my database every time, which seems a little suboptimal.

    The beauty of using bitstrings is that the comparisons are very fast. For example, the following compares each of 15,000 bitstrings against the sample input in less that 1/10th of a second:

    #! perl -slw use strict; use Data::Dump qw[ pp ]; $Data::Dump::WIDTH = 1000; use Time::HiRes qw[ time ]; sub comp { my( $a, $b ) = @_; my $total = unpack( '%32b*', $a | $b ); return ( unpack( '%32b*', $a & $b ) / $total ) - ( unpack( '%32b*', $a ^ $b ) / $total ); } our $N //= 20; my @sets = map pack( 'Q', int rand 2**64 ), 1 .. $N; our $S //= '0101010101010101010101010101010101010101010101010101010101 +010101'; $S = pack 'b*', $S; our $T //= 0.8; my $start = time; print 'Input: ', unpack 'b*', $S; for ( 0 .. $#sets ) { my $score = comp( $sets[ $_ ], $S ); if( $score > $T ) { printf "%8.6f: %s\n", $score, unpack 'b*', $sets[ $_ ]; } } printf "Comparing 1 against $N sets, took: %.6f seconds\n", time() - $ +start; __END__ C:\test>996530 -N=15000 -S=1111111111111111111111111111111100000000000 +000000000000000000000 -T=0.25 Input: 111111111111111111111111111111110000000000000000000000000000 +0000 0.280000: 111111111111111111111111111111111111111111111111100000000000 +0001 0.254902: 111111111111111111111111111111111111111111111111100000000010 +0001 0.254902: 111111111111111111111111111111111111111111111111100000000000 +1001 0.254902: 111111111111111111111111111111111111111111111111101000000000 +0001 0.254902: 111111111111111111111111111111111111111111111111100000000000 +1001 0.254902: 111111111111111111111111111111111111111111111111100000000100 +0001 Comparing 1 against 15000 sets, took: 0.069085 seconds

    The bitstrings used can present a corpus of up to 64 phrases -- just for convenience -- but even if you multiply the size of the corpus by 10:

    #! perl -slw use strict; use Data::Dump qw[ pp ]; $Data::Dump::WIDTH = 1000; use Time::HiRes qw[ time ]; sub comp { my( $a, $b ) = @_; my $total = unpack( '%32b*', $a | $b ); return ( unpack( '%32b*', $a & $b ) / $total ) - ( unpack( '%32b*', $a ^ $b ) / $total ); } our $N //= 20; my @sets = map{ pack( 'Q', int rand 2**64 ) x 10 } 1 .. $N; our $S //= '0101010101010101010101010101010101010101010101010101010101 +010101'; $S = pack( 'b*', $S ) x 10; our $T //= 0.8; my $start = time; print 'input: ', unpack 'b*', $S; for ( 0 .. $#sets ) { my $score = comp( $sets[ $_ ], $S ); if( $score > $T ) { printf "%8.6f: %s\n", $score, unpack 'b*', $sets[ $_ ]; } } printf "Comparing 1 against $N sets, took: %.6f seconds\n", time() - $ +start; __END__ C:\test>996530 -N=15000 -T=0.1 input: 010101010101010101010101010101010101010101010101010101010101 +01010101010101010101 010101010101010101010101010101010101010101010101010101010101 +01010101010101010101 010101010101010101010101010101010101010101010101010101010101 +01010101010101010101 010101010101010101010101010101010101010101010101010101010101 +01010101010101010101 010101010101010101010101010101010101010101010101010101010101 +01010101010101010101 010101010101010101010101010101010101010101010101010101010101 +01010101010101010101 010101010101010101010101010101010101010101010101010101010101 +01010101010101010101 010101010101010101010101010101010101010101010101010101010101 +01010101010101010101 0.103448: 111111111111111111111111111111111111111111111111101010111010 +10111111111111111111 111111111111111111111111111111111010101110101011111111111111 +11111111111111111111 111111111111111110101011101010111111111111111111111111111111 +11111111111111111111 101010111010101111111111111111111111111111111111111111111111 +11111010101110101011 111111111111111111111111111111111111111111111111101010111010 +10111111111111111111 111111111111111111111111111111111010101110101011111111111111 +11111111111111111111 111111111111111110101011101010111111111111111111111111111111 +11111111111111111111 101010111010101111111111111111111111111111111111111111111111 +11111010101110101011 0.103448: 111111111111111111111111111111111111111111111111110111010101 +01011111111111111111 111111111111111111111111111111111101110101010101111111111111 +11111111111111111111 111111111111111111011101010101011111111111111111111111111111 +11111111111111111111 110111010101010111111111111111111111111111111111111111111111 +11111101110101010101 111111111111111111111111111111111111111111111111110111010101 +01011111111111111111 111111111111111111111111111111111101110101010101111111111111 +11111111111111111111 111111111111111111011101010101011111111111111111111111111111 +11111111111111111111 110111010101010111111111111111111111111111111111111111111111 +11111101110101010101 0.122807: 111111111111111111111111111111111111111111111111110101010101 +01011111111111111111 111111111111111111111111111111111101010101010101111111111111 +11111111111111111111 111111111111111111010101010101011111111111111111111111111111 +11111111111111111111 110101010101010111111111111111111111111111111111111111111111 +11111101010101010101 111111111111111111111111111111111111111111111111110101010101 +01011111111111111111 111111111111111111111111111111111101010101010101111111111111 +11111111111111111111 111111111111111111010101010101011111111111111111111111111111 +11111111111111111111 110101010101010111111111111111111111111111111111111111111111 +11111101010101010101 Comparing 1 against 15000 sets, took: 0.069599 seconds

    The time required barely changes: 0.065085 .v. 0.069599, despite doing 10 times the amount of work.

    Used with imagination, bitstrings are the best kept secret of of data processing.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    RIP Neil Armstrong

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://996545]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (9)
As of 2014-12-18 11:56 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (51 votes), past polls