Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??
If I do it when I want a similarity I have to compare against every single set in my database every time, which seems a little suboptimal.

The beauty of using bitstrings is that the comparisons are very fast. For example, the following compares each of 15,000 bitstrings against the sample input in less that 1/10th of a second:

#! perl -slw use strict; use Data::Dump qw[ pp ]; $Data::Dump::WIDTH = 1000; use Time::HiRes qw[ time ]; sub comp { my( $a, $b ) = @_; my $total = unpack( '%32b*', $a | $b ); return ( unpack( '%32b*', $a & $b ) / $total ) - ( unpack( '%32b*', $a ^ $b ) / $total ); } our $N //= 20; my @sets = map pack( 'Q', int rand 2**64 ), 1 .. $N; our $S //= '0101010101010101010101010101010101010101010101010101010101 +010101'; $S = pack 'b*', $S; our $T //= 0.8; my $start = time; print 'Input: ', unpack 'b*', $S; for ( 0 .. $#sets ) { my $score = comp( $sets[ $_ ], $S ); if( $score > $T ) { printf "%8.6f: %s\n", $score, unpack 'b*', $sets[ $_ ]; } } printf "Comparing 1 against $N sets, took: %.6f seconds\n", time() - $ +start; __END__ C:\test>996530 -N=15000 -S=1111111111111111111111111111111100000000000 +000000000000000000000 -T=0.25 Input: 111111111111111111111111111111110000000000000000000000000000 +0000 0.280000: 111111111111111111111111111111111111111111111111100000000000 +0001 0.254902: 111111111111111111111111111111111111111111111111100000000010 +0001 0.254902: 111111111111111111111111111111111111111111111111100000000000 +1001 0.254902: 111111111111111111111111111111111111111111111111101000000000 +0001 0.254902: 111111111111111111111111111111111111111111111111100000000000 +1001 0.254902: 111111111111111111111111111111111111111111111111100000000100 +0001 Comparing 1 against 15000 sets, took: 0.069085 seconds

The bitstrings used can present a corpus of up to 64 phrases -- just for convenience -- but even if you multiply the size of the corpus by 10:

#! perl -slw use strict; use Data::Dump qw[ pp ]; $Data::Dump::WIDTH = 1000; use Time::HiRes qw[ time ]; sub comp { my( $a, $b ) = @_; my $total = unpack( '%32b*', $a | $b ); return ( unpack( '%32b*', $a & $b ) / $total ) - ( unpack( '%32b*', $a ^ $b ) / $total ); } our $N //= 20; my @sets = map{ pack( 'Q', int rand 2**64 ) x 10 } 1 .. $N; our $S //= '0101010101010101010101010101010101010101010101010101010101 +010101'; $S = pack( 'b*', $S ) x 10; our $T //= 0.8; my $start = time; print 'input: ', unpack 'b*', $S; for ( 0 .. $#sets ) { my $score = comp( $sets[ $_ ], $S ); if( $score > $T ) { printf "%8.6f: %s\n", $score, unpack 'b*', $sets[ $_ ]; } } printf "Comparing 1 against $N sets, took: %.6f seconds\n", time() - $ +start; __END__ C:\test>996530 -N=15000 -T=0.1 input: 010101010101010101010101010101010101010101010101010101010101 +01010101010101010101 010101010101010101010101010101010101010101010101010101010101 +01010101010101010101 010101010101010101010101010101010101010101010101010101010101 +01010101010101010101 010101010101010101010101010101010101010101010101010101010101 +01010101010101010101 010101010101010101010101010101010101010101010101010101010101 +01010101010101010101 010101010101010101010101010101010101010101010101010101010101 +01010101010101010101 010101010101010101010101010101010101010101010101010101010101 +01010101010101010101 010101010101010101010101010101010101010101010101010101010101 +01010101010101010101 0.103448: 111111111111111111111111111111111111111111111111101010111010 +10111111111111111111 111111111111111111111111111111111010101110101011111111111111 +11111111111111111111 111111111111111110101011101010111111111111111111111111111111 +11111111111111111111 101010111010101111111111111111111111111111111111111111111111 +11111010101110101011 111111111111111111111111111111111111111111111111101010111010 +10111111111111111111 111111111111111111111111111111111010101110101011111111111111 +11111111111111111111 111111111111111110101011101010111111111111111111111111111111 +11111111111111111111 101010111010101111111111111111111111111111111111111111111111 +11111010101110101011 0.103448: 111111111111111111111111111111111111111111111111110111010101 +01011111111111111111 111111111111111111111111111111111101110101010101111111111111 +11111111111111111111 111111111111111111011101010101011111111111111111111111111111 +11111111111111111111 110111010101010111111111111111111111111111111111111111111111 +11111101110101010101 111111111111111111111111111111111111111111111111110111010101 +01011111111111111111 111111111111111111111111111111111101110101010101111111111111 +11111111111111111111 111111111111111111011101010101011111111111111111111111111111 +11111111111111111111 110111010101010111111111111111111111111111111111111111111111 +11111101110101010101 0.122807: 111111111111111111111111111111111111111111111111110101010101 +01011111111111111111 111111111111111111111111111111111101010101010101111111111111 +11111111111111111111 111111111111111111010101010101011111111111111111111111111111 +11111111111111111111 110101010101010111111111111111111111111111111111111111111111 +11111101010101010101 111111111111111111111111111111111111111111111111110101010101 +01011111111111111111 111111111111111111111111111111111101010101010101111111111111 +11111111111111111111 111111111111111111010101010101011111111111111111111111111111 +11111111111111111111 110101010101010111111111111111111111111111111111111111111111 +11111101010101010101 Comparing 1 against 15000 sets, took: 0.069599 seconds

The time required barely changes: 0.065085 .v. 0.069599, despite doing 10 times the amount of work.

Used with imagination, bitstrings are the best kept secret of of data processing.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

RIP Neil Armstrong


In reply to Re^5: Comparing sets of phrases stored in a database? by BrowserUk
in thread Comparing sets of phrases stored in a database? by BUU

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others about the Monastery: (7)
    As of 2014-10-22 01:34 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      For retirement, I am banking on:










      Results (112 votes), past polls