PerlMonks
Re^2: Comparing sets of phrases stored in a database?

by BUU (Prior)
on Sep 30, 2012 at 20:24 UTC ( #996535 )


in reply to Re: Comparing sets of phrases stored in a database?
in thread Comparing sets of phrases stored in a database?

Fortunately I don't actually have to deal with any of that. My actual set of phrases will conform to a corpus of roughly 15,000 existing items, so there are no typos, misspellings or synonyms involved.

While technically each item in the set is a phrase, for the purposes of this discussion it can be treated as a unique ID of any sort you prefer, but probably a number, probably generated by a hash function.


Re^3: Comparing sets of phrases stored in a database?
by remiah (Hermit) on Sep 30, 2012 at 21:12 UTC

    What is *similar* for you, then?

Re^3: Comparing sets of phrases stored in a database?
by BrowserUk (Pope) on Sep 30, 2012 at 21:18 UTC
    My actual set of phrases will conform to a corpus of roughly 15,000 existing items, so there are no typos, misspellings or synonyms involved.

    Then, I would approach the problem this way.

    1. Store the corpus of phrases in its own table each with a unique numeric value.
    2. Each set of phrases then becomes a bitfield, with a 1-bit set at the appropriate position for each phrase the set contains.
    3. Your similarity measure can then be some heuristic based on the population counts (bit counts) of ANDing and XORing the two bitstrings that represent each pair of sets.

      The population count of the result of ANDing two sets' bitstrings tells you how many phrases they have in common;

      The population count of the result of XORing two sets' bitstrings tells you how many phrases appear in one but not the other.

      You can then combine those two numbers mathematically to reflect whether the sharing of phrases is more important than having phrases not in common -- or vice versa -- and come up with a single number for each pairing that you can then apply a threshold value to.
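    A minimal sketch of those steps in Perl (the corpus phrases and ids here are invented purely for illustration; a real corpus would size the bitstring to ceil( corpus_size / 8 ) bytes):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Step 1: each corpus phrase gets a unique numeric id (its bit position).
my %phrase_id = (
    'red herring' => 0,
    'wild goose'  => 1,
    'dark horse'  => 2,
    'cold feet'   => 3,
);

# Step 2: a set of phrases becomes a bitstring, one bit per corpus phrase.
sub set_to_bits {
    my $bits = "\0";                    # one byte is enough for this toy corpus
    vec( $bits, $phrase_id{$_}, 1 ) = 1 for @_;
    return $bits;
}

my $set_a = set_to_bits( 'red herring', 'wild goose', 'dark horse' );
my $set_b = set_to_bits( 'wild goose', 'dark horse', 'cold feet' );

# Step 3: population counts of the AND (shared) and XOR (unshared) results.
my $shared   = unpack '%32b*', $set_a & $set_b;   # phrases in both sets
my $unshared = unpack '%32b*', $set_a ^ $set_b;   # phrases in exactly one set
print "shared=$shared unshared=$unshared\n";      # shared=2 unshared=2
```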

    You'd need a DB that supports bitstrings -- PostgreSQL and MySQL seem to -- along with AND/XOR and popcount operations on them. From a quick look I couldn't see a popcount function, but (at least in the case of PostgreSQL) it should be a simple thing to add a PL/Perl function that does it using Perl's

    $popcount = unpack '%32b*', $bitstring;
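    For what it's worth, a PL/Perl wrapper for that one-liner might look something like this (an untested sketch; the function name popcount and the varbit signature are my own invention, not anything built in):

```sql
-- Hypothetical PL/Perl popcount for PostgreSQL BIT VARYING values.
-- PL/Perl receives the argument in its text form -- a string of '0'
-- and '1' characters -- so we repack it into a real bitstring first.
CREATE OR REPLACE FUNCTION popcount( bits varbit ) RETURNS integer AS $$
    my $bits = shift;                      # e.g. '1101'
    return unpack '%32b*', pack 'b*', $bits;
$$ LANGUAGE plperl IMMUTABLE STRICT;

-- SELECT popcount( B'1101' );  -- expected: 3
```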

    Food for thought perhaps.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    RIP Neil Armstrong

      I've been working on a slightly similar method and it produces pretty decent results. My only problem is deciding when to do the comparisons. If I compute similarity on demand, I have to compare against every single set in my database every time, which seems a little suboptimal. But if I compute it when I add a new item, then only the newly added item gets linked to the older items; the old items won't be linked to the new ones.

      My dataset isn't giant though; I'll probably have somewhere between 5k-15k sets, adding 100-200 a day. Maybe I'm over-optimizing.
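      One way around the one-directional problem is to score each new set against the existing ones once, at insert time, and record every qualifying pair under BOTH ids, so lookups for old sets see new neighbours too. A toy in-memory sketch (the threshold and the two sample bitstrings are invented for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %similar;    # id => { other_id => score }
my @sets;       # id => bitstring

# Score a new set against every stored set; store the pair symmetrically.
sub add_set {
    my( $bits, $threshold ) = @_;
    my $new_id = @sets;
    for my $old_id ( 0 .. $#sets ) {
        my $total = unpack '%32b*', $bits | $sets[$old_id];
        next unless $total;
        my $score = ( unpack( '%32b*', $bits & $sets[$old_id] ) / $total )
                  - ( unpack( '%32b*', $bits ^ $sets[$old_id] ) / $total );
        if( $score > $threshold ) {
            $similar{$new_id}{$old_id} = $score;   # new set sees old set
            $similar{$old_id}{$new_id} = $score;   # old set sees new set too
        }
    }
    push @sets, $bits;
    return $new_id;
}

my $id1 = add_set( pack( 'b*', '11110000' ), 0.2 );
my $id2 = add_set( pack( 'b*', '11100000' ), 0.2 );
print "id1 sees id2: ", $similar{$id1}{$id2} // 'none', "\n";   # 0.5
print "id2 sees id1: ", $similar{$id2}{$id1} // 'none', "\n";   # 0.5
```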
        If I do it when I want a similarity I have to compare against every single set in my database every time, which seems a little suboptimal.

        The beauty of using bitstrings is that the comparisons are very fast. For example, the following compares each of 15,000 bitstrings against the sample input in less than 1/10th of a second:

        #! perl -slw
        use strict;
        use Data::Dump qw[ pp ];
        $Data::Dump::WIDTH = 1000;
        use Time::HiRes qw[ time ];

        sub comp {
            my( $a, $b ) = @_;
            my $total = unpack( '%32b*', $a | $b );
            return ( unpack( '%32b*', $a & $b ) / $total )
                 - ( unpack( '%32b*', $a ^ $b ) / $total );
        }

        our $N //= 20;
        my @sets = map pack( 'Q', int rand 2**64 ), 1 .. $N;

        our $S //= '0101010101010101010101010101010101010101010101010101010101010101';
        $S = pack 'b*', $S;

        our $T //= 0.8;

        my $start = time;
        print 'Input: ', unpack 'b*', $S;
        for ( 0 .. $#sets ) {
            my $score = comp( $sets[ $_ ], $S );
            if( $score > $T ) {
                printf "%8.6f: %s\n", $score, unpack 'b*', $sets[ $_ ];
            }
        }
        printf "Comparing 1 against $N sets, took: %.6f seconds\n", time() - $start;
        __END__
        C:\test>996530 -N=15000 -S=1111111111111111111111111111111100000000000000000000000000000000 -T=0.25
        Input:    1111111111111111111111111111111100000000000000000000000000000000
        0.280000: 1111111111111111111111111111111111111111111111111000000000000001
        0.254902: 1111111111111111111111111111111111111111111111111000000000100001
        0.254902: 1111111111111111111111111111111111111111111111111000000000001001
        0.254902: 1111111111111111111111111111111111111111111111111010000000000001
        0.254902: 1111111111111111111111111111111111111111111111111000000000001001
        0.254902: 1111111111111111111111111111111111111111111111111000000001000001
        Comparing 1 against 15000 sets, took: 0.069085 seconds

        The bitstrings used can represent a corpus of up to 64 phrases -- a size chosen purely for convenience -- but even if you multiply the size of the corpus by 10:

        #! perl -slw
        use strict;
        use Data::Dump qw[ pp ];
        $Data::Dump::WIDTH = 1000;
        use Time::HiRes qw[ time ];

        sub comp {
            my( $a, $b ) = @_;
            my $total = unpack( '%32b*', $a | $b );
            return ( unpack( '%32b*', $a & $b ) / $total )
                 - ( unpack( '%32b*', $a ^ $b ) / $total );
        }

        our $N //= 20;
        my @sets = map{ pack( 'Q', int rand 2**64 ) x 10 } 1 .. $N;

        our $S //= '0101010101010101010101010101010101010101010101010101010101010101';
        $S = pack( 'b*', $S ) x 10;

        our $T //= 0.8;

        my $start = time;
        print 'input: ', unpack 'b*', $S;
        for ( 0 .. $#sets ) {
            my $score = comp( $sets[ $_ ], $S );
            if( $score > $T ) {
                printf "%8.6f: %s\n", $score, unpack 'b*', $sets[ $_ ];
            }
        }
        printf "Comparing 1 against $N sets, took: %.6f seconds\n", time() - $start;
        __END__
        C:\test>996530 -N=15000 -T=0.1
        input:    01010101010101010101010101010101010101010101010101010101010101010101010101010101
                  01010101010101010101010101010101010101010101010101010101010101010101010101010101
                  01010101010101010101010101010101010101010101010101010101010101010101010101010101
                  01010101010101010101010101010101010101010101010101010101010101010101010101010101
                  01010101010101010101010101010101010101010101010101010101010101010101010101010101
                  01010101010101010101010101010101010101010101010101010101010101010101010101010101
                  01010101010101010101010101010101010101010101010101010101010101010101010101010101
                  01010101010101010101010101010101010101010101010101010101010101010101010101010101
        0.103448: 11111111111111111111111111111111111111111111111110101011101010111111111111111111
                  11111111111111111111111111111111101010111010101111111111111111111111111111111111
                  11111111111111111010101110101011111111111111111111111111111111111111111111111111
                  10101011101010111111111111111111111111111111111111111111111111111010101110101011
                  11111111111111111111111111111111111111111111111110101011101010111111111111111111
                  11111111111111111111111111111111101010111010101111111111111111111111111111111111
                  11111111111111111010101110101011111111111111111111111111111111111111111111111111
                  10101011101010111111111111111111111111111111111111111111111111111010101110101011
        0.103448: 11111111111111111111111111111111111111111111111111011101010101011111111111111111
                  11111111111111111111111111111111110111010101010111111111111111111111111111111111
                  11111111111111111101110101010101111111111111111111111111111111111111111111111111
                  11011101010101011111111111111111111111111111111111111111111111111101110101010101
                  11111111111111111111111111111111111111111111111111011101010101011111111111111111
                  11111111111111111111111111111111110111010101010111111111111111111111111111111111
                  11111111111111111101110101010101111111111111111111111111111111111111111111111111
                  11011101010101011111111111111111111111111111111111111111111111111101110101010101
        0.122807: 11111111111111111111111111111111111111111111111111010101010101011111111111111111
                  11111111111111111111111111111111110101010101010111111111111111111111111111111111
                  11111111111111111101010101010101111111111111111111111111111111111111111111111111
                  11010101010101011111111111111111111111111111111111111111111111111101010101010101
                  11111111111111111111111111111111111111111111111111010101010101011111111111111111
                  11111111111111111111111111111111110101010101010111111111111111111111111111111111
                  11111111111111111101010101010101111111111111111111111111111111111111111111111111
                  11010101010101011111111111111111111111111111111111111111111111111101010101010101
        Comparing 1 against 15000 sets, took: 0.069599 seconds

        The time required barely changes: 0.069085 vs. 0.069599, despite doing 10 times the amount of work.

        Used with imagination, bitstrings are the best kept secret of data processing.


