<?xml version="1.0" encoding="windows-1252"?>
<node id="996543" title="Re^3: Comparing sets of phrases stored in a database?" created="2012-09-30 17:18:47" updated="2012-09-30 17:18:47">
<type id="11">
note</type>
<author id="171588">
BrowserUk</author>
<data>
<field name="doctext">
&lt;blockquote&gt;&lt;i&gt;
My actual set of phrases will conform to a corpus of roughly 15,000 existing items, so there are no typos, misspellings or synonyms involved.
&lt;/i&gt;&lt;/blockquote&gt;

&lt;p&gt;Then, I would approach the problem this way:

&lt;ol&gt;&lt;li&gt;Store the corpus of phrases in its own table, giving each phrase a unique numeric value.
&lt;/li&gt;&lt;li&gt;Each set of phrases then becomes a bitfield with a 1-bit set in the position corresponding to each phrase that the set contains.
&lt;/li&gt;&lt;li&gt;Your similarity can then be some heuristic based on the population counts (bit counts) of ANDing and XORing the two bitstrings that represent each set.

&lt;p&gt;The population count of the result of ANDing two sets' bitstrings tells you how many phrases they have in common.
&lt;p&gt;The population count of the result of XORing two sets' bitstrings tells you how many phrases appear in one set but not the other.

&lt;p&gt;You can then combine those two numbers arithmetically, weighting shared phrases more heavily than unshared ones -- or vice versa -- to come up with a single score for each pairing, to which you can then apply a threshold value.

&lt;/li&gt;&lt;/ol&gt;
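
&lt;p&gt;As a rough sketch of the steps above in plain Perl (no database involved): the four-phrase corpus, the sub names, and the choice of the Jaccard index as the combining formula are all my assumptions for illustration; any other weighting of the two counts would slot in the same way.

&lt;code&gt;
use strict;
use warnings;

# Hypothetical corpus; each phrase's array index serves as its bit position.
my @corpus = ( 'red apple', 'green pear', 'blue plum', 'ripe fig' );
my %phrase_id;
@phrase_id{@corpus} = 0 .. $#corpus;

# Build a packed bitstring with one bit set per phrase in the set.
sub bitstring_for {
    my @phrases = @_;
    # Pre-size so every set's bitstring has the same length.
    my $bits = "\0" x int( ( @corpus + 7 ) / 8 );
    vec( $bits, $phrase_id{$_}, 1 ) = 1 for @phrases;
    return $bits;
}

# Population count of a packed bitstring.
sub popcount { return unpack '%32b*', $_[0] }

my $set_a = bitstring_for( 'red apple',  'green pear', 'blue plum' );
my $set_b = bitstring_for( 'green pear', 'blue plum',  'ripe fig' );

my $common    = popcount( $set_a &amp; $set_b );    # phrases in both sets
my $differing = popcount( $set_a ^ $set_b );    # phrases in exactly one set

# One possible combination (an assumption): the Jaccard index,
# common / (common + differing), guarded against two empty sets.
my $total = $common + $differing;
my $score = $total ? $common / $total : 0;

printf "common=%d differing=%d score=%.2f\n", $common, $differing, $score;
&lt;/code&gt;

&lt;p&gt;With these two sets that prints common=2 differing=2 score=0.50. Both operands are plain strings, so Perl applies the AND and XOR bytewise across the whole bitstring.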

&lt;p&gt;You'd need a DB that supports bitstrings -- PostgreSQL and MySQL seem to -- along with AND, XOR and popcount operations on them. I couldn't (from a quick look) see a popcount function, but (at least in the case of PostgreSQL) it should be a simple thing to add a PL/Perl function to do this using Perl's &lt;code&gt;
$popcount = unpack '%b*', $bitstring;
&lt;/code&gt;
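
&lt;p&gt;A small standalone sketch of how that checksum idiom behaves, assuming the bitstring is packed bytes as built by vec. Two hedges: a bare '%' keeps a 16-bit checksum, so '%32b*' is the safer spelling if a bitstring could ever have 65,536 or more bits set; and if the database hands the function the bitstring as text made of '0' and '1' characters (I haven't verified which form PL/Perl receives), counting the '1' characters with tr gives the same answer.

&lt;code&gt;
use strict;
use warnings;

# Packed bitstring with bits 0, 3 and 9 set (hypothetical phrase values).
my $bitstring = '';
vec( $bitstring, $_, 1 ) = 1 for 0, 3, 9;

# '%32b*' sums the individual bits into a 32-bit checksum: a population count.
my $popcount = unpack '%32b*', $bitstring;

# The same bits rendered as text, lowest bit of each byte first.
my $as_text = unpack 'b*', $bitstring;

# Counting the '1' characters is the text-form equivalent.
my $text_count = ( $as_text =~ tr/1// );

print "$popcount $text_count $as_text\n";
&lt;/code&gt;

&lt;p&gt;That prints 3 3 1001000001000000.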


&lt;p&gt;Food for thought perhaps.


&lt;div class="pmsig"&gt;&lt;div class="pmsig-171588"&gt;
&lt;hr /&gt;
&lt;font size=1 &gt;
&lt;div&gt;With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'&lt;/div&gt;
&lt;div&gt;Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.&lt;/div&gt;
&lt;div&gt;"Science is about questioning the status quo. Questioning authority". &lt;/div&gt;
&lt;div&gt;In the absence of evidence, opinion is indistinguishable from prejudice.
&lt;p align=right&gt; [http://thebottomline.cpaaustralia.com.au/|RIP Neil Armstrong]&lt;/p&gt;&lt;/div&gt;
&lt;/font&gt;

&lt;/div&gt;&lt;/div&gt;</field>
<field name="root_node">
996530</field>
<field name="parent_node">
996535</field>
</data>
</node>
