Re: Comparing sets of phrases stored in a database?

The first thing you need to do is define how you, as a human being, would judge the similarity of the sets.

For example, you start with a set (A), and you make an exact copy (B). You will (presumably) judge these as very similar.

What if you remove one of the phrases? Are they still similar?
If the original set contains 100 phrases, and you remove phrases one at a time from the duplicate, does the similarity drop linearly?
What if you reversed the words in all of the phrases in the second set? Is it still very similar, or completely dissimilar?
Is the ordering of the words within a phrase important?
How about if you removed one word from each phrase in the second set?
Do the phrases need to be exactly the same to be counted as similar?
How about if you looked up each word in a thesaurus and substituted the nearest alternative word? Similar? Dissimilar?
Are you looking for semantic similarity?
How about if you misspelled every word by one character -- an omission, an insertion, or a transposition? Similar? Dissimilar?
Can typos occur? Is it possible for you to correct them?
How about if you reversed the ordering of the phrases in the second set? Similar? Dissimilar?
Are the sets ordered or unordered?
What if one set consists entirely of "large blue woolen jumper" and the other of "Angora sweater, navy, XL"? Similar? Dissimilar?
Semantics again.

Once you've decided how you would make the judgement, then you stand some chance of being able to lay out a set of rules. And once you have that, you can start to look for a good way to implement them.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

RIP Neil Armstrong

Re^2: Comparing sets of phrases stored in a database? by BUU (Prior) on Sep 30, 2012 at 20:24 UTC
Fortunately I don't actually have to deal with any of that. My actual set of phrases will conform to a corpus of roughly 15,000 existing items, so there are no typos, misspellings or synonyms involved. While technically each item in the set is a phrase, for the purposes of this discussion it can be treated as a unique ID of any sort you prefer, but probably a number, probably generated by a hash function.
Re^3: Comparing sets of phrases stored in a database? by BrowserUk (Patriarch) on Sep 30, 2012 at 21:18 UTC
"My actual set of phrases will conform to a corpus of roughly 15,000 existing items, so there are no typos, misspellings or synonyms involved."

Then, I would approach the problem this way.

Store the corpus of phrases in its own table, each with a unique numeric value. Each set of phrases then becomes a bitfield with a 1-bit set in the appropriate position for each phrase that the set contains.

Your similarity can then be some heuristic based on the population counts (bit counts) of ANDing and XORing the two bitstrings that represent each set.

The population count of the result of ANDing two sets' bitstrings will tell you how many phrases they have in common; the population count of the result of XORing them will tell you how many phrases appear in one set but not the other.

You can then combine those two numbers mathematically to reflect whether the sharing of phrases is more important than having phrases not in common -- or vice versa -- and come up with a single number for each pairing that you can then apply a threshold value to.

You'd need a DB that supports bitstrings -- PostgreSQL and MySQL seem to -- and AND/XOR and popcount of bitstrings. I couldn't (from a quick look) see a popcount function, but (at least in the case of PgSQL) it should be a simple thing to add a PL/Perl function to do this using Perl's `$popcount = unpack '%b*', $bitstring;`

Food for thought perhaps.
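A minimal sketch of the bitstring idea in plain Perl, outside any database. The phrase-to-ID mapping, the helper names (`set_to_bits`, `popcount`, `similarity`) and the Jaccard-style score are assumptions made for illustration; the thread does not prescribe a particular way of combining the two counts. It uses `%32b*` rather than `%b*` only to widen the checksum from the default 16 bits.

```perl
use strict;
use warnings;

# Hypothetical corpus: phrase => numeric ID (used as the bit position).
# In practice this mapping would come from the corpus table.
my %phrase_id = (
    'large blue woolen jumper' => 0,
    'angora sweater'           => 1,
    'navy scarf'               => 2,
    'red hat'                  => 3,
    'green gloves'             => 4,
);
my $corpus_size = keys %phrase_id;

# Build a bitstring with one bit set per phrase in the set.
sub set_to_bits {
    my @phrases = @_;
    # Pre-size the string so every set's bitstring has the same length.
    my $bits = "\0" x int( ( $corpus_size + 7 ) / 8 );
    vec( $bits, $phrase_id{$_}, 1 ) = 1 for @phrases;
    return $bits;
}

# Population count: the number of 1-bits in the string.
sub popcount { return unpack '%32b*', $_[0] }

# Returns (shared, differing, score). The score is a Jaccard coefficient,
# which is only one possible way to combine the two counts.
sub similarity {
    my ( $x, $y ) = @_;
    my $shared = popcount( $x & $y );    # phrases in both sets
    my $differ = popcount( $x ^ $y );    # phrases in exactly one set
    my $union  = $shared + $differ;
    my $score  = $union ? $shared / $union : 1;
    return ( $shared, $differ, $score );
}

my $set_a = set_to_bits( 'large blue woolen jumper', 'angora sweater', 'navy scarf' );
my $set_b = set_to_bits( 'angora sweater', 'navy scarf', 'red hat' );

my ( $shared, $differ, $score ) = similarity( $set_a, $set_b );
printf "shared=%d differ=%d score=%.2f\n", $shared, $differ, $score;
```

Weighting shared phrases more heavily than differing ones (or the reverse) only means changing the last lines of `similarity`; the AND/XOR/popcount part stays the same.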
Re^4: Comparing sets of phrases stored in a database? by BUU (Prior) on Sep 30, 2012 at 21:41 UTC
I've been working on a slightly similar method and it produces pretty decent results. My only problem is when to do the comparisons. If I do it when I want a similarity, I have to compare against every single set in my database every time, which seems a little suboptimal. But if I compute it when I add a new item, then only the newly added items will be similar to the older items; the old items won't be similar to the new items. My dataset isn't giant though: I'll probably have somewhere between 5k and 15k sets, adding 100-200 a day. Maybe I'm over-optimizing.
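One possible way around the timing question, sketched under assumptions: because a score built from AND/OR population counts is symmetric, the comparison can be done once when a set is added and recorded in both directions, so older sets pick up the new one as well. The in-memory hashes `%sets` and `%similar`, the helper names and the 0.5 threshold are all hypothetical stand-ins for whatever tables and cut-off are actually used.

```perl
use strict;
use warnings;

# Population count: the number of 1-bits in the string.
sub popcount { return unpack '%32b*', $_[0] }

# Jaccard-style score; an assumed scoring rule, not one given in the thread.
sub score {
    my ( $x, $y ) = @_;
    my $shared = popcount( $x & $y );
    my $union  = popcount( $x | $y );
    return $union ? $shared / $union : 1;
}

my %sets;       # set ID => bitstring (stands in for the DB table)
my %similar;    # set ID => { other set ID => score }
my $THRESHOLD = 0.5;    # hypothetical cut-off

# Compare a new set against every stored set once, at insert time,
# and record each qualifying pair in both directions.
sub add_set {
    my ( $id, $bits ) = @_;
    for my $other ( keys %sets ) {
        my $s = score( $bits, $sets{$other} );
        next if $s < $THRESHOLD;
        $similar{$id}{$other} = $s;
        $similar{$other}{$id} = $s;
    }
    $sets{$id} = $bits;
    return;
}

# Build a bitstring from a list of phrase IDs.
sub bits_from_ids {
    my $bits = "\0" x 2;    # room for 16 phrase IDs in this toy example
    vec( $bits, $_, 1 ) = 1 for @_;
    return $bits;
}

add_set( 'old',   bits_from_ids( 1, 2, 3 ) );
add_set( 'new',   bits_from_ids( 2, 3, 4 ) );
add_set( 'other', bits_from_ids( 9, 10 ) );

# The older set now lists the newer one as similar.
print join( ', ', sort keys %{ $similar{old} } ), "\n";
```

With a 15,000-phrase corpus each bitstring is under 2 KB, so adding one set costs at most about 15k AND/OR/popcount passes over short strings, and at 100-200 additions a day that is unlikely to need further optimizing.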
Re^5: Comparing sets of phrases stored in a database? by BrowserUk (Patriarch) on Sep 30, 2012 at 23:34 UTC
Re^3: Comparing sets of phrases stored in a database? by remiah (Hermit) on Sep 30, 2012 at 21:12 UTC
What is similar for you, then?


	PerlMonks