The first thing you need to do is define how you, as a human being, would judge the similarity of the sets.
For example, you start with a set (A), and you make an exact copy (B). You will (presumably) judge these as very similar.
- What if you remove one of the phrases? Are they still similar?
If the original set contains 100 phrases, and you remove phrases one at a time from the duplicate, does the similarity drop linearly?
- What if you reversed the words in all of the phrases in the second set? Is it still very similar, or completely dissimilar?
Is the ordering of the words within a phrase important?
- How about if you removed one word from each phrase in the second set?
Do the phrases need to be exactly the same to be counted as similar?
- How about if you looked up each word in a thesaurus and substituted the nearest alternative word? Similar? Dissimilar?
Are you looking for semantic similarity?
- How about if you misspelled every word by one character: an omission, an insertion, or a transposition? Similar? Dissimilar?
Can typos occur? Is it possible for you to correct them?
- How about if you reversed the ordering of the phrases in the second set? Similar? Dissimilar?
Are the sets ordered or unordered?
- What if one set consists entirely of "large blue woolen jumper" and the other of "Angora sweater, navy, XL"? Similar? Dissimilar?
Once you've decided how you would make the judgement, then you stand some chance of being able to lay out a set of rules. And once you have that, you can start to look for a good way to implement them.
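To make that concrete, here is a rough sketch of how a few possible answers to the questions above might turn into rules. It is only one set of choices, not a recommendation: the function names (`jaccard`, `normalise`, `fuzzy_similarity`), the 0.8 threshold, and the decision to treat sets as unordered are all assumptions made for illustration. It uses Python's standard `difflib` for typo tolerance; it does nothing about synonyms, so the thesaurus and "jumper" vs "sweater" cases would still score as dissimilar.

```python
from difflib import SequenceMatcher


def jaccard(a, b):
    """Exact-match similarity of two unordered sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)


def normalise(phrase, ignore_word_order=True):
    """Lower-case a phrase and, optionally, sort its words.

    Sorting the words encodes the answer "word order within a phrase
    does not matter"; pass ignore_word_order=False for the opposite rule.
    """
    words = phrase.lower().split()
    if ignore_word_order:
        words = sorted(words)
    return " ".join(words)


def fuzzy_similarity(a, b, threshold=0.8):
    """Fraction of phrases in `a` that have a near match in `b`.

    A near match is a pair whose SequenceMatcher ratio reaches `threshold`
    (an arbitrary value chosen here), which tolerates small typos.
    Note that this measure is not symmetric in `a` and `b`.
    """
    a = [normalise(p) for p in a]
    b = [normalise(p) for p in b]
    if not a:
        return 1.0 if not b else 0.0
    matched = sum(
        1 for p in a
        if any(SequenceMatcher(None, p, q).ratio() >= threshold for q in b)
    )
    return matched / len(a)
```

With the exact-match rule, removing k phrases from a copy of a 100-phrase set gives a score of (100 - k) / 100, so the similarity does drop linearly. Whether that is what you want is exactly the kind of judgement you have to make first.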