
word similarity measure

by karey3341 (Initiate)
on Feb 27, 2009 at 16:18 UTC ( #746935=perlquestion )
karey3341 has asked for the wisdom of the Perl Monks concerning the following question:

If my data looks like this:

word 1: 100 101 101 102 102 102 106 106

word 2: 101 104 106 110 113 129 131 148

word 3: 101 153 175 180 381

word 4: 106 110 113 122 131 137 142 148

word 5: 120 165 169

Where word 1, 2, 3, 4, and 5 represent different words, and the numbers identify the papers in which those words have been used as keywords.

How can I calculate similarity between these words?

Replies are listed 'Best First'.
Re: word similarity measure
by tilly (Archbishop) on Feb 27, 2009 at 16:32 UTC

      Or perhaps Vector Space

      This module takes a list of documents (in English) and builds a simple in-memory search engine using a vector space model. Documents are stored as PDL objects, and after the initial indexing phase, the search should be very fast. This implementation applies a rudimentary stop list to filter out very common words, and uses a cosine measure to calculate document similarity.
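The cosine measure mentioned above can also be computed directly on the OP's data, without the full module. Below is a minimal sketch (not the module's code) that treats each word as a binary vector over paper IDs; the variable names are illustrative, and the duplicate IDs in word 1's list are collapsed to a set first:

```perl
use strict;
use warnings;

# Paper-ID sets taken from the OP's data (duplicates removed).
my %papers = (
    word1 => [100, 101, 102, 106],
    word2 => [101, 104, 106, 110, 113, 129, 131, 148],
);

# Cosine similarity of two binary vectors: |A and B| / sqrt(|A| * |B|)
sub cosine {
    my ($a, $b) = @_;
    my %in_a   = map { $_ => 1 } @$a;
    my $common = grep { $in_a{$_} } @$b;   # scalar grep counts matches
    return $common / sqrt(@$a * @$b);
}

printf "cosine(word1, word2) = %.3f\n",
    cosine( $papers{word1}, $papers{word2} );
```

Here words 1 and 2 share papers 101 and 106, giving 2 / sqrt(4 * 8), roughly 0.354.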

Re: word similarity measure
by Old_Gray_Bear (Bishop) on Feb 27, 2009 at 16:56 UTC
    You need to define what you mean by "similarity".

    At first glance, words 1, 2, and 4 are 'similar' since they each have the same number of sub-components. A second glance reveals that words 1, 2, and 3 are 'similar': they each contain '101'. And words 2 and 4 are 'similar': they are the only words that contain both 131 and 148.

    I suspect that once you have defined your terms, you will be able to write a function that takes two words and returns the degree of 'similarity' between them. Once you have all of the pair-wise ratings computed, sort() will let you rank the word pairs from most alike to least.

    This sounds like the kind of problem a plagiarism detector is designed for.

    I Go Back to Sleep, Now.
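      One concrete way to follow this advice: pick the Jaccard index (intersection size over union size) as the assumed definition of 'similarity', compute it for every pair, and sort the pairs by score. This is only a sketch under that assumed definition, using the OP's data with duplicates removed:

```perl
use strict;
use warnings;

# word ID => list of paper IDs (duplicates removed), from the OP's data.
my %word = (
    1 => [100, 101, 102, 106],
    2 => [101, 104, 106, 110, 113, 129, 131, 148],
    3 => [101, 153, 175, 180, 381],
    4 => [106, 110, 113, 122, 131, 137, 142, 148],
    5 => [120, 165, 169],
);

# Jaccard index: |A and B| / |A or B|
sub jaccard {
    my ($a, $b) = @_;
    my %union  = map { $_ => 1 } @$a;
    my $common = grep { $union{$_} } @$b;  # B elements already seen in A
    $union{$_} = 1 for @$b;                # now %union holds A or B
    return $common / keys %union;
}

# Compute all pairwise scores, then rank from most alike to least.
my @pairs;
my @ids = sort { $a <=> $b } keys %word;
for my $i ( 0 .. $#ids ) {
    for my $j ( $i + 1 .. $#ids ) {
        push @pairs, [ $ids[$i], $ids[$j],
                       jaccard( $word{ $ids[$i] }, $word{ $ids[$j] } ) ];
    }
}
printf "word %d vs word %d: %.3f\n", @$_
    for sort { $b->[2] <=> $a->[2] } @pairs;
```

      With this measure, words 2 and 4 come out on top (5 shared papers out of 11 distinct), matching the 'second glance' observation above.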


      This sounds like the kind of problem a plagiarism detector is designed for.

      If, in fact, that is what the OP is after, s/he may benefit from looking at the nodes mentioned here: Re: Finding plagarized content.

      Update: I rather think, OTOH, that the OP may be looking for something more like Ted Pedersen's SenseClusters (more)...


Re: word similarity measure
by kennethk (Abbot) on Feb 27, 2009 at 16:33 UTC

    Ignoring subtleties about how you may have developed your keyword->index mapping, the easiest way to measure the similarity would be to generate a hash with your word identifiers as keys and then brute force a similarity array. Something like:

    my @count;
    for my $i_word (1 .. $#words) {
        for my $j_word (0 .. $i_word - 1) {
            $count[$i_word][$j_word] = 0;
            foreach (keys %{ $paper{$i_word} }) {
                if (exists $paper{$j_word}{$_}) {
                    $count[$i_word][$j_word]++;
                }
            }
        }
    }

    If you aren't familiar with lists of lists, take a gander at perllol.
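    For completeness, here is one way to build the %paper hash-of-hashes that snippet expects; the parsing assumes input lines shaped exactly like the OP's data, and the names are illustrative:

```perl
use strict;
use warnings;

# Build $paper{$word}{$paper_id} = 1 from lines like "word 1: 100 101 ...".
my %paper;
while ( my $line = <DATA> ) {
    my ( $word, $ids ) = $line =~ /^word\s+(\d+):\s+(.*)/ or next;
    $paper{$word}{$_} = 1 for split ' ', $ids;
}

# Example: papers shared by word 1 and word 2.
my @shared = grep { exists $paper{2}{$_} } keys %{ $paper{1} };
print "shared: @{[ sort @shared ]}\n";

__DATA__
word 1: 100 101 101 102 102 102 106 106
word 2: 101 104 106 110 113 129 131 148
```

    Using a hash of hashes makes the inner membership test a single exists() lookup instead of a scan over a list.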

Re: word similarity measure
by bruno (Friar) on Feb 27, 2009 at 16:32 UTC
    Do you know the words, or are they unknown to you?

    Do you assume that, because two words appeared in the same paper, they must be identical or similar in some way?

    What do you mean by 'keywords', and 'papers'? Do they have the same meaning as in scientific research, where 'paper' is a scientific article and 'keyword' is a research subject, organism of interest, or method?

Node Type: perlquestion [id://746935]
Approved by Old_Gray_Bear