PerlMonks
word similarity measure

by karey3341 (Initiate)
on Feb 27, 2009 at 16:18 UTC ( #746935=perlquestion )
karey3341 has asked for the wisdom of the Perl Monks concerning the following question:


If my data looks like this:

word 1: 100 101 101 102 102 102 106 106

word 2: 101 104 106 110 113 129 131 148

word 3: 101 153 175 180 381

word 4: 106 110 113 122 131 137 142 148

word 5: 120 165 169


Where word 1, 2, 3, 4, 5 represent different words, and the numbers represent the papers in which those words have been used as keywords.

How can I calculate similarity between these words?

Re: word similarity measure
by tilly (Archbishop) on Feb 27, 2009 at 16:32 UTC

      Or perhaps Vector Space

      This module takes a list of documents (in English) and builds a simple in-memory search engine using a vector space model. Documents are stored as PDL objects, and after the initial indexing phase, the search should be very fast. This implementation applies a rudimentary stop list to filter out very common words, and uses a cosine measure to calculate document similarity.
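To make the cosine measure mentioned above concrete, here is a minimal Perl sketch (not the module itself, and all names are illustrative) applied to two of the words from the question, treating each word's paper list as a frequency vector over paper IDs:

```perl
use strict;
use warnings;

# Paper lists for word 1 and word 2, taken from the question.
my %word = (
    1 => [qw(100 101 101 102 102 102 106 106)],
    2 => [qw(101 104 106 110 113 129 131 148)],
);

# Turn a list of paper IDs into a frequency vector (hashref).
sub to_vector {
    my %freq;
    $freq{$_}++ for @{ $_[0] };
    return \%freq;
}

# Cosine similarity: dot product over the product of the norms.
sub cosine {
    my ($u, $v) = @_;
    my ($dot, $nu, $nv) = (0, 0, 0);
    $dot += ($u->{$_} || 0) * $v->{$_} for keys %$v;
    $nu  += $_ ** 2 for values %$u;
    $nv  += $_ ** 2 for values %$v;
    return $nu && $nv ? $dot / sqrt($nu * $nv) : 0;
}

my $sim = cosine(to_vector($word{1}), to_vector($word{2}));
printf "cosine(word 1, word 2) = %.3f\n", $sim;   # 0.333
```

Words that tag many of the same papers score near 1; words with no papers in common score 0.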

Re: word similarity measure
by bruno (Friar) on Feb 27, 2009 at 16:32 UTC
    Do you know the words, or are they unknown to you?

    Do you assume that, because two words appeared in the same paper, they must be identical or similar in some way?

    What do you mean by 'keywords', and 'papers'? Do they have the same meaning as in scientific research, where 'paper' is a scientific article and 'keyword' is a research subject, organism of interest, or method?

Re: word similarity measure
by kennethk (Monsignor) on Feb 27, 2009 at 16:33 UTC

    Ignoring subtleties about how you may have developed your keyword->index mapping, the easiest way to measure the similarity would be to generate a hash with your word identifiers as keys and then brute force a similarity array. Something like:

    use strict;
    use warnings;

    my @count;
    for my $i_word (1 .. $#words) {
        for my $j_word (0 .. $i_word - 1) {
            $count[$i_word][$j_word] = 0;
            foreach (keys %{ $paper{$i_word} }) {
                if (exists $paper{$j_word}{$_}) {
                    $count[$i_word][$j_word]++;
                }
            }
        }
    }

    If you aren't familiar with lists of lists, take a gander at perllol.
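A self-contained version of the counting approach above, filled in with the data from the question (the %lists name and the example check are my own additions, not part of the OP's data structures):

```perl
use strict;
use warnings;

# Paper lists from the question; %paper maps each word ID to a
# set (hash) of the distinct papers that word tags.
my %lists = (
    1 => [100, 101, 101, 102, 102, 102, 106, 106],
    2 => [101, 104, 106, 110, 113, 129, 131, 148],
    3 => [101, 153, 175, 180, 381],
    4 => [106, 110, 113, 122, 131, 137, 142, 148],
    5 => [120, 165, 169],
);
my %paper;
$paper{$_} = { map { $_ => 1 } @{ $lists{$_} } } for keys %lists;

# Brute-force pairwise overlap counts, lower-triangular as above.
my @words = sort keys %lists;
my @count;
for my $i (1 .. $#words) {
    for my $j (0 .. $i - 1) {
        $count[$i][$j] = grep { exists $paper{ $words[$j] }{$_} }
                         keys %{ $paper{ $words[$i] } };
    }
}

# Words 2 and 4 share papers 106, 110, 113, 131 and 148:
print "shared papers (word 2, word 4): $count[3][1]\n";   # 5
```

The `grep` in scalar context counts the matching keys, which replaces the explicit `foreach` increment loop.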

Re: word similarity measure
by Old_Gray_Bear (Bishop) on Feb 27, 2009 at 16:56 UTC
    You need to define what you mean by "similarity".

    At first glance words 1, 2, and 4 are 'similar' since they each have the same number of sub-components. A second glance reveals that words 1, 2, and 3 are 'similar' - they each contain '101'. And words 2 and 4 are 'similar' - they are the only words that contain 148 and 131.

    I suspect that once you have defined your terms, you will be able to write a function that takes two words and returns the degree of 'similarity' between them. Once you have all of the pair-wise ratings computed, sort() will let you rank the word pairs from most alike to least alike.
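One common choice for such a pairwise function, assuming 'similar' is defined as sharing papers, is the Jaccard index: shared papers divided by total distinct papers. A minimal sketch (the function name and example data labels are mine):

```perl
use strict;
use warnings;

# Jaccard index of two paper lists: |intersection| / |union|.
sub jaccard {
    my ($x, $y) = @_;
    my %x = map { $_ => 1 } @$x;
    my %y = map { $_ => 1 } @$y;
    my $shared = grep { $y{$_} } keys %x;   # scalar grep = count
    my %union  = (%x, %y);
    return $shared / keys %union;
}

# Words 2 and 4 from the question:
my @w2 = (101, 104, 106, 110, 113, 129, 131, 148);
my @w4 = (106, 110, 113, 122, 131, 137, 142, 148);
printf "jaccard(word 2, word 4) = %.3f\n", jaccard(\@w2, \@w4);   # 0.455
```

It ranges from 0 (no shared papers) to 1 (identical paper sets), which makes the pairwise ratings directly comparable for sorting.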

    This sounds like the kind of problem a plagiarism detector is designed for.

    ----
    I Go Back to Sleep, Now.

    OGB

      This sounds like the kind of problem a plagiarism detector is designed for.

      If, in fact, that is what the OP is after, s/he may benefit from looking at the nodes mentioned here: Re: Finding plagarized content.


      Update: I rather think, OTOH, that the OP may be looking for something more like Ted Pedersen's SenseClusters (more)...

      HTH,

      planetscape
