http://www.perlmonks.org?node_id=746935

karey3341 has asked for the wisdom of the Perl Monks concerning the following question:


If my data looks like this:

word 1: 100 101 101 102 102 102 106 106

word 2: 101 104 106 110 113 129 131 148

word 3: 101 153 175 180 381

word 4: 106 110 113 122 131 137 142 148

word 5: 120 165 169


Here words 1 through 5 represent different words, and the numbers are the IDs of the papers in which each word has been used as a keyword.

How can I calculate similarity between these words?

Re: word similarity measure
by tilly (Archbishop) on Feb 27, 2009 at 16:32 UTC

      Or perhaps Vector Space

      This module takes a list of documents (in English) and builds a simple in-memory search engine using a vector space model. Documents are stored as PDL objects, and after the initial indexing phase, the search should be very fast. This implementation applies a rudimentary stop list to filter out very common words, and uses a cosine measure to calculate document similarity.
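To make the cosine measure concrete for the data in the question, here is a minimal sketch in plain Perl. It does not use the module described above or its API. The helper names `count_vector`, `norm`, and `cosine` are made up for illustration, and treating a repeated paper ID as a higher count is an assumption about what the duplicates mean.

```perl
use strict;
use warnings;

# Turn a list of paper IDs into a sparse count vector (paper ID => count).
# Assumption: a repeated ID means the word was used more than once.
sub count_vector {
    my %v;
    $v{$_}++ for @_;
    return \%v;
}

# Euclidean length of a sparse vector.
sub norm {
    my ($v) = @_;
    my $sum = 0;
    $sum += $_ ** 2 for values %$v;
    return sqrt $sum;
}

# Cosine of the angle between two sparse vectors: 1 means the same
# direction, 0 means no papers in common.
sub cosine {
    my ($u, $v) = @_;
    my $dot = 0;
    $dot += $u->{$_} * ($v->{$_} // 0) for keys %$u;
    my $len = norm($u) * norm($v);
    return $len ? $dot / $len : 0;
}

my $w1 = count_vector(qw(100 101 101 102 102 102 106 106));
my $w2 = count_vector(qw(101 104 106 110 113 129 131 148));
my $w4 = count_vector(qw(106 110 113 122 131 137 142 148));

printf "word 2 vs word 4: %.3f\n", cosine($w2, $w4);   # 0.625
printf "word 1 vs word 2: %.3f\n", cosine($w1, $w2);   # 0.333
```

Words 2 and 4 share five of their eight papers, so their cosine is 5/8. Word 1 scores lower against word 2 because most of its weight sits on paper 102, which word 2 never uses.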

Re: word similarity measure
by Old_Gray_Bear (Bishop) on Feb 27, 2009 at 16:56 UTC
    You need to define what you mean by "similarity".

    At first glance words 1, 2, and 4 are 'similar' since they each have the same number of sub-components. A second glance reveals that words 1, 2, and 3 are 'similar': they each contain '101'. And words 2 and 4 are 'similar': they are the only words that contain 148 and 131.

    I suspect that once you have defined your terms, you will be able to write a function that takes two words and returns the degree of 'similarity' between them. Once you have all of the pair-wise ratings computed, sort() will let you rank the word pairs from most alike to least.
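As one possible reading of "similarity", the sketch below rates each pair of words by the Jaccard index of their paper sets (shared papers divided by all papers used by either word) and then sorts the pairs. The choice of Jaccard is an assumption, not something the question specifies, and the names `%papers` and `jaccard` are hypothetical.

```perl
use strict;
use warnings;

# Paper IDs per word, copied from the question.
my %papers = (
    'word 1' => [100, 101, 101, 102, 102, 102, 106, 106],
    'word 2' => [101, 104, 106, 110, 113, 129, 131, 148],
    'word 3' => [101, 153, 175, 180, 381],
    'word 4' => [106, 110, 113, 122, 131, 137, 142, 148],
    'word 5' => [120, 165, 169],
);

# Jaccard index: size of the intersection over size of the union (0..1).
# Assumption: duplicates are collapsed, so each paper counts once per word.
sub jaccard {
    my ($list_a, $list_b) = @_;
    my %in_a = map { $_ => 1 } @$list_a;
    my %in_b = map { $_ => 1 } @$list_b;
    my $shared = grep { $in_b{$_} } keys %in_a;
    my $union  = keys(%in_a) + keys(%in_b) - $shared;
    return $union ? $shared / $union : 0;
}

# Rate every unordered pair of words once.
my @words = sort keys %papers;
my @pairs;
for my $i (0 .. $#words - 1) {
    for my $j ($i + 1 .. $#words) {
        my ($a_word, $b_word) = @words[$i, $j];
        push @pairs, [ $a_word, $b_word,
                       jaccard($papers{$a_word}, $papers{$b_word}) ];
    }
}

# Rank the pairs from most alike to least.
my @ranked = sort { $b->[2] <=> $a->[2] } @pairs;
printf "%s / %s: %.3f\n", @$_ for @ranked;
```

With this definition the top pair is word 2 / word 4 (5 shared papers out of 11, about 0.455), and every pair involving word 5 scores 0. Swapping in a different `jaccard`-style function changes the ranking without touching the pairing or sorting code.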

    This sounds like the kind of problem a plagiarism detector is designed for.

    ----
    I Go Back to Sleep, Now.

    OGB

      This sounds like the kind of problem a plagiarism detector is designed for.

      If, in fact, that is what the OP is after, s/he may benefit from looking at the nodes mentioned here: Re: Finding plagarized content.


      Update: I rather think, OTOH, that the OP may be looking for something more like Ted Pedersen's SenseClusters (more)...

      HTH,

      planetscape
Re: word similarity measure
by etj (Curate) on Jun 03, 2022 at 22:18 UTC
Re: word similarity measure
by kennethk (Abbot) on Feb 27, 2009 at 16:33 UTC

    Ignoring subtleties about how you may have developed your keyword->index mapping, the easiest way to measure the similarity would be to generate a hash with your word identifiers as keys and then brute force a similarity array. Something like:

    my @count;
    for my $i_word (1 .. $#words) {
        for my $j_word (0 .. $i_word - 1) {
            # Number of papers that words i and j have in common
            $count[$i_word][$j_word] = 0;
            foreach (keys %{ $paper{$i_word} }) {
                if (exists $paper{$j_word}{$_}) {
                    $count[$i_word][$j_word]++;
                }
            }
        }
    }

    If you aren't familiar with lists of lists, take a gander at perllol.
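The snippet above takes `@words` and `%paper` as given. A minimal sketch of one way to fill them from lines shaped like those in the question follows. The `@lines` array and the choice to key `%paper` by position in `@words` are assumptions made to match the loop indices above.

```perl
use strict;
use warnings;

# Input lines in the "name: id id id ..." shape shown in the question.
my @lines = (
    'word 1: 100 101 101 102 102 102 106 106',
    'word 2: 101 104 106 110 113 129 131 148',
    'word 3: 101 153 175 180 381',
    'word 4: 106 110 113 122 131 137 142 148',
    'word 5: 120 165 169',
);

my (@words, %paper);
for my $line (@lines) {
    # Skip anything that does not look like "name: ids".
    my ($word, $ids) = $line =~ /^(.+?):\s*(.*)$/ or next;
    push @words, $word;
    # Key by the word's index in @words; the inner hash acts as a set
    # of paper IDs, so repeated IDs collapse to one entry.
    $paper{$#words}{$_} = 1 for split ' ', $ids;
}

printf "%s appears in %d distinct papers\n",
    $words[$_], scalar keys %{ $paper{$_} } for 0 .. $#words;
```

Storing the IDs as hash keys is what makes the `exists $paper{$j_word}{$_}` test in the counting loop a constant-time lookup.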

Re: word similarity measure
by bruno (Friar) on Feb 27, 2009 at 16:32 UTC
    Do you know the words, or are they unknown to you?

    Do you assume that, because two words appeared in the same paper, they must be identical or similar in some way?

    What do you mean by 'keywords', and 'papers'? Do they have the same meaning as in scientific research, where 'paper' is a scientific article and 'keyword' is a research subject, organism of interest, or method?