Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

Re: Brainstorming session: detecting plagiarism

by blokhead (Monsignor)
on Jun 08, 2005 at 20:10 UTC ( #464824=note: print w/ replies, xml ) Need Help??


in reply to Brainstorming session: detecting plagiarism

I like the idea of comparing each pair of sentences for similarity. There are several metrics for sentence similarity that come to mind:

Edit distance & longest-common subsequence. These two are pretty similar: look for an edit distance less than a certain percent of the sentence length, or an LCS larger than a certain percent.

As you are doing now, I would do this on a word-by-word basis, and not character-by-character. However, these algorithms can be generalized a bit, so that instead of each pair of words either agreeing or disagreeing, each pair can have a fractional level of agreement between 0 and 1. If you implemented a "synonym measurer", you could easily plug this into such generalized LCS/Levenshtein algorithms. (This could also quite easily encompass changes in word stemming as well as synonimity.)

Another metric you can use for sentence similarity is Zaxo's favorite method: using compression & information theory, although you may not be able to pull out as much information about *how* the sentences are similar as in the algorithms above.

blokhead


Comment on Re: Brainstorming session: detecting plagiarism

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://464824]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (6)
As of 2015-07-04 09:40 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (59 votes), past polls