
Re: Brainstorming session: detecting plagiarism

by blokhead (Monsignor)
on Jun 08, 2005 at 20:10 UTC (#464824)

in reply to Brainstorming session: detecting plagiarism

I like the idea of comparing each pair of sentences for similarity. There are several metrics for sentence similarity that come to mind:

Edit distance and longest common subsequence (LCS). These two are closely related: flag a pair of sentences when their edit distance is below a certain percentage of the sentence length, or when their LCS is longer than a certain percentage of it.

As you are doing now, I would do this word by word rather than character by character. However, these algorithms can be generalized a bit: instead of each pair of words either agreeing or disagreeing, each pair can have a fractional level of agreement between 0 and 1. If you implemented a "synonym measurer", you could easily plug it into such a generalized LCS/Levenshtein algorithm. (The same mechanism could quite easily cover differences in word stemming as well as synonymy.)
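To make the generalized LCS idea concrete, here is a minimal sketch in Python. Everything named here is my own illustration, not something from the thread: `word_agreement` is a crude, hypothetical stand-in for a real synonym measurer (it only knows exact matches and a shared four-letter prefix as a poor man's stemmer), and `sentence_similarity` normalizes by the longer sentence, which is just one reasonable choice.

```python
from typing import Callable, Sequence


def word_agreement(a: str, b: str) -> float:
    """Hypothetical placeholder for a synonym measurer.

    Returns 1.0 for identical words (ignoring case), 0.5 when both words
    share their first four letters (a very crude stemming check), else 0.0.
    """
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0
    if len(a) >= 4 and len(b) >= 4 and a[:4] == b[:4]:
        return 0.5
    return 0.0


def weighted_lcs(
    xs: Sequence[str],
    ys: Sequence[str],
    agree: Callable[[str, str], float] = word_agreement,
) -> float:
    """LCS where aligning two words earns agree(x, y) instead of 0 or 1."""
    prev = [0.0] * (len(ys) + 1)
    for x in xs:
        cur = [0.0]
        for j, y in enumerate(ys, start=1):
            cur.append(max(
                prev[j],                       # skip x
                cur[j - 1],                    # skip y
                prev[j - 1] + agree(x, y),     # align x with y
            ))
        prev = cur
    return prev[-1]


def sentence_similarity(
    s1: str,
    s2: str,
    agree: Callable[[str, str], float] = word_agreement,
) -> float:
    """Weighted LCS score as a fraction of the longer sentence's word count."""
    xs, ys = s1.split(), s2.split()
    if not xs or not ys:
        return 0.0
    return weighted_lcs(xs, ys, agree) / max(len(xs), len(ys))
```

With an agreement function that only ever returns 0 or 1, this reduces to the ordinary LCS length, so a pair of sentences could be flagged whenever `sentence_similarity` exceeds whatever percentage threshold you settle on.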

Another metric you can use for sentence similarity is Zaxo's favorite method: using compression & information theory, although you may not be able to pull out as much information about *how* the sentences are similar as in the algorithms above.
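For the compression approach, one well-known measure in this family is the normalized compression distance. I am assuming that is roughly the kind of thing meant here; the sketch below is not a description of Zaxo's actual method. It uses Python's standard `zlib` module, and the function names are my own.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings.

    Lower values mean the second string adds little new information once
    the first is known, i.e. the two are more similar.
    """
    x, y = a.encode("utf-8"), b.encode("utf-8")
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

On single sentences the compressor's fixed overhead is a large share of the output, so the scores are noisy; comparing whole paragraphs or documents is likely to be more reliable. As noted above, the result is a single number and says nothing about which words or phrases match.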

