Brainstorming session: detecting plagiarism
by Ovid (Cardinal) on Jun 08, 2005 at 19:25 UTC
Ovid has asked for the
wisdom of the Perl Monks concerning the following question:
Recently I discovered a couple of articles on a news Web site where one author had clearly plagiarized from another. After searching manually for other instances, I realized that this was something that needed to be automated. After doing a bit of research, I decided to write my own software to detect plagiarism. What follows is a rough first attempt at this.
There are two major problems with detecting plagiarism. The first is collecting the documents for comparison. The second is comparing those documents. Initially, I am focusing on the second problem.
My first attempt at the problem detects the plagiarism listed in the articles I found and it also detects some text that I deliberately plagiarized. The current usage is like this:
It can be tested with some hand-crafted plagiarism samples in my use.perl journal. The interface is a bit sloppy and I intend to clean that up, but what I'm looking for now is advice/suggestions for improving this code. (Be gentle, I know there are serious limitations in this.)
It works by converting individual words in sentences to one-character tokens (yeah, there's an ugly limit here), joining the tokens in a string and letting String::Approx see how far apart two strings are.
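For anyone who wants to experiment with the idea, here is a minimal sketch of that approach. It is written in Python rather than Perl, it is not the code from this post, and it uses a hand-rolled Levenshtein distance as a stand-in for String::Approx. The function names (`encode`, `edit_distance`, `similarity`) and the choice to normalise by the longer sentence are assumptions made for illustration.

```python
import re


def encode(sentence, codes):
    """Turn a sentence into a string with one character per word.

    `codes` maps word -> character and is shared between the sentences
    being compared, so the same word always gets the same token.
    The alphabet is limited by the number of available code points,
    which mirrors the limit mentioned above.
    """
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    out = []
    for word in words:
        if word not in codes:
            # Assumption: start at U+0100 to stay clear of control characters.
            codes[word] = chr(0x100 + len(codes))
        out.append(codes[word])
    return "".join(out)


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def similarity(s1, s2):
    """Word-level similarity in [0, 1]; 1.0 means identical word sequences."""
    codes = {}
    a, b = encode(s1, codes), encode(s2, codes)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest
```

Because each word becomes a single character, one edit in the token string corresponds to one inserted, dropped or replaced word, so a threshold on `similarity` works as a rough "how many words were changed" test.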
In the future, I plan to add stemming ("people/person", "children/child") and stop words, pretty HTML reports, etc. I also would like to allow people to specify minimum sentence length, different hash functions, how to split the text and which language the text is written in (for stemming, stop words and synonyms). Any advice here would be great.
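To make the stemming and stop-word idea concrete, here is a rough Python sketch of a normalisation step that could run before tokenizing. Everything in it is an assumption for illustration: the stop list is a tiny sample, the suffix stripping is deliberately naive, and irregular forms such as "people/person" are handled with a hypothetical lookup table. A real implementation would use a proper, language-specific stemmer.

```python
import re

# Illustrative only: real stop lists are much longer and language specific.
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is", "was"}

# Irregular forms cannot be reached by suffix stripping, so look them up.
IRREGULAR = {"people": "person", "children": "child"}

# Naive English suffixes, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Very crude stemmer: irregular lookup first, then strip one suffix."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        # Keep at least three characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def normalize(sentence, min_words=0):
    """Lowercase, drop stop words, stem; return [] for too-short sentences."""
    words = re.findall(r"[a-z']+", sentence.lower())
    stems = [stem(w) for w in words if w not in STOP_WORDS]
    return stems if len(stems) >= min_words else []
```

The `min_words` parameter stands in for the minimum sentence length option: very short sentences match each other by accident far too often to be useful evidence.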
One thing I would really like to do is handle word substitution. This involves knowing that the following two sentences could be indicative of plagiarism (taken from the link above):
That involves having a list of synonyms. I'm not sure of the best way to approach that, particularly if I'm trying to combine that with stemming since the latter is language specific. Also, perhaps word frequency is important, but I don't know how to account for this, either. Currently, "magnetosphere" and "cat" are equally important in comparison, even though "cat" is a far more common word.
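One possible way to approach both questions, sketched in Python under stated assumptions: map every word to a canonical synonym before comparing, and weight words by a smoothed inverse document frequency so that rare words count for more than common ones. The `SYNONYMS` table is hypothetical, the weighting formula is one common variant rather than the only choice, and the overlap measure ignores word order, so it would complement the edit-distance comparison rather than replace it.

```python
import math
from collections import Counter

# Hypothetical synonym table: every entry points at one canonical word.
# With stemming in play, this lookup would be applied to the stems.
SYNONYMS = {"feline": "cat", "kitty": "cat"}


def canonical(word):
    """Replace a word by its canonical synonym, if it has one."""
    return SYNONYMS.get(word, word)


def idf_weights(documents):
    """Weight each word by how rare it is across `documents`.

    `documents` is a list of word lists. A word that appears in every
    document gets weight 1.0; rarer words get larger weights.
    """
    n = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update({canonical(w) for w in doc})  # count once per document
    return {w: math.log((1 + n) / (1 + c)) + 1 for w, c in doc_freq.items()}


def weighted_overlap(a, b, weights, default=1.0):
    """Weighted Jaccard overlap of two word lists, in [0, 1]."""
    set_a = {canonical(w) for w in a}
    set_b = {canonical(w) for w in b}

    def total(words):
        return sum(weights.get(w, default) for w in words)

    union = total(set_a | set_b)
    return total(set_a & set_b) / union if union else 0.0
```

With weights like these, two sentences that share "magnetosphere" score higher than two that share only "cat", which is the behaviour asked for above.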
Thanks in advance!