Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Brainstorming session: detecting plagiarism

by Ovid (Cardinal)
on Jun 08, 2005 at 19:25 UTC ( #464805=perlquestion: print w/ replies, xml ) Need Help??
Ovid has asked for the wisdom of the Perl Monks concerning the following question:

Recently I discovered a couple of articles on a news Web site where one author had clearly plagiarized from another. After searching manually for other instances, I realized that this was something that needs to be automated. After doing a bit of research, I decided to write my own software to detect plagiarism. What follows is a rough first attempt at this.

There are two major problems with detecting plagiarism. The first is to collect the documents for comparison. The second is comparing those documents. Initially, I am focusing on the second problem.

My first attempt at the problem detects the plagiarism listed in the articles I found and it also detects some text that I deliberately plagiarized. The current usage is like this:

my $text = Text::Plagiarized->new; $text->original($original_text); foreach my $comparison (@comparison_texts) { # the following two lines should probably be collapsed to # $text->compare($comparison); $text->comparison($comparison); $text->analyze; print $text->percent, $/; # percent of matching sentences if ($text->percent > $some_threshold) { # arrayref of array refs with [$sentence, $possible_match] print Dumper($text->matches); } }

It can be tested with some hand-crafted plagiarism samples in my use.perl journal. The interface is a bit sloppy and I intend to clean that up, but what I'm looking for now is advice/suggestions for improving this code. (Be gentle, I know there are serious limitations in this.)

package Text::Plagiarized; $REVISION = '$Id: Plagiarized.pm,v 1.0 2003/07/13 19:15:57 ovid Exp $' +; $VERSION = '0.01'; use 5.006; use strict; use warnings; use String::Approx qw/amatch/; use Text::Sentence qw/split_sentences/; sub new { my $class = shift; my $self = bless { original => {}, comparison => {}, matches => [], threshold => 80, } => $class; } sub original { my ($self, $text) = @_; local $_ = $text; $self->{original} = { text => $text, sentences => [split_sentences($text)], }; return $self; } sub comparison { my ($self, $text) = @_; local $_ = $text; $self->{comparison} = { text => $text, sentences => [split_sentences($text)], }; return $self; } my %percentage = map { $_ => 1 } 0 .. 100; # wow. This is a cheap hac +k sub threshold { my $self = shift; if (@_) { my $num = shift; unless (exists $percentage{$num}) { require Carp; Carp::croak("threshold must be an integer between 0 and 10 +0, inclusive"); } $self->{threshold} = 100 - $num; } $self->{threshold}; } sub analyze { my $self = shift; my @sentences; my $threshold = $self->threshold; foreach my $sentence1 (@{$self->{original}{sentences}}) { foreach my $sentence2 (@{$self->{comparison}{sentences}}) { my ($hash1, $hash2) = _hash($sentence1, $sentence2); if ($hash1 eq $hash2 || amatch($hash1, ["$threshold%"], $h +ash2)) { push @sentences => [$sentence1 => $sentence2]; last; } } } $self->{matches} = \@sentences; } sub matches { shift->{matches} } sub percent { my $self = shift; my $precision = shift || 0; my $matches = @{$self->matches}; my $sentences = @{$self->{original}{sentences}}; sprintf "%.${precision}f" => ($matches/$sentences) * 100; } # starts to break down if we have more than 26 different words # use Unicode characters? # stop words? # memoize this sub _hash { my @string = map lc $_ => @_; s/[^[:alnum:][:space:]]//g foreach @string; s/[[:space:]]+/ /g foreach @string; my %words; my $letter = 'a'; s/(\S+)/ unless (exists $words{$1}) { $words{$1} = $letter; $letter++; } $words{$1} /eg foreach @string; s/ //g foreach @string; return @string; } 1;

It works by converting individual words in sentences to one-character tokens (yeah, there's an ugly limit here), joining the tokens in a string and letting String::Approx see how far apart two strings are.

In the future, I plan to add stemming ("people/person", "children/child") and stop words, pretty HTML reports, etc. I also would like to allow people to specify minimum sentence length, different hash functions, how to split the text and which language the text is written in (for stemming, stop words and synonyms). Any advice here would be great.

One thing I would really like to do is handle word substitution. This involves knowing that the following two sentences could be indicative of plagiarism (taken from the link above):

Macbeth is presented as a mature man of definitely established cha +racter. Macbeth is shown as an empowered man of well- established cha +racter.

That involves having a list of synonyms. I'm not sure of the best way to approach that, particularly if I'm trying to combine that with stemming since the latter is language specific. Also, perhaps word frequency is important, but I don't know how to account for this, either. Currently, "magnetosphere" and "cat" are equally important in comparison, even though "cat" is a far more common word.

Thanks in advance!

Cheers,
Ovid

New address of my CGI Course.

Comment on Brainstorming session: detecting plagiarism
Select or Download Code
Re: Brainstorming session: detecting plagiarism
by halley (Prior) on Jun 08, 2005 at 19:35 UTC
    Professors use essay-comparing software already. It's not new, but it does a pretty effective job. It's pretty straightforward to look for matching fragments, even tiny fragments like Markov chains or statistically improbable word pairs. However, if you compare some feeds, say, Reuters to Associated Press on the same day, even those criteria would fall apart.

    Separately, you might need to refine your personal definition of plagiarism. I'd say that the two examples below show that a fairly mechanical paraphrasing is going on, but it's not clear that it would rise to the definition of plagiarism. Also, how do you allow for proper quotations?

    Macbeth is presented as a mature man of definitely established cha +racter. Macbeth is shown as an empowered man of well- established cha +racter.

    --
    [ e d @ h a l l e y . c c ]

      Eventually, this code will be released with full documentation. One of the first things the documentation would make clear is that matching text does not necessarily mean plagiarism. Instead, the person looking at the text would have to compare the two documents (with the HTML linking I hope to provide) and determine for themselves if plagiarism took place. My software will not be able to tell whether or not someone gave proper credit for a particular passage.

      If the above was the only sentence in a 10,000 word document, I wouldn't say it's plagiarism. If that and several other sentences grouped together in one paragraph have a decent match and there's no attribution, then that's something which merits further study. Deciding whether or not plagiarism has occurred is not something software can do. It can merely flag likely candidates and will always have false positives and negatives.

      And I'm aware that professors already have software to do this. The free software I've seen is very limited. (One merely does a "longest substring" match.) I'd like to provide free tools for them.

      Cheers,
      Ovid

      New address of my CGI Course.

      By the way, do you have any information about calculating statistically improbable word pairs? I would be most fascinated with that. I'd like to create an architechture whereby people could, at the potential cost of performance, pick and choose which features they would like to use when comparing. This sounds like a great choice.

      Cheers,
      Ovid

      New address of my CGI Course.

        There are many lexicons out there, and they often include a ranking by frequency found in a large source such as the Bible or the New York Times. One such popular lexicon for English is the Moby Project, and it includes two such rankings. Google will give you hints there.

        To find statistically improbable word pairs, one method is trivial: you take the product of word frequencies for each consecutive pair of words, and search for the smallest results. For example, "statistically=0.0004" and "improbable=0.0003" would give a very statistically improbable 0.00000012, and yet, this posting uses that phrase more than once. It's a pretty good indicator of a work's overall topics and themes.

        --
        [ e d @ h a l l e y . c c ]

        You might also want to check out Ted Pedersen's Ngram Statistics Package, with regard to the problem of improbable word pairs. The output can be easily sorted to highlight least likely occurrences. Of course you would want to compare to a corpus (of written English, say), to get a fairly good idea of "normal" parameters.

        Good luck, and keep us posted, please!

        planetscape
Re: Brainstorming session: detecting plagiarism
by zentara (Archbishop) on Jun 08, 2005 at 20:05 UTC
    Recently I discovered a couple of articles on a news Web site where one author had clearly plagiarized from another. After searching manually for other instances, I realized that this was something that needs to be automated

    :-) It sounds like you are saying the plagiarism needs automating.

    I think alot of students agree. ;-)


    I'm not really a human, but I play one on earth. flash japh
      Just what I was thinking! Great idea..!

      No doubt, as more intelligent anti-plagiarising software is developed there will be greater demand for auto-plagiarising software (that runs a thesauraus on every noun and verb and perhaps re-arranges sentences gramatically).

      There's a fine line, however, between detecting plagiarising and matching on anything approximate.. being too clever can sometimes cause headaches.

      At university I found it was mostly English-As-A-Second-Language students that cut and paste from other text books. So much it was painfully obvious to any native speaker of English (where text book language often differs from summary/report language).

      I don't envy the exam marker's job. Imagine having to read essays anyway!

        You are right... an improved "anti-plagiarism software" will result in advances in "plagiarism-detection-avoidance". It's sort of like a "cold-war" between teachers and students.

        I was watching one of those "TV-exposes" on "Buying term papers", and I can just see them selling "plagiarism-detection insurance" as an extra add-on cost. :-)

        Plus its a better business to be in.....there are always a fresh supply of "student-customers", but a limited market of teachers. ;-)


        I'm not really a human, but I play one on earth. flash japh
Re: Brainstorming session: detecting plagiarism
by blokhead (Monsignor) on Jun 08, 2005 at 20:10 UTC
    I like the idea of comparing each pair of sentences for similarity. There are several metrics for sentence similarity that come to mind:

    Edit distance & longest-common subsequence. These two are pretty similar: look for an edit distance less than a certain percent of the sentence length, or an LCS larger than a certain percent.

    As you are doing now, I would do this on a word-by-word basis, and not character-by-character. However, these algorithms can be generalized a bit, so that instead of each pair of words either agreeing or disagreeing, each pair can have a fractional level of agreement between 0 and 1. If you implemented a "synonym measurer", you could easily plug this into such generalized LCS/Levenshtein algorithms. (This could also quite easily encompass changes in word stemming as well as synonimity.)

    Another metric you can use for sentence similarity is Zaxo's favorite method: using compression & information theory, although you may not be able to pull out as much information about *how* the sentences are similar as in the algorithms above.

    blokhead

Re: Brainstorming session: detecting plagiarism
by CountZero (Bishop) on Jun 08, 2005 at 21:13 UTC
    An interesting approach! I particularly liked the simple and elegant way of "hashing" the sentences and calculating the "distance" between them.

    Another approach I was thinking of is moving a "sliding window" with a length of 3 or 4 words over the texts to be compared. Every list of 3 or 4 words is stored in a hash (with these words concatenated into one as the key and the number of occurences as the value) and thereafter the hashes of both texts are compared to each other; every "match" would give plagiarism-points and every non-match will give originality points: comparing the ratio of the plagiarism-score to the originality-score one can perhaps use this as another metric.

    It would be a more abstract metric than your comparison, but it would perhaps be less prone to small deliberate changes or use of synonyms. In that respect one could vary the "sliding window" to look at e.g. first, second and fourth words (skipping the third word) so as to account for substitutions.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Brainstorming session: detecting plagiarism
by bluto (Curate) on Jun 08, 2005 at 22:00 UTC
    ++. FWIW, over ten years ago a former grad professor of mine talked about detecting plagiarized source code, and it looks like he's published papers about it in the mean time. I imagine there is still a lot of research interest in that area as well (i.e. mainly because success brings $$$ to lawyers :-).
        Todd Proebsting a compiler guy. I'm pretty sure I found a reference to plagiarism detection just the other day, but am having trouble finding it now. So either I'm going senile or Google is playing mind games with me (probably both). In any case, there is a paper at that link about recovering java source from byte code, which is mildly on-topic. Also, check out his discussion of "Proebsting's Law" for compiler optimizations.
Re: Brainstorming session: detecting plagiarism
by BrowserUk (Pope) on Jun 08, 2005 at 22:02 UTC

    You might be interested in a technique I was playing with a year or so ago call I-match signatures. This involves performing similarity-based duplicate detection by using rolling "shingles" to produce a single hash value for a document. The technique is claimed (and seemed to be so to me) to be much less sensitive to simple transpositions of word ordering, than the distance-based values you are using.

    Basically it builds a lexicon and rates document closeness in terms of the ratio of rare terms relative to that terms frequency within the lexicon.

    Anyway, there is a web page with an extensive bibliography and one paper that I had bookmarked as very interesting,that you might like to read.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
Re: Brainstorming session: detecting plagiarism
by rob_au (Abbot) on Jun 09, 2005 at 00:04 UTC
    There was an excellent article on this very topic, using perl to detect plagiarism (specifically for perl code), in the March 2004 edition of The Perl Journal - If you have a login, it can be downloaded from http://www.tpj.com/documents/tpj0403b/

     

    perl -le "print unpack'N', pack'B32', '00000000000000000000001000000000'"

Re: Brainstorming session: detecting plagiarism
by sfink (Deacon) on Jun 09, 2005 at 00:36 UTC
Re: Brainstorming session: detecting plagiarism
by ww (Bishop) on Jun 09, 2005 at 03:49 UTC
Re: Brainstorming session: detecting plagiarism
by salva (Monsignor) on Jun 09, 2005 at 07:41 UTC
    Maybe, Google "Similar pages" links, could be used to get a first set of pages where to look for some specific plagiarized content.
Re: Brainstorming session: detecting plagiarism
by samizdat (Vicar) on Jun 09, 2005 at 12:37 UTC
    First off, Ovid, ++ for taking on a worthy challenge. I really like the thrust of the thread you've started and the code you've presented.

    My impression of plagiarism is that it's usually localized, and that it is very worthwhile to look for hot paragraphs. I can see that this would be an easy step forward for you from where you've left off.

    Kudos! :D
Re: Brainstorming session: detecting plagiarism
by Your Mother (Canon) on Jun 10, 2005 at 01:53 UTC

    Vector space comparisons, like I think the doc BrowserUK pointed to shows, are pretty easy. I played with them a bit (2 years back) and got really cool and easily tweaked results. I learned about it and got going from a nice article on Perl.com, "Building a Vector Space Search Engine in Perl."

    I really hope you'll keep us updated if you make progress into this. I was plagiarized at school once and it really stung. Go get 'em.

Re: Brainstorming session: detecting plagiarism
by Anonymous Monk on Jun 11, 2005 at 01:32 UTC
    There is a little experiment that is instructive in this case: Take a text (any text), get the histogram of its characters (that means how many of each are there), then of every pair (please note "abc" has the pairs "ab" and "bc"), then of any triple and so on. Obviously if you go all the way to the length of the text, it will be possible to reconstruct the text from the set of histograms. Now the real test begins: How large a set (how many histograms) do you need to reconstruct the text (approximately)?
    To reconstruct the text use a random number generator to output letters checking that all statistic properties of the set of histograms are met by the constructed string.
    The interesting result is that most texts need only 9 histograms. What if you only compare the histograms?

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://464805]
Front-paged by tlm
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (12)
As of 2014-07-25 17:33 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (174 votes), past polls