PerlMonks  

Is it possible to find the matching words and the percentage of matching words between two texts?

by supriyoch_2008 (Scribe)
on Dec 21, 2012 at 07:56 UTC (#1009885)
supriyoch_2008 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perlmonks,

I would like to find the words that match between an original text and a new text, and the percentage of matching words over the total number of words in the new text. For example, I have an original text like $a="Poet Blake had a milky white cat. He used to call it Pussy."; and a new text like $b="Poet Blake had a white cat and used to call it Pussy."; Eleven words match between these two texts: Poet, Blake, had, a, white, cat, used, to, call, it, Pussy. There are 12 words in the new text $b, so the percentage of matching words over the new text is (11/12)*100, i.e. 91.67%. Is it possible to get these results using a Perl program? This question continues one of my earlier nodes. I tried a script that used a module called plagiarized.pm, but it failed. I received some suggestions on this from PerlMonks earlier, but I could not turn them into a working script, and I am at my wit's end. Is there a simple Perl script that finds the matching words between two texts and the percentage of matching words?

Re: Is it possible to find the matching words and the percentage of matching words between two texts?
by McDarren (Abbot) on Dec 21, 2012 at 08:48 UTC

    A simple approach would be to build two hashes from the strings, and then compare the hashes.

    So you might do something like:
    my %foo;
    my $string = 'Poet Blake had a milky white cat. He used to call it Pussy.';
    for my $word (split /\s+/, $string) {
        $foo{$word}++;
    }

    You do the same for the second string (building a second hash, %bar), and then to compare you simply iterate through one of the hashes and increment a counter for each word that is also present in the other hash. Something like so:

    my $cnt = 0;
    for my $word (keys %foo) {
        $cnt++ if $bar{$word};
    }

    To find the number of distinct words in either string (repeated words collapse into a single key), you simply count the number of keys in the hash, e.g.

    my $word_count = scalar keys %foo;

    And then it's just a simple calculation.
    The obvious question is how your calculation should look if the two strings contain a different number of words. But I'm sure you can decide that.
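
Putting the steps above together, a minimal sketch might look like the following. The helper names words_of and match_percent are made up for this illustration, and lowercasing the words and stripping punctuation (so that "cat." matches "cat") are assumptions that go beyond the plain split on whitespace shown above. Unlike counting hash keys, it divides by the total number of words in the new text, as the question asks.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a text into lowercase words, dropping punctuation.
# (Assumption: case and punctuation should not affect matching.)
sub words_of {
    my ($text) = @_;
    return map { lc($_) } ($text =~ /\w+/g);
}

# Percentage of words in $new that also occur somewhere in $orig.
sub match_percent {
    my ($orig, $new) = @_;

    my %seen;
    $seen{$_} = 1 for words_of($orig);

    my @new_words = words_of($new);
    return 0 unless @new_words;          # avoid division by zero

    my $matched = grep { $seen{$_} } @new_words;   # count in scalar context
    return 100 * $matched / @new_words;
}

my $orig = "Poet Blake had a milky white cat. He used to call it Pussy.";
my $new  = "Poet Blake had a white cat and used to call it Pussy.";

printf "%.2f%%\n", match_percent($orig, $new);   # prints 91.67%
```

With the two example texts from the question, 11 of the 12 words in the new text are found in the original, which gives the 91.67% quoted there.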

    hope this helps,
    Darren

      Hi McDarren,

      Thanks for your prompt reply. I shall try to solve my problem using the code you have given.

      Regards

Re: Is it possible to find the matching words and the percentage of matching words between two texts?
by rovf (Priest) on Dec 21, 2012 at 09:29 UTC
    How do you want the following cases to be dealt with?

    Case 1:

    $a="a b c d e f"; $b="f e d c b a";

    Case 2:
    $a="a a a a a"; $b="a";
    -- 
    Ronald Fischer <ynnor@mm.st>
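
The two cases above come down to whether word order and repeated words should count. As a rough sketch of the difference, the code below compares the texts in two ways: by distinct words only, and by occurrences (each word counts as many times as it appears in both texts). The function names count_words, shared_unique and shared_occurrences are invented for this example and are not taken from any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min);

# Build a word => number-of-occurrences table for one text.
sub count_words {
    my ($text) = @_;
    my %count;
    $count{$_}++ for split ' ', $text;
    return \%count;
}

# Number of distinct words the two texts share (repeats ignored).
sub shared_unique {
    my ($x, $y) = @_;
    my ($cx, $cy) = (count_words($x), count_words($y));
    my $n = grep { exists $cy->{$_} } keys %$cx;
    return $n;
}

# Number of shared occurrences: a word counts min(times in x, times in y).
sub shared_occurrences {
    my ($x, $y) = @_;
    my ($cx, $cy) = (count_words($x), count_words($y));
    my $total = 0;
    for my $word (keys %$cx) {
        next unless exists $cy->{$word};
        $total += min($cx->{$word}, $cy->{$word});
    }
    return $total;
}

for my $case ([ "a b c d e f", "f e d c b a" ], [ "a a a a a", "a" ]) {
    my ($x, $y) = @$case;
    my $unique = shared_unique($x, $y);
    my $occurs = shared_occurrences($x, $y);
    print "'$x' vs '$y': unique=$unique occurrences=$occurs\n";
}
# Case 1 gives 6 and 6 (order is ignored); Case 2 gives 1 and 1.
```

Neither measure looks at word order, so Case 1 counts as a full match either way. In Case 2 both measures find a single shared word, whereas a loop that checks each word of "a a a a a" for presence in "a" would count five matches.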

      Hi rovf

      Thanks for your quick reply. I need case 2. As a teacher, I want to find out to what extent any two students in my class have copied each other's assignments. The majority of the students (out of 30) are sincere and hard-working, but it appears to me that about four of them often plagiarize their written assignments, i.e. I think they copy from others' assignments without visiting the library or consulting textbooks or research papers. That is why I need a working Perl script that can measure the degree of plagiarism by the students I have doubts about. This is a very personal case. I just want to tell the students that I am not satisfied with their assignments should I detect more than 80% matched words. I don't know whether a Perl script can solve this problem. I want to make those four students more hard-working, not only in their studies but also in other spheres of life.

      Regards

        What you really need is to align the two texts with a "dynamic programming" algorithm. This is a common task in bioinformatics - but the atomic unit there is a single character - and there is a small number of expected characters (usually 4 or 20). You would have to hack it a fair bit to work with an array of words from an essentially unlimited "character set" - but I haven't looked in detail at the code:

        Bio::Tools::dpAlign
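
As a rough illustration of the dynamic programming idea applied to words rather than characters, here is a plain longest-common-subsequence (LCS) computation over two word lists. This is not the algorithm or the interface of Bio::Tools::dpAlign; it is a simpler stand-in technique, and the names lcs_length and words_of are made up for this sketch. Lowercasing and stripping punctuation are also assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a text into lowercase words, dropping punctuation (an assumption).
sub words_of {
    my ($text) = @_;
    return map { lc($_) } ($text =~ /\w+/g);
}

# Length of the longest common subsequence of two word lists (array refs).
# Classic dynamic programming table, keeping only the previous row.
sub lcs_length {
    my ($x, $y) = @_;
    my @prev = (0) x (scalar(@$y) + 1);
    for my $i (1 .. scalar @$x) {
        my @curr = (0);
        for my $j (1 .. scalar @$y) {
            if ($x->[$i - 1] eq $y->[$j - 1]) {
                # Words match: extend the best alignment of both prefixes.
                $curr[$j] = $prev[$j - 1] + 1;
            }
            else {
                # No match: carry over the better of the two neighbours.
                $curr[$j] = $prev[$j] > $curr[$j - 1] ? $prev[$j] : $curr[$j - 1];
            }
        }
        @prev = @curr;
    }
    return $prev[-1];
}

my @orig = words_of("Poet Blake had a milky white cat. He used to call it Pussy.");
my @new  = words_of("Poet Blake had a white cat and used to call it Pussy.");

my $aligned = lcs_length(\@orig, \@new);
printf "%d of %d words aligned (%.2f%%)\n",
    $aligned, scalar @new, 100 * $aligned / @new;
# prints: 11 of 12 words aligned (91.67%)
```

Because the aligned words must appear in the same order in both texts, a reshuffled copy scores low: "a b c d e f" against "f e d c b a" aligns only one word, where a plain hash comparison would report all six as matching.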

        For a quick and dirty solution, I would extend the hash comparison approach to handle single words, word pairs, triplets and maybe longer runs. Also keep searching CPAN; maybe there's something else out there.
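
One way to read the word pairs and triplets suggestion is to hash every run of n consecutive words (sometimes called shingles) and compare those instead of single words. The sketch below does that; shingles and shingle_percent are hypothetical helper names, and the lowercasing and punctuation stripping are assumptions, not something the posts above prescribe.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return a hash ref whose keys are all runs of $n consecutive words in $text.
sub shingles {
    my ($text, $n) = @_;
    my @w = map { lc($_) } ($text =~ /\w+/g);
    my %set;
    for my $i (0 .. $#w - $n + 1) {
        $set{ join ' ', @w[ $i .. $i + $n - 1 ] } = 1;
    }
    return \%set;
}

# Percentage of the distinct n-word runs in $new that also occur in $orig.
sub shingle_percent {
    my ($orig, $new, $n) = @_;
    my ($so, $sn) = (shingles($orig, $n), shingles($new, $n));
    my $total = keys %$sn;
    return 0 unless $total;              # new text shorter than $n words
    my $shared = grep { $so->{$_} } keys %$sn;
    return 100 * $shared / $total;
}

my $orig = "Poet Blake had a milky white cat. He used to call it Pussy.";
my $new  = "Poet Blake had a white cat and used to call it Pussy.";

for my $n (1 .. 3) {
    printf "n=%d: %.2f%%\n", $n, shingle_percent($orig, $new, $n);
}
# prints:
# n=1: 91.67%
# n=2: 72.73%
# n=3: 50.00%
```

Longer runs are harder to match by accident, so the score drops as n grows; which n and which threshold are meaningful for real assignments is a judgment call.
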
Re: Is it possible to find the matching words and the percentage of matching words between two texts?
by MidLifeXis (Prior) on Dec 21, 2012 at 14:06 UTC

    Text::Cloze may provide a tool for you to use. There are also services out there that work in this problem space. Search for cloze plagiarism for some hits related to the process used by the previously mentioned module.

    --MidLifeXis

      Hi MidLifeXis,

      Thanks for providing information about Set-Cloze.

      With Regards

Re: Is it possible to find the matching words and the percentage of matching words between two texts?
by karlgoethebier (Priest) on Dec 21, 2012 at 14:55 UTC
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Set::Scalar;
    # use Data::Dumper;

    my @str1 = qw(Poet Blake had a milky white cat He used to call it Pussy);
    my @str2 = qw(Poet Blake had a white cat and used to call it Pussy);

    my $s1 = Set::Scalar->new(@str1);
    my $s2 = Set::Scalar->new(@str2);

    my $result = $s1 * $s2;    # set intersection
    print $result;

    __END__
    (Blake Poet Pussy a call cat had it to used white)

    See Set::Scalar

    Perhaps this helps. Best regards, Karl

    Update: You wrote that you are a teacher, so I think you can do the math by yourself ;-)

    Regards, Karl, Earl of GuttenPlag.

    P.S.: And diff might be a weapon too...

    «The Crux of the Biscuit is the Apostrophe»

      Hi karlgoethebier,

      Thanks for providing the code. It has solved my problem. Yes, I am a teacher and I shall do the math part.

      With kind regards

Re: Is it possible to find the matching words and the percentage of matching words between two texts?
by karlgoethebier (Priest) on Dec 21, 2012 at 18:18 UTC

    BTW: This issue is very interesting to me because I've been teaching for decades.

    Don't you think it would be a better approach to tell the kids something about doing things right before you call in the forensic Spanish Inquisition for an auto-da-fé?

    Best regards, Karl

    «The Crux of the Biscuit is the Apostrophe»
