The background here is this story, which was on slashdot as well as NPR and major wire services yesterday; for those without the time to read it, a professor caught up to 120 students (out of around 500) cheating on term assignments by comparing their electronically-submitted essays for 6 or more word phrases that were repeated in multiple papers. Those found cheating despite the school's honor code will be either denied their diplomas or have their diplomas revoked if they're already graduated.
While the story is somewhat chiling, I also wonder exactly how the professor approached the programming part of this. I'm very much doubting he used perl... :-)
Given two English text strings, $a and $b, and
two integers $m and $n, 0 < $m <= $n. Both $a and $b have been stripped of punctuation and converted to lower case, leaving all characters as either ('a'..'z') or the space ' '.
Find the perl golf solution (fewest # of characters in code) that returns a list of phrases with at least $m but no more than $n words that are in both $a and $b.
update changed "$m < $n" to "$m <= $n"; shouldn't affect the golf solution, but makes sense if you want to find repeated phrases of only one size. Eg, if $m=$n=1, you could find all single words in common with both strings.
Dr. Michael K. Neylon - mneylon-pm@masemware.com
||
"You've left the lens cap of your mind on again, Pinky" - The Brain
2001-05-10 Edit by Corion: Fixed title
Re (tilly) 1: Golf (Inspired): Repeated Phrases
by tilly (Archbishop) on May 10, 2001 at 17:34 UTC
|
I find this question unclear. Suppose that $m is 2 and $n is 3. Given that the phrase "this is strange" appears, do you want to see ("this is strange") returned or ("this is strange", "this is", "is strange")?
From both a programming and a spec problem I like only looking for phrases of a given size (which was what was done in the story). From that you can derive your solution through:
sub p{
($m,$n,@_)=@_;map{f($_,@_)}$m..$n
}
An incidental note. The mathematician in me reads your phrase "a list of" and wants to offer the following cheat as a "solution":
sub p{()}
(It is a list, and everything in it is a phrase of the right size that appears in both lists!)
UPDATE
And, of course, I can show that all golf can be improved with the utterly ridiculous:
sub p{}
| [reply] [d/l] [select] |
|
| [reply] |
Re: Golf (Inspired): Repeated Phrases
by chipmunk (Parson) on May 10, 2001 at 19:51 UTC
|
Here's a solution to start things off. I assume that runs of spaces in the strings have been squashed (tr/ //s;). For this solution, a phrase will appear in the results more than once if it occurs in the first string more than once and the second string at least once.
sub repeated {
my@p;$_=pop()-1 for$m,$n;
$_[1]=~($~=$1.$2)&&push@p,$~while$_[0]=~/(\w+)(?=(( \w+){$n,$m}))/g;
@p
}
95 characters.
Update: Shortened with the help of \b:
sub repeated {
my@p;$m=pop;$n=pop;
$_[1]=~$1&&push@p,$&while$_[0]=~/\b(?=((\w+\b ?){$n,$m}))/g;
@p
}
81 characters.
Update: Gotta remember to reset the array when using push! Thanks MeowChow! Added 5 characters to each of the above.
Update: MeowChow was actually pointing out a different problem... my code doesn't properly find phrases with less than the max number of words... I'm going to redo this completely and start a new sub thread when I get a working solution. (Darn you to heck, MeowChow! :D ) | [reply] [d/l] [select] |
|
I don't think this works quite right. Compare the output from the following two runs:
@l = repeated('this dog a round', 'this dog a pound', 1, 3);
@l = repeated('this dog a round', 'this dog a pound', 1, 4);
UPDATE: Here's an 85-byte version that returns all the matches:
sub repeated {
($s,$_,$m,$n)=@_;my@p;for$i($m..$n){$s=~$1&&push@p,$&while/\b(?=((\w+\
+b ?){$i}))/g}@p
}
MeowChow
s aamecha.s a..a\u$&owag.print | [reply] [d/l] [select] |
Re: Golf (Inspired): Repeated Phrases
by chipmunk (Parson) on May 10, 2001 at 21:01 UTC
|
Okay.... I was quite close with my original solution, but failed to fully test it. MeowChow pointed out the error in my approach; {$n,$m} will only match once, whereas I need it to match the whole range from $n to $m.
So, I just needed to add another loop, and voila:
sub repeated {
($_,$t,$n,$m,@p)=@_;
for$i($n..$m){$t=~$1&&push@p,$&while/\b(?=((\w+\b ?){$i}))/g}@p
}
So that's 83 characters long, with help from MeowChow's solution. | [reply] [d/l] [select] |
Re: Golf (Inspired): Repeated Phrases
by Corion (Patriarch) on May 10, 2001 at 20:05 UTC
|
My solution is not really a competitor, since I'm not that
familiar with Perl Golf customs, so bear with me.
use strict;
sub f {
my ($a,$b,$m,$n) = @_;
my (@r) = ();
my $l = " $a! $b";
while($l =~ s/(.*?)(( \w+){$m,$n} )(.*!.*\2.*)/ $1 =$4/) {
push @r, $2;
};
@r
};
Why I'm posting this at all is, that a previous version
exhibited unanticipated (by me, that is) behaviour :
# Previous version :
my $l = " $a! $b";
while($l =~ s/(.*)(( \w+){$m,$n} )(.*!.*\2.*)/ $1 =$4/) {
push @r, $2;
};
@r
That version only found the $m-word
matches, because the greedy (.*) at the beginning
forced the RE engine to work its way backwards through
the string. As soon as I sacrificed another character,
this problem went away. I haven't given the golfing much
thought (at work, I can only think about concepts, not
byte-fiddle with REs without raising suspicions ... ). | [reply] [d/l] [select] |
|
|