http://www.perlmonks.org?node_id=118889


in reply to Progressive pattern matching

This problem happens to be of interest to me as well. I think the following code does what you're getting at. It's a little crude, but the best I can do at this point is:
#!/usr/bin/perl use strict; use warnings; my $seq="ASPTFHKLDTPRLAKLJHHDFSDA"; my @pattern=("ST","P","RK","ILVF"); # array of refs to arrays of redundant # residues within the pattern my @patternarray; for (my $e=0;$e<=$#pattern;$e++){ my @elementarray= split (/ */, $pattern[$e]); $patternarray[$e]=\@elementarray; } my $found; my $lastmatchpos; LOOP: until ($found){ # deal with the first residue match as a special case my @resarray=@{$patternarray[0]}; $seq=~ /([@resarray])/gc; die("Sequence does not contain requested motif.\n") unless $1; $found = $1; $lastmatchpos=pos($seq); # all the other residues in the pattern for (my $e=1;$e<=$#patternarray;$e++){ my @resarray=@{$patternarray[$e]}; if ($seq=~ /\G([@resarray])/gc){ $found .= $1; }else{ #reset matching algorithm my $newmatchpos=$lastmatchpos+1; last if ($newmatchpos > length($seq)); pos($newmatchpos); $found=''; next LOOP; } } } print ("$found at $lastmatchpos\n");

This only matches the first occurrance of a motif in a given sequence. It should be possible to extend this to return all the matches with a little work. For use of the m/\G.../gc idiom, see perldoc:perlop.

Hope this helps,
Tim

Update: Sorry, I just re-read the original request and one of the nuances escaped me. To get the script to just print out the most it can match after having matched the initial residue, I think you can just change the else clause to:

}else{ next LOOP; }

Update to the update: To deal with the case where the first x residues in the motif don't match the target sequence, I think you should be able to do something like wrapping the LOOP block in another for loop to iterate over motif residues while looking for an initial match.

Hmmm. I'm still not sure I've quite got what your're looking for. At what point do you call a match significant? I.e. do you want target sequences matching only 4 motif residues or more, for example? Or will just a single matching residue do (which I doubt, but which is of course the easiest case)?