Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"

Re: Progressive pattern matching

by tfrayner (Curate)
on Oct 17, 2001 at 14:11 UTC ( #119349=note: print w/replies, xml ) Need Help??

in reply to Progressive pattern matching

Okay, I think understand now. I also think that the following code should do what you want. It takes a sequence ($seq) and a motif (@motif, which may be degenerate) and finds all the matches of greater than 3 residues.

Again, this may be a suboptimal solution and I can't shake the nagging feeling that there's a simpler way.

Of course, that describes all my perl experiance to date :-)

Apologies to all and sundry for once again descending into biological jargon. It's the only way I can get my head round this stuff...

#!/usr/bin/perl -w use strict; use warnings; my $seq="APKLGIYSPRIGLYHFHKLDTPRLGAKLJHHDGFYSDA"; my @motif=("ST","P","RK","ILVF","G","ILVFM","Y"); # set up motif array of arrays my @motifarray; for (my $e=0;$e<=$#motif;$e++){ my @elementarray= split (/ */, $motif[$e]); $motifarray[$e]=\@elementarray; } my $mstartpos = 0; # starting point within motif my $success = 0; # cycle through starting motif residues ("ST","P" etc.) MOTIFRES: while ($mstartpos+1 < $#motif){ # find all matches for a given starting motif residue my $test=$seq; my $lastmatchpos=0; while ($lastmatchpos < length($seq)){ my $found=''; # deal with the first 3 residue matches as a special case my @r0=@{$motifarray[$mstartpos]}; my @r1=@{$motifarray[$mstartpos+1]}; my @r2=@{$motifarray[$mstartpos+2]}; if ($test=~ /([@r0])(?=[@r1][@r2])/gc){ $found = $1; $lastmatchpos=pos($test); } # next motif starting residue if no further matches found unless ($found){ $mstartpos++; next MOTIFRES; } # get all the other residues in the motif for (my $e=$mstartpos+1;$e<=$#motifarray;$e++){ my @rn=@{$motifarray[$e]}; if ($test=~ /\G([@rn])/gc){ $found .= $1; } } # print out what we've got so far $success++; print ("$found at $lastmatchpos\n"); } # repeat, using the next motif residue as the new starting point $mstartpos++; } die ("No matches found.\n") unless ($success); print ("Total number of matches (nested or otherwise): $success\n");
Have fun,

Update: Minor bugfix; also removed a couple of superfluous and misconceived lines to tidy it up a bit.

Replies are listed 'Best First'.
Re: Re: Progressive pattern matching
by blakem (Monsignor) on Oct 17, 2001 at 15:33 UTC
    Ugh, its too late to finish my golf game....
    warning: not necessarily portable since it uses 'glob'.
    #!/usr/bin/perl -w use strict; my $seq="APKLGIYSPRIGLYHFHKLDTPRLGAKLJHHDGFYSDA"; my @motif=("ST","P","RK","ILVF","G","ILVFM","Y"); my @a = my @b = my @c = @motif; my %m; while(my$r=!$|++&&\@a||pop@a&&\@a||shift@b&&\@b){for(glob('{'. join('',map{'{'.(join',',split(//)).'}'}@$r).'}')){length($_)> 2&&$seq=~/$_/&&$m{$_}++;}}for(sort{length($b)<=>length($a)}keys %m){print"$_ at ",index($seq,$_)+1,"\n"} =OUTPUT SPRIGLY at 8 PRIGLY at 9 PKLGIY at 2 SPRIGL at 8 KLGIY at 3 RIGLY at 10 SPRIG at 8 TPRLG at 21 TPRL at 21 IGLY at 11 LGIY at 4 SPRI at 8 GFY at 33 GIY at 5 TPR at 21 GLY at 12 SPR at 8


      Heh - nice.

      I knew there must be a shorter solution. I won't comment on whether it's simpler, but I know which I think is more readable :-). However, I note that your solution is in fact technically a little more complete than mine. My script misses substrings that aren't at the end of matches (i.e. will match GLY in SPRIGLY but not SPR). Neither script matches PRI, RIG or IGL in the above.

      Although why you might want to do this (save for the sake of programming elegance) I'm not sure :-P

      About my only contribution (since it's going to take me a little while to fully comprehend the golf) is that the @c array appears to be dispensable.


        It took me longer than I thought to get this working last night... I therefore had to cut the actual golfing part short. Here is an updated version after spending a few minutes trimming chars.
        #!/usr/bin/perl -w use strict; my $s = "APKLGIYSPRIGLYHFHKLDTPRLGAKLJHHDGFYSDA"; my @m = ( "ST", "P", "RK", "ILVF", "G", "ILVFM", "Y" ); @"=@'=@m;my%m;while($"=!$:--&&\@"||pop@"&&\@"||shift@'&&\@'){length($_ +)>2&& $s=~$_&&$m{$_}++for(glob'{'.join('',map{'{'.(join',',split(//)).'}'}@{ +$"}). '}')}for(sort{length($b)-length($a)}keys%m){print"$_ at ",index($s,$_) ++1,$/} =OUTPUT SPRIGLY at 8 PRIGLY at 9 PKLGIY at 2 SPRIGL at 8 KLGIY at 3 RIGLY at 10 SPRIG at 8 TPRLG at 21 TPRL at 21 IGLY at 11 LGIY at 4 SPRI at 8 GFY at 33 GIY at 5 TPR at 21 GLY at 12 SPR at 8
        Basically, we loop through a set of munged @motif arrays transforming them into glob strings which happen to spit out the various substrings we're looking for.


Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://119349]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (5)
As of 2020-05-25 15:08 GMT
Find Nodes?
    Voting Booth?
    If programming languages were movie genres, Perl would be:

    Results (146 votes). Check out past polls.