Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

how to find this substring?

by Anonymous Monk
on Jun 08, 2012 at 11:40 UTC ( #975141=perlquestion: print w/ replies, xml ) Need Help??
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow monks!
I have a project in which I need to read some DNA sequences and check if they start with 'ATG' and end in 'TAA', 'TAG' or 'TGA'. If not, I am supposed to find the desired substring within the original sequence.
Imagine the following string:
GTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCCAGGCAAGGGCAGGTAG +CGACCGTACTTTCCGCCCCCGCGAAAATTACCAACCATCTGGTGGCGATGATTGAAAAAACTATCGGCG +GTCAGGATGCTTTGCCGAATATCAGCGATGCCGAACGTATTTTTTCTGACCTGCTCGCAGGACTTGCCA +GCGCGCAGCCGGGATTCCCGCTTGCACGGTTGAAAATGGTTGTCGAACAAGAATTCGCTCAGATCAAAC +ATGTTCTGCATGGTATCAGCCTGCTGGGTCAGTGCCCGGATAGCATCAACGCCGCGCTGATTTGCCGTG +GCGAAAAAATGTCGATCGCGATTATGGCGGGACTCCTGGAGGCGCGTGGACATCGCGTCACGGTGATTG +ATCCGGTAGAAAAATTGCTGGCGGTGGGCCATTACCTTGAATCTACCGTCGATATCGCGGAATCGACTC +GCCGTATCGCCGCCAGCCAGATCCCAGCCGATCACATGATCCTGATGGCGGGCTTTACCGCCGGTAATG +AAAAGGGTGAACTGGTGGTGCTGGGCCGTAATGGTTCCGACTATTCCGCCGCCGTGCTGGCCGCCTGTT +TACGCGCTGACTGCTGTGAAATCTGGACTGACGTCGATGGCGTGTATACCTGTGACCCGCGTCAGGTGC +CGGACGCCAGGCTGCTGAAATCGATGTCCTACCAGGAAGCGATGGAACTCTCTTACTTCGGCGCCAAAG +TCCTTCACCCTCGCACCATTACGCCCATCGCCCAGTTCCAGATCCCCTGTCTGATTAAAAATACCGGTA +ATCCGCAGGCGCCAGGAACGCTGATCGGCGCGTCCAGCGACGATGATAACCTACCAGTTAAAGGGATCT +CTAACCTTAACAACATGGCGATGTTTAGCGTCTCCGGCCCGGGAATGAAAGGGATGATTGGGATGGCGG +CGCGTGTTTTCGCCGCCATGTCTCGCGCCGGGATCTCGGTGGTGCTCATTACCCAGTCCTCCTCTGAGT +ACAGCATCAGTTTCTGTGTGCCGCAGAGTGACTGCGCGCGTGCCCGCCGTGCGATGCAGGATGAGTTCT +ATCTGGAGCTGAAAGAGGGGCTGCTGGAGCCGCTGGCGGTTACGGAGCGGTTGGCGATTATCTCTGTTG +TCGGCGACGGTATGCGCACGCTACGCGGCATTTCAGCGAAATTCTTCGCCGCGCTGGCGCGGGCCAATA +TCAATATCGTGGCGATCGCTCAGGGATCTTCTGAGCGTTCCATTTCTGTGGTGGTGAATAACGACGATG +CCACCACCGGCGTGCGGGTAACGCACCAGATGCTGTTCAATACCGATCAGGTGATTGAAGTGTTTGTCA +TTGGCGTCGGCGGCGTCGGCGGCGCGCTACTGGAACAGCTTAAACGTCAGCAAACCTGGTTGAAGAACA +AGCACATCGATCTACGCGTGTGCGGCGTGGCGAACTCAAAGGCGTTGCTAACCAATGTGCATGGCCTGA +ATCTGGACAACTGGCAGGCGGAACTGGCGCAAGCGAACGCGCCGTTCAATCTGGGACGCTTAATTCGCC +TGGTGAAAGAATATCATCTACTCAATCCGGTGATTGTTGATTGCACCTCCAGTCAGGCGGTGGCCGACC +AGTATGCTGACTTCCTGCGTGAAGGATTCCATGTGGTGACGCCAAACAAGAAAGCGAACACCTCGTCGA +TGGACTACTACCATCAGCTACGTTTCGCCGCCGCGCAATCACGGCGCAAATTCTTGTATGACACCAACG +TCGGCGCCGGTTTGCCGGTAATCGAAAACCTGCAAAACCTGCTGAATGCGGGTGATGAACTGCAAAAAT +TTTCCGGCATTCTTTCCGGGTCGCTCTCTTTTATTTTCGGTAAACTGGAAGAGGGGATGAGTCTCTCAC +AGGCGACCGCCCTGGCGCGCGAGATGGGCTATACCGAACCCGATCCGCGCGACGATCTTTCCGGTATGG +ATGTGGCGCGGAAACTGTTGATCCTCGCCCGCGAGACGGGCCGCGAGCTGGAGCTTTCCGATATTGTGA +TTGAACCGGTGTTGCCGGACGAGTTTGACGCCTCCGGCGATGTGACCACCTTTATGGCGCATCTGCCGC +AGCTTGACGACGCGTTTGCCGCCCGTGTGGCGAAAGCTCGTGATGAAGGTAAGGTATTGCGCTATGTGG +GCAATATCGAAGAGGATGGCGTGTGCCGCGTGAAGATTGCCGAAGTTGATGGTAACGATCCGCTCTTCA +AAGTGAAAAACGGTTAAGAAAACGCGCTGGCGTTCTACAGCCACTATTATCAGCCCTTGCCGTTGGTGC +TGCGCGGCTACGGCGCAGGCAATGATGTGACGGCGGCGGGCGTGTTTGCCGATCTGTTACGGACCCTCT +CATGGAAGTTAGGAGTT
where 'ATG' and 'TAA' are somewhere within it but now in the start and end positions, as they should.
For finding the correct start position, I suppose I could do:
$seq='GTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCCAGGCAAGGGC +AGGTAGCGACCGTACTTTCCGCCCCCGCGAAAATTACCAACCATCTGGTGGCGATGATTGAAAAAACTA +TCGGCGGTCAGGATGCTTTGCCGAATATCAGCGATGCCGAACGTATTTTTTCTGACCTGCTCGCAGGAC +TTGCCAGCGCGCAGCCGGGATTCCCGCTTGCACGGTTGAAAATGGTTGTCGAACAAGAATTCGCTCAGA +TCAAACATGTTCTGCATGGTATCAGCCTGCTGGGTCAGTGCCCGGATAGCATCAACGCCGCGCTGATTT +GCCGTGGCGAAAAAATGTCGATCGCGATTATGGCGGGACTCCTGGAGGCGCGTGGACATCGCGTCACGG +TGATTGATCCGGTAGAAAAATTGCTGGCGGTGGGCCATTACCTTGAATCTACCGTCGATATCGCGGAAT +CGACTCGCCGTATCGCCGCCAGCCAGATCCCAGCCGATCACATGATCCTGATGGCGGGCTTTACCGCCG +GTAATGAAAAGGGTGAACTGGTGGTGCTGGGCCGTAATGGTTCCGACTATTCCGCCGCCGTGCTGGCCG +CCTGTTTACGCGCTGACTGCTGTGAAATCTGGACTGACGTCGATGGCGTGTATACCTGTGACCCGCGTC +AGGTGCCGGACGCCAGGCTGCTGAAATCGATGTCCTACCAGGAAGCGATGGAACTCTCTTACTTCGGCG +CCAAAGTCCTTCACCCTCGCACCATTACGCCCATCGCCCAGTTCCAGATCCCCTGTCTGATTAAAAATA +CCGGTAATCCGCAGGCGCCAGGAACGCTGATCGGCGCGTCCAGCGACGATGATAACCTACCAGTTAAAG +GGATCTCTAACCTTAACAACATGGCGATGTTTAGCGTCTCCGGCCCGGGAATGAAAGGGATGATTGGGA +TGGCGGCGCGTGTTTTCGCCGCCATGTCTCGCGCCGGGATCTCGGTGGTGCTCATTACCCAGTCCTCCT +CTGAGTACAGCATCAGTTTCTGTGTGCCGCAGAGTGACTGCGCGCGTGCCCGCCGTGCGATGCAGGATG +AGTTCTATCTGGAGCTGAAAGAGGGGCTGCTGGAGCCGCTGGCGGTTACGGAGCGGTTGGCGATTATCT +CTGTTGTCGGCGACGGTATGCGCACGCTACGCGGCATTTCAGCGAAATTCTTCGCCGCGCTGGCGCGGG +CCAATATCAATATCGTGGCGATCGCTCAGGGATCTTCTGAGCGTTCCATTTCTGTGGTGGTGAATAACG +ACGATGCCACCACCGGCGTGCGGGTAACGCACCAGATGCTGTTCAATACCGATCAGGTGATTGAAGTGT +TTGTCATTGGCGTCGGCGGCGTCGGCGGCGCGCTACTGGAACAGCTTAAACGTCAGCAAACCTGGTTGA +AGAACAAGCACATCGATCTACGCGTGTGCGGCGTGGCGAACTCAAAGGCGTTGCTAACCAATGTGCATG +GCCTGAATCTGGACAACTGGCAGGCGGAACTGGCGCAAGCGAACGCGCCGTTCAATCTGGGACGCTTAA +TTCGCCTGGTGAAAGAATATCATCTACTCAATCCGGTGATTGTTGATTGCACCTCCAGTCAGGCGGTGG +CCGACCAGTATGCTGACTTCCTGCGTGAAGGATTCCATGTGGTGACGCCAAACAAGAAAGCGAACACCT +CGTCGATGGACTACTACCATCAGCTACGTTTCGCCGCCGCGCAATCACGGCGCAAATTCTTGTATGACA +CCAACGTCGGCGCCGGTTTGCCGGTAATCGAAAACCTGCAAAACCTGCTGAATGCGGGTGATGAACTGC +AAAAATTTTCCGGCATTCTTTCCGGGTCGCTCTCTTTTATTTTCGGTAAACTGGAAGAGGGGATGAGTC +TCTCACAGGCGACCGCCCTGGCGCGCGAGATGGGCTATACCGAACCCGATCCGCGCGACGATCTTTCCG +GTATGGATGTGGCGCGGAAACTGTTGATCCTCGCCCGCGAGACGGGCCGCGAGCTGGAGCTTTCCGATA +TTGTGATTGAACCGGTGTTGCCGGACGAGTTTGACGCCTCCGGCGATGTGACCACCTTTATGGCGCATC +TGCCGCAGCTTGACGACGCGTTTGCCGCCCGTGTGGCGAAAGCTCGTGATGAAGGTAAGGTATTGCGCT +ATGTGGGCAATATCGAAGAGGATGGCGTGTGCCGCGTGAAGATTGCCGAAGTTGATGGTAACGATCCGC +TCTTCAAAGTGAAAAACGGTTAAGAAAACGCGCTGGCGTTCTACAGCCACTATTATCAGCCCTTGCCGT +TGGTGCTGCGCGGCTACGGCGCAGGCAATGATGTGACGGCGGCGGGCGTGTTTGCCGATCTGTTACGGA +CCCTCTCATGGAAGTTAGGAGTT'; if($seq=~/.*(ATG.*)/) {$substring_with_correct_start=$1;}

What could I do to find the correct ending of the string, i.e., to end in either 'TAA', 'TAG' or 'TGA'? This part is troubling me, since there could me more than one correct codes as endings...

Comment on how to find this substring?
Select or Download Code
Re: how to find this substring?
by roboticus (Canon) on Jun 08, 2012 at 11:53 UTC

    Something like this should work:

    if ($seq =~ /(ATG.*(TAA|TAG|TGA))/) { $subseq = $1; }

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: how to find this substring?
by ckj (Chaplain) on Jun 08, 2012 at 12:24 UTC
    Assuming $seq as the string provided by you. Try this:
    $seq='GTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGG..........GGAGTT'; $a = 'ATG'; $b = '(TAA|TAG|TGA)'; while($seq =~ /($a(.*?)$b)/g) { print $1."\n\n\n"; }

      From the OP:

      I need to read some DNA sequences and check if they start with 'ATG' and end in 'TAA', 'TAG' or 'TGA'

      Given the sequence:

      ATGXXXTAAXXXTAA

      your code outputs:

      ATGXXXTAA

      whereas it should print the whole string (or 'true', or whatever).

        (o.O)

        So, why can't you change line 6 to be print "whatever\n\n";?

Re: how to find this substring?
by temporal (Pilgrim) on Jun 08, 2012 at 13:24 UTC

    Oops, look like this got solved already.

    Losta genetics-related posts!

    Strange things are afoot at the Circle-K.

Re: how to find this substring?
by RichardK (Priest) on Jun 08, 2012 at 14:32 UTC

    I wonder if index & rindex might do a better job, particularly if the strings to search are large.

    so something along these lines

    my $start = index ($dna,'ATG'); my $end1 = rindex($dna,'TAA'); my $end2 = rindex($dna,'TAG'); ...

    Obviously, you'd need a little logic to handle the different stop strings, but I've left that as an exercise for the reader ;)

      Index is in my testing 10x faster than the regexp of the same search. So 1 or 2 indexes and 2-3 control branches and a substr is faster than the regexp that does the same. Always use index where possible.

        No. Only use index where the code is clearer or there is a clear and necessary speed advantage.

        In other words: in general write for clarity (and thus correctness and maintainability) first. If speed is an issue focus on the bang for buck areas of the code using profiling and address bottle necks as they become apparent and significant.

        True laziness is hard work

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://975141]
Approved by muba
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (11)
As of 2014-08-20 19:43 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The best computer themed movie is:











    Results (122 votes), past polls