http://www.perlmonks.org?node_id=999839

disulfidebond has asked for the wisdom of the Perl Monks concerning the following question:

Hello, this is my first time on Perl Monks. I'm a grad student practicing Bioinformatics for my dissertation. I found out about Rosalind beta bioinformatics problems (http://rosalind.info, no affiliation posted for interested people only) and decided to give them a shot. My problem is I am trying to parse a given string $s1, and find substrings designated by specific beginning and ending sequences. My code compiles, runs, returns strings which appear to be correct, but is "wrong" according to the website. If anyone can help, my question is if anyone can see any bugs in the code, and especially if anyone can suggest ways to optimize my code, I would greatly appreciate it. Thanks!
#!/usr/bin/perl -w # use Data::Dumper; $s1 = 'AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTG +GAATAAGCCTGAATGATCCGAGTAGCATCTCAG'; print "$s1\n"; # copy string $s1 to array @arr_ @arr_ = split ('', $s1); # must test for initial methionine # assign @arr_ array to string $arr_met to test $arr_met = join('', @arr_); $sub_met = substr($arr_met, 0, 3); if ($sub_met =~ m/ATG/) { # if match is TRUE, call &proteintrans to translate # else do nothing my $peptide = proteintrans($arr_met); print "\n$peptide\n"; } else { print "\nno initial match, moving along...\n"; } # for each iteration of loop # shift $j element off front of array to alter reading frame # assign array @arr_ to string $dna_string for processing # call &testformethionine subroutine print "@arr_\n"; $len = @arr_; for ($j = 0; $j < $len; $j++) { shift @arr_; $dna_string = join('', @arr_); # debugging step # print "\n$dna_string\n"; testformethionine($dna_string); } sub testformethionine { my ($dnastr) = @_; # create substring from first 3 amino acids my $sub1 = substr($dnastr, 0, 3); # print "$sub1\n"; # if matches ATG (Met) proceed through translation &proteintrans # else do nothing if ($sub1 =~ m/ATG/) { my $peptide = proteintrans($dnastr); print "$peptide\n"; } else { # debugging line of code # print "move along, polymerase\n"; } } sub proteintrans { my ($dna) = @_; my $protein = ''; # split amino acids into 3 with for loop and concatenate for(my $i = 0; $i < (length($dna) - 2) ; $i +=3) { $protein .= codons(substr($dna, $i, 3)); } # return $protein; # $len_ = length($protein); if ($protein =~ m/_/) { # if string $protein matches stop codon '_' # find match with index() # form a substring before stop codon with substr() # return new substring protein_ $index = index($protein, '_'); $protein_ = substr($protein, 0, $index); return $protein_; } else { # else return protein unmodified return $protein; } } sub codons { # use hash table to assign codons my($codon) = @_; $codon = uc $codon; my (%genetic_code) = ( 'TCA' => 'S', # Serine 'TCC' => 'S', # Serine 'TCG' => 'S', # Serine 'TCT' => 'S', # Serine 'TTC' => 'F', # Phenylalanine 'TTT' => 'F', # Phenylalanine 'TTA' => 'L', # Leucine 'TTG' => 'L', # Leucine 'TAC' => 'Y', # Tyrosine 'TAT' => 'Y', # Tyrosine 'TAA' => '_', # exit 'TAG' => '_', # exit 'TGC' => 'C', # Cysteine 'TGT' => 'C', # Cysteine 'TGA' => '_', # exit 'TGG' => 'W', # Tryptophan 'CTA' => 'L', # Leucine 'CTC' => 'L', # Leucine 'CTG' => 'L', # Leucine 'CTT' => 'L', # Leucine 'GTT' => 'V', # Valine 'GTC' => 'V', # Valine 'GTA' => 'V', # Valine 'GTG' => 'V', # Valine 'GCT' => 'A', # Alanine 'GCC' => 'A', # Alanine 'GCA' => 'A', # Alanine 'GCG' => 'A', # Alanine 'GAT' => 'D', # Aspartic Acid 'GAC' => 'D', # Aspartic Acid 'GAA' => 'E', # Glutamate 'GAG' => 'E', # Glutamate 'GGT' => 'G', # Glycine 'GGC' => 'G', # Glycine 'GGA' => 'G', # Glycine 'GGG' => 'G', # Glycine 'CCA' => 'P', # Phenylalanine 'CCC' => 'P', # Phenylalanine 'CCG' => 'P', # Phenylalanine 'CCT' => 'P', # Phenylalanine 'CAC' => 'H', # Histidine 'CAT' => 'H', # Histidine 'CAA' => 'Q', # Glutamine 'CAG' => 'Q', # Glutamine 'CGA' => 'R', # Arginine 'CGC' => 'R', # Arginine 'CGG' => 'R', # Arginine 'CGT' => 'R', # Arginine 'ATA' => 'I', # Isoleucine 'ATC' => 'I', # Isoleucine 'ATT' => 'I', # Isoleucine 'ATG' => 'M', # Methionine 'ACA' => 'T', # Threonine 'ACC' => 'T', # Threonine 'ACG' => 'T', # Threonine 'ACT' => 'T', # Threonine 'AAC' => 'N', # Asparagine 'AAT' => 'N', # Asparagine 'AAA' => 'K', # Lysine 'AAG' => 'K', # Lysine 'AGC' => 'S', # Serine 'AGT' => 'S', # Serine 'AGA' => 'R', # Arginine 'AGG' => 'R', # Arginine # hash table and subroutine from "Beginning Perl for Bioinformatics" # by James Tisdall. c2001 O'Reilly Press. ); if(exists $genetic_code{$codon}) { return $genetic_code{$codon}; } else { print STDERR "$codon not found\n"; exit; } }

Replies are listed 'Best First'.
Re: Need help with code
by grondilu (Friar) on Oct 19, 2012 at 05:03 UTC

    You should use the 'g' regex modifier imho (see perlretut).

    #!/usr/bin/perl -w use strict; use warnings; use feature qw(say); use constant GENETIC_CODE => { TCA => 'S', # Serine TCC => 'S', # Serine TCG => 'S', # Serine TCT => 'S', # Serine TTC => 'F', # Phenylalanine TTT => 'F', # Phenylalanine TTA => 'L', # Leucine TTG => 'L', # Leucine TAC => 'Y', # Tyrosine TAT => 'Y', # Tyrosine TAA => '_', # exit TAG => '_', # exit TGC => 'C', # Cysteine TGT => 'C', # Cysteine TGA => '_', # exit TGG => 'W', # Tryptophan CTA => 'L', # Leucine CTC => 'L', # Leucine CTG => 'L', # Leucine CTT => 'L', # Leucine GTT => 'V', # Valine GTC => 'V', # Valine GTA => 'V', # Valine GTG => 'V', # Valine GCT => 'A', # Alanine GCC => 'A', # Alanine GCA => 'A', # Alanine GCG => 'A', # Alanine GAT => 'D', # Aspartic Acid GAC => 'D', # Aspartic Acid GAA => 'E', # Glutamate GAG => 'E', # Glutamate GGT => 'G', # Glycine GGC => 'G', # Glycine GGA => 'G', # Glycine GGG => 'G', # Glycine CCA => 'P', # Phenylalanine CCC => 'P', # Phenylalanine CCG => 'P', # Phenylalanine CCT => 'P', # Phenylalanine CAC => 'H', # Histidine CAT => 'H', # Histidine CAA => 'Q', # Glutamine CAG => 'Q', # Glutamine CGA => 'R', # Arginine CGC => 'R', # Arginine CGG => 'R', # Arginine CGT => 'R', # Arginine ATA => 'I', # Isoleucine ATC => 'I', # Isoleucine ATT => 'I', # Isoleucine ATG => 'M', # Methionine ACA => 'T', # Threonine ACC => 'T', # Threonine ACG => 'T', # Threonine ACT => 'T', # Threonine AAC => 'N', # Asparagine AAT => 'N', # Asparagine AAA => 'K', # Lysine AAG => 'K', # Lysine AGC => 'S', # Serine AGT => 'S', # Serine AGA => 'R', # Arginine AGG => 'R', # Arginine }; my $s = 'AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTT +TGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG'; for ($s =~ /ATG (?:[ACGT]{3})*? (?: TAA | TAG | TGA )/gx) { my @codons = /[ACGT]{3}/g; say join '', map GENETIC_CODE->{$_}, @codons; }

    This gives me the following output:

    M_
    MGMTPRLGLESLLE_
    

    PS. Thanks anyway for telling us about this site, it's pretty cool. It's kind like a game, except that you're doing some actual scientific work.

    I solved the problem with some perl6 code:

    use v6; constant DNA-codon = Hash.new: < TTT F CTT L ATT I GTT V TTC F CTC L ATC I GTC V TTA L CTA L ATA I GTA V TTG L CTG L ATG M GTG V TCT S CCT P ACT T GCT A TCC S CCC P ACC T GCC A TCA S CCA P ACA T GCA A TCG S CCG P ACG T GCG A TAT Y CAT H AAT N GAT D TAC Y CAC H AAC N GAC D TAA Stop CAA Q AAA K GAA E TAG Stop CAG Q AAG K GAG E TGT C CGT R AGT S GGT G TGC C CGC R AGC S GGC G TGA Stop CGA R AGA R GGA G TGG W CGG R AGG R GGG G >; sub revc($dna) { $dna.comb.reverse.join.trans: [<A C T G>] => [<T G A C>] } sub orf($dna) { my %match; my @match = gather for $dna, revc $dna { take .match: rx/ ATG [ <[ACGT]>**3 ]*? <before TAA|TAG|TGA> /, :ov +erlap; }; %match{ [~] map { DNA-codon{$_} }, .match: rx/ <[ACGT]>**3 /, :g }++ for @match; return %match.keys; } .say for orf open('rosalind_orf.txt').get;

    The "overlap" option was much useful.

A critique (was Re: Need help with code)
by roboticus (Chancellor) on Oct 19, 2012 at 14:11 UTC

    disulphidebond:

    I've taken your program and reworked it. I noticed a couple things. First, you don't keep your code neatly organized when you're working on it. This makes it harder to understand the code. If you keep your code organized as you write it, you'll find it easier to keep track of what's going on.

    Specifically, you're not paying enough attention to the indentation level of your code. This obscures the logic of your code. You're also not removing unsuccessful code, cluttering the display and further obscuring the logic.

    The second thing I noticed is that your approach to debugging appears to be based on 'treating symptoms' vs. 'curing the disease'. So if one bit of code is doing something unwanted, you undo that bit later, rather than adjusting how you're doing it in the first place.

    For a specific example, let's look at your proteintrans subroutine. Let's clean up the code a bit. First, we'll get rid of the code you commented out, and fix the indentation[1]:

    sub proteintrans { my ($dna) = @_; my $protein = ''; # split amino acids into 3 with for loop and concatenate for (my $i = 0; $i < (length($dna) - 2) ; $i +=3) { $protein .= codons(substr($dna, $i, 3)); } if ($protein =~ m/_/) { # if string $protein matches stop codon '_' # find match with index() # form a substring before stop codon with substr() # return new substring protein_ $index = index($protein, '_'); $protein_ = substr($protein, 0, $index); return $protein_; } else { # else return protein unmodified return $protein; } }

    So in the code you're first building up a list of all the codons by chopping the dna string into 3-character chunks, looking up the codon and appending it to the protein string.

    At this point you have a protein string that's longer than what you really want. Instead of altering how you generate your result, you're patching the result up after the fact: Specifically, you're looking for the ending codon and chopping it and everything following from the end of the string.

    Sure, it works, but you then have to put in an explanation for why you're chopping up the string afterwards. You're also making the computer do a lot more work than necessary, too. It probably won't matter, but if you're doing this sort of search on a *lot* of *long* strings, it's possible that you might want it to be faster[2].

    Let's instead change how we're building the protein: Let's just stop working when we hit the ending codon. Then we don't have to chop up the string afterwards.

    sub proteintrans { my ($dna) = @_; # split amino acids into 3 with for loop and concatenate my $protein = ''; for(my $i = 0; $i < (length($dna) - 2) ; $i +=3) { ++$cnt; my $codon = codons(substr($dna, $i, 3)); # Return our protein if we hit our end marker return $protein if $codon eq '_'; $protein .= $codon; } # else return protein (rest of $dna) return $protein; }

    All I did was to make a temporary variable ($codon) to hold the translated triplet. If it turns out that $codon holds the end token, we simply return the protein string that we've built so far. If we get to the end of the loop, we ran out of codons, so we return all we found.

    Unfortunately, I've got to get to work, so I can't detail all the changes I made (and reasons for them, though much of it was simply formatting it to something easier on my eyes), so I'll just put my modified version here. If you have any questions about why I did something, just ask and I'll reply once I get a chance.

    I hope you find this moderately helpful...

    Notes:

    [1] I used my favorite indentation style. You used at least two different ones (or the indendation just happened to match two different ones I've seen.)

    [2] While playing with your code, I added a counter to the for loop where you build your protein string. Before my changes, it counted 83 iterations, and afterwards it counted 37 iterations. Again, not much difference in this case.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      It helped both my coding style and efficiency, thank you! I have a question, if I only wanted to get the 3-triplet amino acid sequence for each start/stop sequence, and not translate it, I modified the code but can't figure out what I'm doing wrong. The string is passed to the while loop, the conditions are checked to terminate and match for the start of the substring, the substring is passed to the subroutine. Inside the subroutine, it enters a for loop, where it makes a substring $codon, if codon matches a regex termination pattern, it returns $dna_string. Here's the code:
      #!/usr/bin/perl -w use strict; my $s1 = "AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTT +"; print "$s1\n"; my $idx = -1; my $start = 'ATG'; while (my $prefix = substr($s1, ++$idx, 3)) { last if length $prefix < 3; # check for start indicator next unless $prefix eq 'ATG'; my $peptide = proteinseq(substr($s1, $idx)); print "$peptide\n"; } sub proteinseq { my ($dna) = @_; my $dna_seq = ''; for (my $i; $i < length($dna)-2; $i +=3) { my $codon = substr($dna, $i, 3); print "$codon\t"; return $dna_seq if ($codon =~ /TA[GA]|TGA/); $dna_seq .= $codon; } }

        disulphidebond:

        I ran your program, but couldn't really tell what the output meant. Since you have two print statements in there, I couldn't always tell which print statement was generating which string. That made it a bit difficult to interpret. In cases like that, I usually decorate the strings to clarify things to myself. (When debugging it's important to know exactly what's happening. Sometimes I'll use the debugger, but usually, I just put a few print statements to tell me the values of intermediate variables. I usually put prefixes or odd punctuation in them to make things stand out.)

        So I first changed print "$peptide\n" to print "\nPEPTIDE <$peptide>\n"; to put the result on it's own line. When I ran it, I saw this:

        $ perl 1000161.pl AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTT ATG TAG PEPTIDE <ATG> ATG GGG ATG ACC CCG CGA CTT GGA TTA GAG +TCT CTT PEPTIDE <> ATG ACC CCG CGA CTT GGA TTA GAG TCT CTT PEPTIDE <>

        When I look at it like this, I see that it's not returning the dna peptide when it's missing the ending sequence, nor does it include the ending codon when it's present. Adding the ending codon is really simple: The code adds the codon to the end of the peptide after it checks whether it found the ending codon. Swapping the last two lines in the for loop makes the peptide include the ending codon.

        Next, it seems rather anticlimactic to drop the last two peptides simply because the ending codon is missing. It's happening because the for loop runs out of dna, and we're not returning anything after the loop. So I added a return statement at the end of the loop. But to show that we didn't find the end codon, I add "***" to the peptide string to indicate that we hit the end of our dna string.

        $ cat 1000161.pl #!/usr/bin/perl -w use strict; my $s1 = "AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTT +"; print "$s1\n"; my $idx = -1; my $start = 'ATG'; while (my $prefix = substr($s1, ++$idx, 3)) { last if length $prefix < 3; # check for start indicator next unless $prefix eq 'ATG'; my $peptide = proteinseq(substr($s1, $idx)); print "\nPEPTIDE <$peptide>\n"; } sub proteinseq { my ($dna) = @_; my $dna_seq = ''; for (my $i=0; $i < length($dna)-2; $i +=3) { my $codon = substr($dna, $i, 3); print "$codon\t"; $dna_seq .= $codon; # moved up a line # We're done if we found the ending codon return $dna_seq if ($codon =~ /TA[GA]|TGA/); } # We hit end of our dna fragment return $dna_seq . "***"; } $ perl 1000161.pl AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTT ATG TAG PEPTIDE <ATGTAG> ATG GGG ATG ACC CCG CGA CTT GGA TTA GAG +TCT CTT PEPTIDE <ATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTT***> ATG ACC CCG CGA CTT GGA TTA GAG TCT CTT PEPTIDE <ATGACCCCGCGACTTGGATTAGAGTCTCTT***>

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

Re: Need help with code
by McA (Priest) on Oct 19, 2012 at 08:32 UTC

    Hi,

    you call codons in a loop. Every time codons is called you instatiate the hash %genetic_code. This seems unnecessary. Put it outside the function body as a file scoped variable (aka file global variable).

    Best regards
    McA