A critique (was Re: Need help with code)

I've taken your program and reworked it. I noticed a couple things. First, you don't keep your code neatly organized when you're working on it. This makes it harder to understand the code. If you keep your code organized as you write it, you'll find it easier to keep track of what's going on.

Specifically, you're not paying enough attention to the indentation level of your code. This obscures the logic of your code. You're also not removing unsuccessful code, cluttering the display and further obscuring the logic.

The second thing I noticed is that your approach to debugging appears to be based on 'treating symptoms' vs. 'curing the disease'. So if one bit of code is doing something unwanted, you undo that bit later, rather than adjusting how you're doing it in the first place.

For a specific example, let's look at your proteintrans subroutine. Let's clean up the code a bit. First, we'll get rid of the code you commented out, and fix the indentation^[1]:

sub proteintrans {
    my ($dna) = @_;

    my $protein = '';
    # split amino acids into 3 with for loop and concatenate
    for (my $i = 0; $i < (length($dna) - 2) ; $i +=3) {
        $protein .= codons(substr($dna, $i, 3));
    }

    if ($protein =~ m/_/) {
        # if string $protein matches stop codon '_'
        # find match with index()
        # form a substring before stop codon with substr()
        # return new substring protein_
        $index = index($protein, '_');
        $protein_ = substr($protein, 0, $index);
        return $protein_;
    }
    else {
        # else return protein unmodified
        return $protein;
    }
}
[download]

So in the code you're first building up a list of all the codons by chopping the dna string into 3-character chunks, looking up the codon and appending it to the protein string.

At this point you have a protein string that's longer than what you really want. Instead of altering how you generate your result, you're patching the result up after the fact: Specifically, you're looking for the ending codon and chopping it and everything following from the end of the string.

Sure, it works, but you then have to put in an explanation for why you're chopping up the string afterwards. You're also making the computer do a lot more work than necessary, too. It probably won't matter, but if you're doing this sort of search on a *lot* of *long* strings, it's possible that you might want it to be faster^[2].

Let's instead change how we're building the protein: Let's just stop working when we hit the ending codon. Then we don't have to chop up the string afterwards.

sub proteintrans {
    my ($dna) = @_;

    # split amino acids into 3 with for loop and concatenate
    my $protein = '';
    for(my $i = 0; $i < (length($dna) - 2) ; $i +=3) {
        ++$cnt;
        my $codon = codons(substr($dna, $i, 3));
        # Return our protein if we hit our end marker
        return $protein if $codon eq '_';
        $protein .= $codon;
    }
    # else return protein (rest of $dna)
    return $protein;
}
[download]

All I did was to make a temporary variable ($codon) to hold the translated triplet. If it turns out that $codon holds the end token, we simply return the protein string that we've built so far. If we get to the end of the loop, we ran out of codons, so we return all we found.

Unfortunately, I've got to get to work, so I can't detail all the changes I made (and reasons for them, though much of it was simply formatting it to something easier on my eyes), so I'll just put my modified version here. If you have any questions about why I did something, just ask and I'll reply once I get a chance.

I hope you find this moderately helpful...

#!/usr/bin/perl -w
use strict;
use warnings;

# DNA to scan...
my $s1 = 'AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTT
+TTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG';

my %genetic_code = (
  'GCA'=>'A', 'GCC'=>'A', 'GCG'=>'A', 'GCT'=>'A',# Alanine
  'AGA'=>'R', 'AGG'=>'R', 'CGA'=>'R', 'CGC'=>'R',# Arginine
  'CGG'=>'R', 'CGT'=>'R',                        # Arginine
  'AAC'=>'N', 'AAT'=>'N',                        # Asparagine
  'GAC'=>'D', 'GAT'=>'D',                        # Aspartic Acid
  'TGC'=>'C', 'TGT'=>'C',                        # Cysteine
  'CAA'=>'Q', 'CAG'=>'Q', 'GAA'=>'E', 'GAG'=>'E',# Glutamate
  'GGA'=>'G', 'GGC'=>'G', 'GGG'=>'G', 'GGT'=>'G',# Glycine
  'CAC'=>'H', 'CAT'=>'H',                        # Histidine
  'ATA'=>'I', 'ATC'=>'I', 'ATT'=>'I',            # Isoleucine
  'CTA'=>'L', 'CTC'=>'L', 'CTG'=>'L', 'CTT'=>'L',# Leucine
  'TTA'=>'L', 'TTG'=>'L',                        # Leucine
  'AAA'=>'K', 'AAG'=>'K',                        # Lysine
  'ATG'=>'M',                                    # Methionine
  'CCA'=>'P', 'CCC'=>'P', 'CCG'=>'P', 'CCT'=>'P',# Phynyalanine
  'TTC'=>'F', 'TTT'=>'F',                        # Phynyalanine
  'AGC'=>'S', 'AGT'=>'S', 'TCA'=>'S', 'TCC'=>'S',# Serine
  'TCG'=>'S', 'TCT'=>'S',                        # Serine
  'ACA'=>'T', 'ACC'=>'T', 'ACG'=>'T', 'ACT'=>'T',# Serine
  'TGG'=>'W',                                    # Tryptophan
  'TAC'=>'Y', 'TAT'=>'Y',                        # Tyrosine
  'GTA'=>'V', 'GTC'=>'V', 'GTG'=>'V', 'GTT'=>'V',# Valine
  'TAA'=>'_', 'TAG'=>'_', 'TGA'=>'_',            # exit
);

print "DNA: $s1\n";
my $start = 'M';
my $end   = '_';

my $idx = -1;
while (my $prefix = substr($s1, ++$idx, 3)) {
    last if length $prefix < 3;

    # Check for the start indicator.
    next unless codons($prefix) eq 'M';

    my $peptide = proteintrans($end, substr($s1, $idx));
    next if !defined $peptide;
    print "\n\nIDX $idx: $peptide\n";
    dbg_dump($s1, $idx, $peptide);
}

sub dbg_dump {
    my ($dna, $idx, $peptide) = @_;
    print $dna, "\n", " " x ($idx-1), '<';
    print join("  ", split //, $peptide), ">\n";
}

sub proteintrans {
    my ($end, $dna) = @_;

    my $protein = '';
    for (my $i; $i<length($dna)-2; $i+=3) {
        my $codon = codons(substr($dna, $i, 3));
        return $protein . "? undefined codon '$triplet'" if ! defined 
+$codon;
        return $protein if $codon eq $end;
        $protein .= $codon;
    }
    return $codon;
}

sub codons {
    my $codon = uc shift;
    return exists $genetic_code{$codon}
                ? $genetic_code{$codon}
                : undef;
}
[download]

Notes:

[1] I used my favorite indentation style. You used at least two different ones (or the indendation just happened to match two different ones I've seen.)

[2] While playing with your code, I added a counter to the for loop where you build your protein string. Before my changes, it counted 83 iterations, and afterwards it counted 37 iterations. Again, not much difference in this case.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Comment on A critique (was Re: Need help with code) Select or Download Code

Replies are listed 'Best First'.
Re: A critique (was Re: Need help with code) by disulfidebond (Novice) on Oct 21, 2012 at 00:31 UTC
It helped both my coding style and efficiency, thank you! I have a question, if I only wanted to get the 3-triplet amino acid sequence for each start/stop sequence, and not translate it, I modified the code but can't figure out what I'm doing wrong. The string is passed to the while loop, the conditions are checked to terminate and match for the start of the substring, the substring is passed to the subroutine. Inside the subroutine, it enters a for loop, where it makes a substring $codon, if codon matches a regex termination pattern, it returns $dna_string. Here's the code: #!/usr/bin/perl -w use strict; my $s1 = "AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTT +"; print "$s1\n"; my $idx = -1; my $start = 'ATG'; while (my $prefix = substr($s1, ++$idx, 3)) { last if length $prefix < 3; # check for start indicator next unless $prefix eq 'ATG'; my $peptide = proteinseq(substr($s1, $idx)); print "$peptide\n"; } sub proteinseq { my ($dna) = @_; my $dna_seq = ''; for (my $i; $i < length($dna)-2; $i +=3) { my $codon = substr($dna, $i, 3); print "$codon\t"; return $dna_seq if ($codon =~ /TA[GA]\|TGA/); $dna_seq .= $codon; } } [download]	[reply] [d/l]
Re^2: A critique (was Re: Need help with code) by roboticus (Chancellor) on Oct 21, 2012 at 06:05 UTC
disulphidebond: I ran your program, but couldn't really tell what the output meant. Since you have two print statements in there, I couldn't always tell which print statement was generating which string. That made it a bit difficult to interpret. In cases like that, I usually decorate the strings to clarify things to myself. (When debugging it's important to know exactly what's happening. Sometimes I'll use the debugger, but usually, I just put a few print statements to tell me the values of intermediate variables. I usually put prefixes or odd punctuation in them to make things stand out.) So I first changed `print "$peptide\n"` to `print "\nPEPTIDE <$peptide>\n";` to put the result on it's own line. When I ran it, I saw this: `$ perl 1000161.pl AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTT ATG TAG PEPTIDE <ATG> ATG GGG ATG ACC CCG CGA CTT GGA TTA GAG +TCT CTT PEPTIDE <> ATG ACC CCG CGA CTT GGA TTA GAG TCT CTT PEPTIDE <>` [download] When I look at it like this, I see that it's not returning the dna peptide when it's missing the ending sequence, nor does it include the ending codon when it's present. Adding the ending codon is really simple: The code adds the codon to the end of the peptide after it checks whether it found the ending codon. Swapping the last two lines in the for loop makes the peptide include the ending codon. Next, it seems rather anticlimactic to drop the last two peptides simply because the ending codon is missing. It's happening because the for loop runs out of dna, and we're not returning anything after the loop. So I added a return statement at the end of the loop. But to show that we didn't find the end codon, I add "*" to the peptide string to indicate that we hit the end of our dna string. $ cat 1000161.pl #!/usr/bin/perl -w use strict; my $s1 = "AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTT +"; print "$s1\n"; my $idx = -1; my $start = 'ATG'; while (my $prefix = substr($s1, ++$idx, 3)) { last if length $prefix < 3; # check for start indicator next unless $prefix eq 'ATG'; my $peptide = proteinseq(substr($s1, $idx)); print "\nPEPTIDE <$peptide>\n"; } sub proteinseq { my ($dna) = @_; my $dna_seq = ''; for (my $i=0; $i < length($dna)-2; $i +=3) { my $codon = substr($dna, $i, 3); print "$codon\t"; $dna_seq .= $codon; # moved up a line # We're done if we found the ending codon return $dna_seq if ($codon =~ /TA[GA]\|TGA/); } # We hit end of our dna fragment return $dna_seq . ""; } $ perl 1000161.pl AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTT ATG TAG PEPTIDE <ATGTAG> ATG GGG ATG ACC CCG CGA CTT GGA TTA GAG +TCT CTT PEPTIDE <ATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTT> ATG ACC CCG CGA CTT GGA TTA GAG TCT CTT PEPTIDE <ATGACCCCGCGACTTGGATTAGAGTCTCTT> [download] ...roboticus When your only tool is a hammer, all problems look like your thumb.*	[reply] [d/l] [select]

In Section Seekers of Perl Wisdom