in reply to Need help with code
I've taken your program and reworked it. I noticed a couple things. First, you don't keep your code neatly organized when you're working on it. This makes it harder to understand the code. If you keep your code organized as you write it, you'll find it easier to keep track of what's going on.
Specifically, you're not paying enough attention to the indentation level of your code. This obscures the logic of your code. You're also not removing unsuccessful code, cluttering the display and further obscuring the logic.
The second thing I noticed is that your approach to debugging appears to be based on 'treating symptoms' vs. 'curing the disease'. So if one bit of code is doing something unwanted, you undo that bit later, rather than adjusting how you're doing it in the first place.
For a specific example, let's look at your proteintrans subroutine. Let's clean up the code a bit. First, we'll get rid of the code you commented out, and fix the indentation[1]:
sub proteintrans { my ($dna) = @_; my $protein = ''; # split amino acids into 3 with for loop and concatenate for (my $i = 0; $i < (length($dna) - 2) ; $i +=3) { $protein .= codons(substr($dna, $i, 3)); } if ($protein =~ m/_/) { # if string $protein matches stop codon '_' # find match with index() # form a substring before stop codon with substr() # return new substring protein_ $index = index($protein, '_'); $protein_ = substr($protein, 0, $index); return $protein_; } else { # else return protein unmodified return $protein; } }
So in the code you're first building up a list of all the codons by chopping the dna string into 3-character chunks, looking up the codon and appending it to the protein string.
At this point you have a protein string that's longer than what you really want. Instead of altering how you generate your result, you're patching the result up after the fact: Specifically, you're looking for the ending codon and chopping it and everything following from the end of the string.
Sure, it works, but you then have to put in an explanation for why you're chopping up the string afterwards. You're also making the computer do a lot more work than necessary, too. It probably won't matter, but if you're doing this sort of search on a *lot* of *long* strings, it's possible that you might want it to be faster[2].
Let's instead change how we're building the protein: Let's just stop working when we hit the ending codon. Then we don't have to chop up the string afterwards.
sub proteintrans { my ($dna) = @_; # split amino acids into 3 with for loop and concatenate my $protein = ''; for(my $i = 0; $i < (length($dna) - 2) ; $i +=3) { ++$cnt; my $codon = codons(substr($dna, $i, 3)); # Return our protein if we hit our end marker return $protein if $codon eq '_'; $protein .= $codon; } # else return protein (rest of $dna) return $protein; }
All I did was to make a temporary variable ($codon) to hold the translated triplet. If it turns out that $codon holds the end token, we simply return the protein string that we've built so far. If we get to the end of the loop, we ran out of codons, so we return all we found.
Unfortunately, I've got to get to work, so I can't detail all the changes I made (and reasons for them, though much of it was simply formatting it to something easier on my eyes), so I'll just put my modified version here. If you have any questions about why I did something, just ask and I'll reply once I get a chance.
I hope you find this moderately helpful...
#!/usr/bin/perl -w use strict; use warnings; # DNA to scan... my $s1 = 'AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTT +TTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG'; my %genetic_code = ( 'GCA'=>'A', 'GCC'=>'A', 'GCG'=>'A', 'GCT'=>'A',# Alanine 'AGA'=>'R', 'AGG'=>'R', 'CGA'=>'R', 'CGC'=>'R',# Arginine 'CGG'=>'R', 'CGT'=>'R', # Arginine 'AAC'=>'N', 'AAT'=>'N', # Asparagine 'GAC'=>'D', 'GAT'=>'D', # Aspartic Acid 'TGC'=>'C', 'TGT'=>'C', # Cysteine 'CAA'=>'Q', 'CAG'=>'Q', 'GAA'=>'E', 'GAG'=>'E',# Glutamate 'GGA'=>'G', 'GGC'=>'G', 'GGG'=>'G', 'GGT'=>'G',# Glycine 'CAC'=>'H', 'CAT'=>'H', # Histidine 'ATA'=>'I', 'ATC'=>'I', 'ATT'=>'I', # Isoleucine 'CTA'=>'L', 'CTC'=>'L', 'CTG'=>'L', 'CTT'=>'L',# Leucine 'TTA'=>'L', 'TTG'=>'L', # Leucine 'AAA'=>'K', 'AAG'=>'K', # Lysine 'ATG'=>'M', # Methionine 'CCA'=>'P', 'CCC'=>'P', 'CCG'=>'P', 'CCT'=>'P',# Phynyalanine 'TTC'=>'F', 'TTT'=>'F', # Phynyalanine 'AGC'=>'S', 'AGT'=>'S', 'TCA'=>'S', 'TCC'=>'S',# Serine 'TCG'=>'S', 'TCT'=>'S', # Serine 'ACA'=>'T', 'ACC'=>'T', 'ACG'=>'T', 'ACT'=>'T',# Serine 'TGG'=>'W', # Tryptophan 'TAC'=>'Y', 'TAT'=>'Y', # Tyrosine 'GTA'=>'V', 'GTC'=>'V', 'GTG'=>'V', 'GTT'=>'V',# Valine 'TAA'=>'_', 'TAG'=>'_', 'TGA'=>'_', # exit ); print "DNA: $s1\n"; my $start = 'M'; my $end = '_'; my $idx = -1; while (my $prefix = substr($s1, ++$idx, 3)) { last if length $prefix < 3; # Check for the start indicator. next unless codons($prefix) eq 'M'; my $peptide = proteintrans($end, substr($s1, $idx)); next if !defined $peptide; print "\n\nIDX $idx: $peptide\n"; dbg_dump($s1, $idx, $peptide); } sub dbg_dump { my ($dna, $idx, $peptide) = @_; print $dna, "\n", " " x ($idx-1), '<'; print join(" ", split //, $peptide), ">\n"; } sub proteintrans { my ($end, $dna) = @_; my $protein = ''; for (my $i; $i<length($dna)-2; $i+=3) { my $codon = codons(substr($dna, $i, 3)); return $protein . "? undefined codon '$triplet'" if ! defined +$codon; return $protein if $codon eq $end; $protein .= $codon; } return $codon; } sub codons { my $codon = uc shift; return exists $genetic_code{$codon} ? $genetic_code{$codon} : undef; }
Notes:
[1] I used my favorite indentation style. You used at least two different ones (or the indendation just happened to match two different ones I've seen.)
[2] While playing with your code, I added a counter to the for loop where you build your protein string. Before my changes, it counted 83 iterations, and afterwards it counted 37 iterations. Again, not much difference in this case.
...roboticus
When your only tool is a hammer, all problems look like your thumb.
|
---|
Replies are listed 'Best First'. | |
---|---|
Re: A critique (was Re: Need help with code)
by disulfidebond (Novice) on Oct 21, 2012 at 00:31 UTC | |
by roboticus (Chancellor) on Oct 21, 2012 at 06:05 UTC |