How do I stop appending data

4pt8secs has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, I'm new here so please be gentle.

I've adapted code to deal with turning DNA sequence into amino acid sequence. I have to read through the original file and keep one line as a header and the next is the sequence itself. The sequence is read as each three nucleotides is a single amino acid codon. So I must append the next amino acid codon to the rest of the protein sequence. It works well up until it gets to the next entry. Then it appends the amino acid to the previous protein and continues building it out. I would like it to start over again for each entry sequence it reads.

The original file looks like this...

>M01096:4:000000000-A23M1:1:1101:15974:1529 1:N:0:16
GTTACTGAGTGTGGTTGGCAGACTGCTGGTTGCCGTATGTAT
>M01096:4:000000000-A23M1:1:1101:16525:1548 1:N:0:16
CTTTCTCATTGTGATGTTAAGGATTGGATGTGCTGGCTTCTG
>M01096:4:000000000-A23M1:1:1101:13838:1554 1:N:0:16
GCTTGGACTTGTGTTGAGATTGATGGTCATTTCTCTATGAAT
...

The output looks like this...

>M01096:4:000000000-A23M1:1:1101:15974:1529 1:N:0:16
VTECGWQTAGCRMY
>M01096:4:000000000-A23M1:1:1101:16525:1548 1:N:0:16
VTECGWQTAGCRMYLSHCDVKDWMCWLL
>M01096:4:000000000-A23M1:1:1101:13838:1554 1:N:0:16
VTECGWQTAGCRMYLSHCDVKDWMCWLLAWTCVEIDGHFSMN
...

And here's my code...

#!/bin/perl-5.16.2/perl
##' A script to read in multiple FASTA entries in a single file,
##'     and translate the DNA sequence into protein sequence.
##'     I only plan on using a single reading frame starting from the 
+beginning.

use strict;
use warnings;

my $fasta_file=shift;
my $protein = '';
my $codon;
my $fh;
open($fh, $fasta_file) or die "can't open $fasta_file: $!\n";

my %sequence_data;
while (read_fasta_sequence($fh, \%sequence_data)) {
   #' Translate each three-base codon into amino acid, and append to a
+ protein
   for(my $i=0; $i < (length($sequence_data{seq}) - 2) ; $i += 3) {
       $codon = substr($sequence_data{seq},$i,3);
       $protein .= codon2aa($codon);
    }
   
   print ">$sequence_data{header}\n$protein\n\n";
}

sub read_fasta_sequence {
   my ($fh, $seq_info) = @_;

   $seq_info->{seq} = undef; # clear out previous sequence

   # put the header into place
   $seq_info->{header} = $seq_info->{next_header} if $seq_info->{next_
+header};

   my $file_not_empty = 0; 
   while (<$fh>) {
      $file_not_empty = 1;
      next if /^\s*$/;  # skip blank lines
      chomp;    

      if (/^>/) { # fasta header line
         my $h = $_;    
         $h =~ s/^>//;  
         if ($seq_info->{header}) {
            $seq_info->{next_header} = $h;
            return $seq_info;   
         }              
         else { # first time through only
            $seq_info->{header} = $h;
         }              
      }         
      else {    
         s/\s+//;  # remove any white space
         $seq_info->{seq} .= $_;
      }         
   }    

   if ($file_not_empty) {
      return $seq_info;
   }    
   else {
      # clean everything up
      $seq_info->{header} = $seq_info->{seq} = $seq_info->{next_header
+} = undef;

      return;   
   }    
}

#' codon2aa
#'
#' A subroutine to translate a DNA 3-character codon to an amino acid

sub codon2aa {
    my($codon) = @_;

       if ( $codon =~ /GC./i)        { return 'A' }    # Alanine
    elsif ( $codon =~ /TG[TC]/i)    { return 'C' }    # Cysteine
    elsif ( $codon =~ /GA[TC]/i)    { return 'D' }    # Aspartic Acid
    elsif ( $codon =~ /GA[AG]/i)    { return 'E' }    # Glutamic Acid
    elsif ( $codon =~ /TT[TC]/i)    { return 'F' }    # Phenylalanine
    elsif ( $codon =~ /GG./i)        { return 'G' }    # Glycine
    elsif ( $codon =~ /CA[TC]/i)    { return 'H' }    # Histidine
    elsif ( $codon =~ /AT[TCA]/i)    { return 'I' }    # Isoleucine
    elsif ( $codon =~ /AA[AG]/i)    { return 'K' }    # Lysine
    elsif ( $codon =~ /TT[AG]|CT./i)    { return 'L' }    # Leucine
    elsif ( $codon =~ /ATG/i)        { return 'M' }    # Methionine
    elsif ( $codon =~ /AA[TC]/i)    { return 'N' }    # Asparagine
    elsif ( $codon =~ /CC./i)        { return 'P' }    # Proline
    elsif ( $codon =~ /CA[AG]/i)    { return 'Q' }    # Glutamine
    elsif ( $codon =~ /CG.|AG[AG]/i)    { return 'R' }    # Arginine
    elsif ( $codon =~ /TC.|AG[TC]/i)    { return 'S' }    # Serine
    elsif ( $codon =~ /AC./i)        { return 'T' }    # Threonine
    elsif ( $codon =~ /GT./i)        { return 'V' }    # Valine
    elsif ( $codon =~ /TGG/i)        { return 'W' }    # Tryptophan
    elsif ( $codon =~ /TA[TC]/i)    { return 'Y' }    # Tyrosine
    elsif ( $codon =~ /TA[AG]|TGA/i)    { return '_' }    # Stop
    else {
    return '*';
    }
}
[download]

I think the fix should be easy, but it is eluding me. Thanks in advance for anyone's help.

Comment on How do I stop appending data Download Code

Replies are listed 'Best First'.
Re: How do I stop appending data by Kenosis (Priest) on Mar 07, 2013 at 18:18 UTC
toolic expertly provided the solution to your script's concatenation issue. However, consider using Bio::SeqIO for your Fasta-parsing needs: use strict; use warnings; use Bio::SeqIO; my $in = Bio::SeqIO->new( -fh => \ARGV, -format => 'Fasta' ); while ( my $seq = $in->next_seq() ) { my $protein; $protein .= codon2aa($_) for $seq->seq =~ /.../g; print '>' . $seq->id . "\n$protein\n"; } #' codon2aa #' #' A subroutine to translate a DNA 3-character codon to an amino acid sub codon2aa { my ($codon) = @_; if ( $codon =~ /GC./i ) { return 'A' } # Alanine elsif ( $codon =~ /TG[TC]/i ) { return 'C' } # Cysteine elsif ( $codon =~ /GA[TC]/i ) { return 'D' } # Aspartic Aci +d elsif ( $codon =~ /GA[AG]/i ) { return 'E' } # Glutamic Aci +d elsif ( $codon =~ /TT[TC]/i ) { return 'F' } # Phenylalanin +e elsif ( $codon =~ /GG./i ) { return 'G' } # Glycine elsif ( $codon =~ /CA[TC]/i ) { return 'H' } # Histidine elsif ( $codon =~ /AT[TCA]/i ) { return 'I' } # Isoleucine elsif ( $codon =~ /AA[AG]/i ) { return 'K' } # Lysine elsif ( $codon =~ /TT[AG]\|CT./i ) { return 'L' } # Leucine elsif ( $codon =~ /ATG/i ) { return 'M' } # Methionine elsif ( $codon =~ /AA[TC]/i ) { return 'N' } # Asparagine elsif ( $codon =~ /CC./i ) { return 'P' } # Proline elsif ( $codon =~ /CA[AG]/i ) { return 'Q' } # Glutamine elsif ( $codon =~ /CG.\|AG[AG]/i ) { return 'R' } # Arginine elsif ( $codon =~ /TC.\|AG[TC]/i ) { return 'S' } # Serine elsif ( $codon =~ /AC./i ) { return 'T' } # Threonine elsif ( $codon =~ /GT./i ) { return 'V' } # Valine elsif ( $codon =~ /TGG/i ) { return 'W' } # Tryptophan elsif ( $codon =~ /TA[TC]/i ) { return 'Y' } # Tyrosine elsif ( $codon =~ /TA[AG]\|TGA/i ) { return '_' } # Stop else { return ''; } } [download] Output on your data set: `>M01096:4:000000000-A23M1:1:1101:15974:1529 VTECGWQTAGCRMY >M01096:4:000000000-A23M1:1:1101:16525:1548 LSHCDVKDWMCWLL >M01096:4:000000000-A23M1:1:1101:13838:1554 AWTCVEIDGHFSMN` [download]	[reply] [d/l] [select]
Re^2: How do I stop appending data by 4pt8secs (Novice) on Mar 07, 2013 at 19:05 UTC
Yeah, I'll need to delve into the BioPerl tools. Thanks for the example. That will be helpful too.	[reply]
Re^3: How do I stop appending data by Kenosis (Priest) on Mar 07, 2013 at 19:46 UTC
You're most welcome, 4pt8secs!	[reply]
Re: How do I stop appending data by toolic (Bishop) on Mar 07, 2013 at 18:00 UTC
Reset $protein at the top of your while loop: `while (read_fasta_sequence($fh, \%sequence_data)) { $protein = '';` [download] If that's not what you want, then show your expected output.	[reply] [d/l]
Re^2: How do I stop appending data by 4pt8secs (Novice) on Mar 07, 2013 at 18:45 UTC
Thanks much. That does it.	[reply]
Re: How do I stop appending data by arnaud99 (Beadle) on Mar 07, 2013 at 18:29 UTC
Hi Not as polished as the example above, but outputs the same results use strict; use warnings; my @processed; while (my $a_line = <DATA>) { chomp $a_line; if ($a_line =~ /^>/) { #this is a header, keep as is push @processed, $a_line; } else { # This is a dna seq, process #' Translate each three-base codon into amino acid, # and append to a protein my $protein = ''; my $len = length($a_line) -2; for(my $i=0; $i < $len ; $i += 3) { my $codon = substr($a_line, $i,3); $protein .= codon2aa($codon); } push @processed, $protein; } } #now display what we have processed print $_, "\n" for @processed; #' codon2aa #' #' A subroutine to translate a DNA 3-character codon to an amino acid sub codon2aa { my($codon) = @_; if ( $codon =~ /GC./i) { return 'A' } # Alanine elsif ( $codon =~ /TG[TC]/i) { return 'C' } # Cysteine elsif ( $codon =~ /GA[TC]/i) { return 'D' } # Aspartic Acid elsif ( $codon =~ /GA[AG]/i) { return 'E' } # Glutamic Acid elsif ( $codon =~ /TT[TC]/i) { return 'F' } # Phenylalanine elsif ( $codon =~ /GG./i) { return 'G' } # Glycine elsif ( $codon =~ /CA[TC]/i) { return 'H' } # Histidine elsif ( $codon =~ /AT[TCA]/i) { return 'I' } # Isoleucine elsif ( $codon =~ /AA[AG]/i) { return 'K' } # Lysine elsif ( $codon =~ /TT[AG]\|CT./i) { return 'L' } # Leucine elsif ( $codon =~ /ATG/i) { return 'M' } # Methionine elsif ( $codon =~ /AA[TC]/i) { return 'N' } # Asparagine elsif ( $codon =~ /CC./i) { return 'P' } # Proline elsif ( $codon =~ /CA[AG]/i) { return 'Q' } # Glutamine elsif ( $codon =~ /CG.\|AG[AG]/i) { return 'R' } # Arginine elsif ( $codon =~ /TC.\|AG[TC]/i) { return 'S' } # Serine elsif ( $codon =~ /AC./i) { return 'T' } # Threonine elsif ( $codon =~ /GT./i) { return 'V' } # Valine elsif ( $codon =~ /TGG/i) { return 'W' } # Tryptophan elsif ( $codon =~ /TA[TC]/i) { return 'Y' } # Tyrosine elsif ( $codon =~ /TA[AG]\|TGA/i) { return '_' } # Stop else { return '*'; } } __DATA__ >M01096:4:000000000-A23M1:1:1101:15974:1529 1:N:0:16 GTTACTGAGTGTGGTTGGCAGACTGCTGGTTGCCGTATGTAT >M01096:4:000000000-A23M1:1:1101:16525:1548 1:N:0:16 CTTTCTCATTGTGATGTTAAGGATTGGATGTGCTGGCTTCTG >M01096:4:000000000-A23M1:1:1101:13838:1554 1:N:0:16 GCTTGGACTTGTGTTGAGATTGATGGTCATTTCTCTATGAAT [download] Output `>M01096:4:000000000-A23M1:1:1101:15974:1529 1:N:0:16 VTECGWQTAGCRMY >M01096:4:000000000-A23M1:1:1101:16525:1548 1:N:0:16 LSHCDVKDWMCWLL >M01096:4:000000000-A23M1:1:1101:13838:1554 1:N:0:16 AWTCVEIDGHFSMN` [download] Arnaud	[reply] [d/l] [select]
Re^2: How do I stop appending data by 4pt8secs (Novice) on Mar 07, 2013 at 19:53 UTC
Thanks Arnaud. It's good for me to see another way of approaching the problem.	[reply]
Re: [OT]: How do I stop appending data: Dispatch translation by AnomalousMonk (Archbishop) on Mar 07, 2013 at 22:09 UTC
This is OT, but here's an approch to doing the actual translation by lookup rather than by a large, nested `if...elsif` in a subroutine. Should be faster. Some attempts at data validation are made, but no guarantee this is bulletproof. Uses state feature and the `//` operator, both from 5.10+; if you don't have 5.10, let me know and I'll make work-arounds. Read more... (4 kB)	[reply] [d/l] [select]
Re: How do I stop appending data by Laurent_R (Canon) on Mar 07, 2013 at 21:53 UTC
As a side note to what has already been said, I think that using a dispatch table might be a better choice than your very long if/elsif/elsif... procedure. The keys could be the two first letters of your codon and the values either directly the relevant aminoacid or a coderef returning the right aminoacid, depending (when needed) on the third letter of the codon. I have never been working on this type of problems, so I never tried this type of solution on this type of problems, but I have the feeling that it might be more efficient. I also may be completely wrong. Just my 2 cents.	[reply]


We don't bite newbies here... much
	PerlMonks