Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

How do I stop appending data

by 4pt8secs (Novice)
on Mar 07, 2013 at 17:49 UTC ( [id://1022280]=perlquestion: print w/replies, xml ) Need Help??

4pt8secs has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, I'm new here so please be gentle.

I've adapted code to deal with turning DNA sequence into amino acid sequence. I have to read through the original file and keep one line as a header and the next is the sequence itself. The sequence is read as each three nucleotides is a single amino acid codon. So I must append the next amino acid codon to the rest of the protein sequence. It works well up until it gets to the next entry. Then it appends the amino acid to the previous protein and continues building it out. I would like it to start over again for each entry sequence it reads.

The original file looks like this...

>M01096:4:000000000-A23M1:1:1101:15974:1529 1:N:0:16
GTTACTGAGTGTGGTTGGCAGACTGCTGGTTGCCGTATGTAT
>M01096:4:000000000-A23M1:1:1101:16525:1548 1:N:0:16
CTTTCTCATTGTGATGTTAAGGATTGGATGTGCTGGCTTCTG
>M01096:4:000000000-A23M1:1:1101:13838:1554 1:N:0:16
GCTTGGACTTGTGTTGAGATTGATGGTCATTTCTCTATGAAT
...

The output looks like this...

>M01096:4:000000000-A23M1:1:1101:15974:1529 1:N:0:16
VTECGWQTAGCRMY
>M01096:4:000000000-A23M1:1:1101:16525:1548 1:N:0:16
VTECGWQTAGCRMYLSHCDVKDWMCWLL
>M01096:4:000000000-A23M1:1:1101:13838:1554 1:N:0:16
VTECGWQTAGCRMYLSHCDVKDWMCWLLAWTCVEIDGHFSMN
...

And here's my code...

#!/bin/perl-5.16.2/perl ##' A script to read in multiple FASTA entries in a single file, ##' and translate the DNA sequence into protein sequence. ##' I only plan on using a single reading frame starting from the +beginning. use strict; use warnings; my $fasta_file=shift; my $protein = ''; my $codon; my $fh; open($fh, $fasta_file) or die "can't open $fasta_file: $!\n"; my %sequence_data; while (read_fasta_sequence($fh, \%sequence_data)) { #' Translate each three-base codon into amino acid, and append to a + protein for(my $i=0; $i < (length($sequence_data{seq}) - 2) ; $i += 3) { $codon = substr($sequence_data{seq},$i,3); $protein .= codon2aa($codon); } print ">$sequence_data{header}\n$protein\n\n"; } sub read_fasta_sequence { my ($fh, $seq_info) = @_; $seq_info->{seq} = undef; # clear out previous sequence # put the header into place $seq_info->{header} = $seq_info->{next_header} if $seq_info->{next_ +header}; my $file_not_empty = 0; while (<$fh>) { $file_not_empty = 1; next if /^\s*$/; # skip blank lines chomp; if (/^>/) { # fasta header line my $h = $_; $h =~ s/^>//; if ($seq_info->{header}) { $seq_info->{next_header} = $h; return $seq_info; } else { # first time through only $seq_info->{header} = $h; } } else { s/\s+//; # remove any white space $seq_info->{seq} .= $_; } } if ($file_not_empty) { return $seq_info; } else { # clean everything up $seq_info->{header} = $seq_info->{seq} = $seq_info->{next_header +} = undef; return; } } #' codon2aa #' #' A subroutine to translate a DNA 3-character codon to an amino acid sub codon2aa { my($codon) = @_; if ( $codon =~ /GC./i) { return 'A' } # Alanine elsif ( $codon =~ /TG[TC]/i) { return 'C' } # Cysteine elsif ( $codon =~ /GA[TC]/i) { return 'D' } # Aspartic Acid elsif ( $codon =~ /GA[AG]/i) { return 'E' } # Glutamic Acid elsif ( $codon =~ /TT[TC]/i) { return 'F' } # Phenylalanine elsif ( $codon =~ /GG./i) { return 'G' } # Glycine elsif ( $codon =~ /CA[TC]/i) { return 'H' } # Histidine elsif ( $codon =~ /AT[TCA]/i) { return 'I' } # Isoleucine elsif ( $codon =~ /AA[AG]/i) { return 'K' } # Lysine elsif ( $codon =~ /TT[AG]|CT./i) { return 'L' } # Leucine elsif ( $codon =~ /ATG/i) { return 'M' } # Methionine elsif ( $codon =~ /AA[TC]/i) { return 'N' } # Asparagine elsif ( $codon =~ /CC./i) { return 'P' } # Proline elsif ( $codon =~ /CA[AG]/i) { return 'Q' } # Glutamine elsif ( $codon =~ /CG.|AG[AG]/i) { return 'R' } # Arginine elsif ( $codon =~ /TC.|AG[TC]/i) { return 'S' } # Serine elsif ( $codon =~ /AC./i) { return 'T' } # Threonine elsif ( $codon =~ /GT./i) { return 'V' } # Valine elsif ( $codon =~ /TGG/i) { return 'W' } # Tryptophan elsif ( $codon =~ /TA[TC]/i) { return 'Y' } # Tyrosine elsif ( $codon =~ /TA[AG]|TGA/i) { return '_' } # Stop else { return '*'; } }

I think the fix should be easy, but it is eluding me. Thanks in advance for anyone's help.

Replies are listed 'Best First'.
Re: How do I stop appending data
by Kenosis (Priest) on Mar 07, 2013 at 18:18 UTC

    toolic expertly provided the solution to your script's concatenation issue. However, consider using Bio::SeqIO for your Fasta-parsing needs:

    use strict; use warnings; use Bio::SeqIO; my $in = Bio::SeqIO->new( -fh => \*ARGV, -format => 'Fasta' ); while ( my $seq = $in->next_seq() ) { my $protein; $protein .= codon2aa($_) for $seq->seq =~ /.../g; print '>' . $seq->id . "\n$protein\n"; } #' codon2aa #' #' A subroutine to translate a DNA 3-character codon to an amino acid sub codon2aa { my ($codon) = @_; if ( $codon =~ /GC./i ) { return 'A' } # Alanine elsif ( $codon =~ /TG[TC]/i ) { return 'C' } # Cysteine elsif ( $codon =~ /GA[TC]/i ) { return 'D' } # Aspartic Aci +d elsif ( $codon =~ /GA[AG]/i ) { return 'E' } # Glutamic Aci +d elsif ( $codon =~ /TT[TC]/i ) { return 'F' } # Phenylalanin +e elsif ( $codon =~ /GG./i ) { return 'G' } # Glycine elsif ( $codon =~ /CA[TC]/i ) { return 'H' } # Histidine elsif ( $codon =~ /AT[TCA]/i ) { return 'I' } # Isoleucine elsif ( $codon =~ /AA[AG]/i ) { return 'K' } # Lysine elsif ( $codon =~ /TT[AG]|CT./i ) { return 'L' } # Leucine elsif ( $codon =~ /ATG/i ) { return 'M' } # Methionine elsif ( $codon =~ /AA[TC]/i ) { return 'N' } # Asparagine elsif ( $codon =~ /CC./i ) { return 'P' } # Proline elsif ( $codon =~ /CA[AG]/i ) { return 'Q' } # Glutamine elsif ( $codon =~ /CG.|AG[AG]/i ) { return 'R' } # Arginine elsif ( $codon =~ /TC.|AG[TC]/i ) { return 'S' } # Serine elsif ( $codon =~ /AC./i ) { return 'T' } # Threonine elsif ( $codon =~ /GT./i ) { return 'V' } # Valine elsif ( $codon =~ /TGG/i ) { return 'W' } # Tryptophan elsif ( $codon =~ /TA[TC]/i ) { return 'Y' } # Tyrosine elsif ( $codon =~ /TA[AG]|TGA/i ) { return '_' } # Stop else { return '*'; } }

    Output on your data set:

    >M01096:4:000000000-A23M1:1:1101:15974:1529 VTECGWQTAGCRMY >M01096:4:000000000-A23M1:1:1101:16525:1548 LSHCDVKDWMCWLL >M01096:4:000000000-A23M1:1:1101:13838:1554 AWTCVEIDGHFSMN

      Yeah, I'll need to delve into the BioPerl tools. Thanks for the example. That will be helpful too.

        You're most welcome, 4pt8secs!

Re: How do I stop appending data
by toolic (Bishop) on Mar 07, 2013 at 18:00 UTC
    Reset $protein at the top of your while loop:
    while (read_fasta_sequence($fh, \%sequence_data)) { $protein = '';

    If that's not what you want, then show your expected output.

      Thanks much. That does it.

Re: How do I stop appending data
by arnaud99 (Beadle) on Mar 07, 2013 at 18:29 UTC

    Hi

    Not as polished as the example above, but outputs the same results

    use strict; use warnings; my @processed; while (my $a_line = <DATA>) { chomp $a_line; if ($a_line =~ /^>/) { #this is a header, keep as is push @processed, $a_line; } else { # This is a dna seq, process #' Translate each three-base codon into amino acid, # and append to a protein my $protein = ''; my $len = length($a_line) -2; for(my $i=0; $i < $len ; $i += 3) { my $codon = substr($a_line, $i,3); $protein .= codon2aa($codon); } push @processed, $protein; } } #now display what we have processed print $_, "\n" for @processed; #' codon2aa #' #' A subroutine to translate a DNA 3-character codon to an amino acid sub codon2aa { my($codon) = @_; if ( $codon =~ /GC./i) { return 'A' } # Alanine elsif ( $codon =~ /TG[TC]/i) { return 'C' } # Cysteine elsif ( $codon =~ /GA[TC]/i) { return 'D' } # Aspartic Acid elsif ( $codon =~ /GA[AG]/i) { return 'E' } # Glutamic Acid elsif ( $codon =~ /TT[TC]/i) { return 'F' } # Phenylalanine elsif ( $codon =~ /GG./i) { return 'G' } # Glycine elsif ( $codon =~ /CA[TC]/i) { return 'H' } # Histidine elsif ( $codon =~ /AT[TCA]/i) { return 'I' } # Isoleucine elsif ( $codon =~ /AA[AG]/i) { return 'K' } # Lysine elsif ( $codon =~ /TT[AG]|CT./i) { return 'L' } # Leucine elsif ( $codon =~ /ATG/i) { return 'M' } # Methionine elsif ( $codon =~ /AA[TC]/i) { return 'N' } # Asparagine elsif ( $codon =~ /CC./i) { return 'P' } # Proline elsif ( $codon =~ /CA[AG]/i) { return 'Q' } # Glutamine elsif ( $codon =~ /CG.|AG[AG]/i) { return 'R' } # Arginine elsif ( $codon =~ /TC.|AG[TC]/i) { return 'S' } # Serine elsif ( $codon =~ /AC./i) { return 'T' } # Threonine elsif ( $codon =~ /GT./i) { return 'V' } # Valine elsif ( $codon =~ /TGG/i) { return 'W' } # Tryptophan elsif ( $codon =~ /TA[TC]/i) { return 'Y' } # Tyrosine elsif ( $codon =~ /TA[AG]|TGA/i) { return '_' } # Stop else { return '*'; } } __DATA__ >M01096:4:000000000-A23M1:1:1101:15974:1529 1:N:0:16 GTTACTGAGTGTGGTTGGCAGACTGCTGGTTGCCGTATGTAT >M01096:4:000000000-A23M1:1:1101:16525:1548 1:N:0:16 CTTTCTCATTGTGATGTTAAGGATTGGATGTGCTGGCTTCTG >M01096:4:000000000-A23M1:1:1101:13838:1554 1:N:0:16 GCTTGGACTTGTGTTGAGATTGATGGTCATTTCTCTATGAAT

    Output

    >M01096:4:000000000-A23M1:1:1101:15974:1529 1:N:0:16 VTECGWQTAGCRMY >M01096:4:000000000-A23M1:1:1101:16525:1548 1:N:0:16 LSHCDVKDWMCWLL >M01096:4:000000000-A23M1:1:1101:13838:1554 1:N:0:16 AWTCVEIDGHFSMN

    Arnaud

      Thanks Arnaud. It's good for me to see another way of approaching the problem.

Re: [OT]: How do I stop appending data: Dispatch translation
by AnomalousMonk (Archbishop) on Mar 07, 2013 at 22:09 UTC

    This is OT, but here's an approch to doing the actual translation by lookup rather than by a large, nested  if...elsif in a subroutine. Should be faster. Some attempts at data validation are made, but no guarantee this is bulletproof. Uses state feature and the  // operator, both from 5.10+; if you don't have 5.10, let me know and I'll make work-arounds.

Re: How do I stop appending data
by Laurent_R (Canon) on Mar 07, 2013 at 21:53 UTC

    As a side note to what has already been said, I think that using a dispatch table might be a better choice than your very long if/elsif/elsif... procedure. The keys could be the two first letters of your codon and the values either directly the relevant aminoacid or a coderef returning the right aminoacid, depending (when needed) on the third letter of the codon.

    I have never been working on this type of problems, so I never tried this type of solution on this type of problems, but I have the feeling that it might be more efficient. I also may be completely wrong. Just my 2 cents.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1022280]
Approved by Corion
Front-paged by Lotus1
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others exploiting the Monastery: (5)
As of 2024-04-19 14:30 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found