Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister

[SOLVED] Keeping track of substituted regions with a while loop

by enderk (Novice)
on Nov 15, 2013 at 21:57 UTC ( #1062805=perlquestion: print w/replies, xml ) Need Help??
enderk has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow monks, I hope you are all having a good day. I coded up a short script to translate an RNA sequence (consisting of the letters A, U, G and C) into an amino acid sequence (contains 20 different letters).

There is a "genetic code" that can convert three letters from the RNA sequence (a "codon") to an amino acid (represented as one letter). This code can be found HERE. My script is the following:

use strict; use warnings; open READER, '<', 'rna.txt'; chomp(my $rna = <READER>); close READER; my %gencode; open GENCODE, '<', 'code.txt'; while (<GENCODE> =~ m/^([AUGC]{3}) (\w)?$/m) { my $codon = $1; my $aa; if (defined $2) { $aa = $2; } else { # $aa = 'STOP'; $aa = ''; } $gencode{$codon} = $aa; } close GENCODE; my $protein = $rna; while ($rna =~ m/([AUGC]{3})/g) { my $codon = $1; my $aa = $gencode{$codon}; $protein =~ s/$codon/$aa/; } open RESULTS, '>', 'results.txt'; print RESULTS "$protein\n"; close RESULTS;

$rna makes reference to a large RNA string (found here, if you are interested).

When running this script, I start off getting the correct sequence, until I reach position 629 in the protein sequence:

...NEWTAWFLNSPAAGPNQCQIVY... # Answer I should get ...NEWTAWFLNSPK PNQCQIVY... # My answer

Note that my answer does not have the space. I have included it in order to make the comparison between both strings easier.

Here is what I think has happened: in RNA, the AAG sequence becomes K in a protein sequence. At the same time, each of A, A and G are one-letter codes for amino acids in a protein (each of those letters could have originated from a three-letter code in the original RNA sequence). When Perl was substituting the RNA 3-letter codes with protein 1-letter codes, it mis-interpreted this AAG protein sequence as an RNA sequence and "resubstituted it" again.

I am confused because I thought that the Perl while loop keeps track of where in the sequence it has carried out a substitution, but perhaps this seems not to be the case? I know there are alternative ways of solving this problem (which I have solved using a different method), but I would like to know what is going on with my script. Does anyone know how to prevent Perl from substituting a part of a string that has already been substituted, even if this new part corresponds to the matching criteria in the regex expression? The documentations tells me to use the "c" modifier after the regex, but that does not work.

Replies are listed 'Best First'.
Re: Keeping track of substituted regions with a while loop
by choroba (Chancellor) on Nov 15, 2013 at 22:59 UTC
    You interpretation is correct. Use /g and Perl will keep track of the position in the string itself:
    $rna =~ s/(...)/$gencode{$1}/g;
    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      Wow, choroba, thanks for that! That gives me exactly what I wanted to do using a substitution regex expression. This means I don't need to use a while loop in my code as the g modifier will ensure the regex expression "walks" along the string.

      For future reference, in case anyone stumbles across this node, I'll paste the modified code below. Hopefully someone else will learn as much about regular expressions as I did today :)

      use strict; use warnings; open READER, '<', 'rna.txt'; chomp(my $rna = <READER>); close READER; my %gencode; open GENCODE, '<', 'code.txt'; while (<GENCODE> =~ m/^([AUGC]{3}) (\w)?$/m) { my $codon = $1; my $aa; if (defined $2) { $aa = $2; } else { # $aa = 'STOP'; $aa = ''; } $gencode{$codon} = $aa; } close GENCODE; my $protein = $rna; $protein =~ s/([AUGC]{3})/$gencode{$1}/g; open RESULTS, '>', 'results.txt'; print RESULTS "$protein\n"; close RESULTS;

      Also for the record, I know that now, I don't necessarily have to duplicate the $rna string into the $protein string, but I have left that in nevertheless as I think it makes the code conceptually easier to understand.

      Thanks a lot to both of you for trying to help me out

Re: Keeping track of substituted regions with a while loop
by kschwab (Priest) on Nov 15, 2013 at 22:51 UTC

    Based on your sample data, this seems more straightforward, and avoids the issue of substituting something that was already translated. Caveat: I'm guessing based on the sample data, so I may have missed something crucial.

    my $protein; # split $rna into groups of 3 characters for ( unpack("(A3)*",$rna)) { if (exists($gencode{$_}) { $protein.=$gencode{$_}; } else { die "No mapping for [$_]\n"; } }
      Missed a paren...
      for ( unpack("(A3)*",$rna)) { if (exists($gencode{$_})) { $protein.=$gencode{$_}; } else { die "No mapping for [$_]\n"; } }

      Hi kschwab, thanks for your reply. You're absolutely right, that is another way of solving this problem (by sequentially appending the one-letter codes to a new string instead of substituting an existing string) and it is similar to the "alternative" method I described in my original post.

      However, I was not looking so much into getting the right answer as rather trying to learn more about how to use regexes, combined with a while loop, in this situation. Is there an "easy" way for Perl to keep track of the position it is in, as it moves along a string in a while loop, or would that be an overkill to do, considering there are easier options available?

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1062805]
Approved by GrandFather
[ambrus]: choroba: what type?

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (9)
As of 2017-09-22 15:24 GMT
Find Nodes?
    Voting Booth?
    During the recent solar eclipse, I:

    Results (264 votes). Check out past polls.