Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

calculation of charged amino acids

by yuvraj_ghaly (Sexton)
on Jul 23, 2013 at 09:50 UTC ( [id://1045813]=perlquestion: print w/replies, xml ) Need Help??

yuvraj_ghaly has asked for the wisdom of the Perl Monks concerning the following question:

I have a Perl script which is used to calculate the charged amino acids. how would I modify to calculate numerous protein sequences in a fasta file

print "\n\n\t\#################### Count the number of acidic/basic/ne +utral amino acids #################### \n\n"; print "This script will count the number of acidic/basic/neutral amino + acids\n\n"; use strict; #variables my $count_of_acidic=0; my $count_of_basic=0; my $count_of_neutral=0; my @prot; my $prot_filename; my $line; my $sequence; my $aa; print "PLEASE ENTER THE FILENAME OF THE PROTEIN SEQUENCE:="; chomp($prot_filename=<STDIN>); open(PROTFILE,$prot_filename) or die "unable to open the file"; @prot=<PROTFILE>; close PROTFILE; foreach $line (@prot) { # discard blank line if ($line =~ /^\s*$/) { next; # discard comment line } elsif($line =~ /^\s*#/) { next; # discard fasta header line } elsif($line =~ /^>/) { next; # keep line, add to sequence string } else { $sequence .= $line; } } # remove non-sequence data (in this case, whitespace) from $sequence s +tring $sequence =~ s/\s//g; @prot=split("",$sequence); #splits string into an array #print " \nThe original PROTEIN file is:\n$sequence \n"; while(@prot){ $aa = shift (@prot); if($aa =~/[DNEQ]/ig){ $count_of_acidic++; } if($aa=~/[KRH]/ig){ $count_of_basic++; } if($aa=~/[DNEQKRH]/ig){ $count_of_neutral++; } } print "\nNumber of acidic amino acids:".$count_of_acidic."\n"; print "Number of basic amino acids:".$count_of_basic."\n"; print "Number of neutral amino acids:".$count_of_neutral."\n";

Replies are listed 'Best First'.
Re: calculation of charged amino acids
by mtmcc (Hermit) on Jul 23, 2013 at 10:24 UTC
    Something like this should work (file is a text file containing all the fasta sequences):

    #!/usr/bin/perl -w use strict; use warnings; my $file = $ARGV[0]; my @currentProtein; my %protein; open (FASTA, "<", $file) || die "Can't open $file\n"; my $fastaLine; my $protSequence; while (<FASTA>) { #print STDERR "$_"; chomp; if (($_ !~ /^>/) && ($_ =~ /\w/)) { push (@currentProtein, $_); } if (/^>/) { $fastaLine = $_; } if ((($_ !~ /\w/) || (eof)) && (@currentProtein > 0)) { $protSequence = join("", @currentProtein); $protein{$fastaLine} = $protSequence; @currentProtein = (); } } close FASTA; foreach my $key (sort(keys %protein)) { my $count_of_acidic = 0; my $count_of_basic = 0; my $count_of_neutral = 0; my $aa; my $sequence = "$protein{$key}"; $sequence =~ s/\s//g; my @prot=split("",$sequence); #splits string into an array #print " \nThe original PROTEIN file is:\n$sequence \n"; while(@prot) { $aa = shift (@prot); if($aa =~/[DNEQ]/ig) { $count_of_acidic++; } if($aa=~/[KRH]/ig) { $count_of_basic++; } if($aa=~/[DNEQKRH]/ig) { $count_of_neutral++; } } print "\nName: $key\n"; print "Number of acidic amino acids:".$count_of_acidic."\n"; print "Number of basic amino acids:".$count_of_basic."\n"; print "Number of neutral amino acids:".$count_of_neutral."\n"; }

    I hope that helps.
      The 'g' switch on the three regular expressions shouldn't be there.

      If $aa =~/[DNEQ]/ig matches, then the following match, $aa=~/[KRH]/ig, fails (and the pos is reset so that the last expression, $aa=~/[DNEQKRH]/ig will match.

      However, if $aa=~/[KRH]/ig matches, then pos will be '1' and the following match, $aa=~/[DNEQKRH]/ig will fail because it will attempt to match beginning at pos 1 instead of pos 0.

      You can see this in the following code snippet.

      #!/usr/bin/perl use strict; use warnings; use 5.014; my @prot = qw/ D K /; my ($acid_cnt, $base_cnt, $neutral_cnt); while(@prot) { my $aa = shift (@prot); if($aa =~/[DNEQ]/gi) { ++$acid_cnt; } if($aa=~/[KRH]/gi) { ++$base_cnt; } say "$aa pos: ", pos($aa) // 'pos reset'; if($aa=~/[DNEQKRH]/gi) { ++$neutral_cnt; } }
      This prints

      C:\Old_Data\perlp>perl t7.pl aa D pos: pos reset aa K pos: 1

      This, in effect, counts the acid base in the neutral count but not the base.

      He probably wants the neutral count to be other than the acid or base, in which case, that regular expression should be, $aa=~/[^DNEQKRH]/i, negating the class.

      Without the 'g' switch, and negating the neutral class, the output would look like:

      C:\Old_Data\perlp>perl t9.pl test.fas Name: >DROME_HH_Q02937 Number of acidic amino acids:33 Number of basic amino acids:35 Number of neutral amino acids:136 Name: >DROME_HH_Q02938 Number of acidic amino acids:18 Number of basic amino acids:17 Number of neutral amino acids:69 Name: >DROTME_HH_Q02936 Number of acidic amino acids:14 Number of basic amino acids:18 Number of neutral amino acids:67
      Chris

      somehow it is calculating only one sequence.

        When I run that program on these fasta sequences:

        >DROTME_HH_Q02936 
        MRHIAHTQRCLSRLTSLVALLLIVLPMVFSPAHSCGPGRGLGRHRARNLY
        PLVLKQTIPNLSEYTNSASGPLEGVIRRDSPKFKDLVPNYNRDILFRDE
        
        >DROME_HH_Q02937 
        MRHIAHTQRCLSRLTSLVALLLIVLPMVFSPAHSCGPGRGLGRHRARNLY
        PLVLKQTIPNLSEYTNSASGPLEGVIRRDSPKFKDLVPNYNRDILFRDEE
        GTGADRLMSKRCKEKLNVLAYSVMNEWPGIRLLTTTTTTESWDEDYHHGQ
        YEGRAVTIATSDRDQSKYGMLARLAVEAGFDWVSYVSRRHIYCSVKSDSS
        ESLH
        
        >DROME_HH_Q02938 
        GTGADRLMSKRCKEKLNVLAYSVMNEWPGIRLLTTTTTTESWDEDYHHGQ
        YEGRAVTIATSDRDQSKYGMLARLAVEAGFDWVSYVSRRHIYCSVKSDSS
        ESLH
        

        I get this output:

        Name: >DROME_HH_Q02937 
        Number of acidic amino acids:33
        Number of basic amino acids:35
        Number of neutral amino acids:33
        
        Name: >DROME_HH_Q02938 
        Number of acidic amino acids:18
        Number of basic amino acids:17
        Number of neutral amino acids:18
        
        Name: >DROTME_HH_Q02936 
        Number of acidic amino acids:14
        Number of basic amino acids:18
        Number of neutral amino acids:14
        

        The fasta tags for each protein must be unique.

        so u used the same code which I input

Re: calculation of charged amino acids
by kcott (Archbishop) on Jul 23, 2013 at 10:15 UTC

    G'day yuvraj_ghaly,

    "I have a Perl script which is used to calculate the charged amino acids. how would I modify to calculate numerous protein sequences in a fasta file"

    Put the code that "is used to calculate the charged amino acids" into a subroutine. Call that subroutine for each of your "numerous protein sequences".

    -- Ken

      this is what Im trying to do but in vain

        "this is what Im trying to do but in vain"

        Provide: sample input, expected output, the code you're trying, a description of what you're having problems with, your actual output and any warning or error messages (all described in "How do I post a question effectively?"). Also, improve the readability of your code with indentation (described in perlstyle).

        If you do all this, I'll be happy to provide further assistance.

        -- Ken

        sample input is:

         perl count_charge_aa.pl <filename.extension>

        this is the code i'm suggested with

        #!/usr/bin/perl -w use strict; use warnings; my $file = $ARGV[0]; my @currentProtein; my %protein; open (FASTA, "<", $file) || die "Can't open $file\n"; my $fastaLine; my $protSequence; while (<FASTA>) { #print STDERR "$_"; chomp; if (($_ !~ /^>/) && ($_ =~ /\w/)) { #@currentProtein= $_; push (@currentProtein, $_); } if (/^>/) { $fastaLine = $_; } if ((($_ !~ /\w/) || (eof)) && (@currentProtein > 0)) { $protSequence = join("", @currentProtein); $protein{$fastaLine} = $protSequence; @currentProtein = (); } } close FASTA; foreach my $key (sort(keys %protein)) { my $count_of_acidic = 0; my $count_of_basic = 0; my $count_of_neutral = 0; my $aa; my $sequence = "$protein{$key}"; $sequence =~ s/\s//g; my @prot=split("",$sequence); #splits string into an array #print " \nThe original PROTEIN file is:\n$sequence \n"; while(@prot) { $aa = shift (@prot); if($aa =~/[DNEQ]/ig) { $count_of_acidic++; } if($aa=~/[KRH]/ig) { $count_of_basic++; } if($aa=~/[DNEQKRH]/ig) { $count_of_neutral++; } } print "\nName: $key\n"; print "Number of acidic amino acids:".$count_of_acidic."\n"; print "Number of basic amino acids:".$count_of_basic."\n"; print "Number of neutral amino acids:".$count_of_neutral."\n"; }

        Problem description

        I want to calculate the number of charges amino acids (acidic, basic and neutral) in each protein sequences which is present in fasta file. Note: I have a fasta file which have numerous protein sequences

        OUTPUT

        This script is calculating all the amino acids throughout the file and show the result in last protein sequence entry. Its something like that:

        >gi|1064567454|ref|YP_67854|lipoprotein Myxococcous strain DK6476
        Number of acidic amino acids: 312454
        Number of basic amino acids: 313534
        Number of neutral amino acids: 54454

        It is not calculating charged amino acids sequence by sequence but as a whole

Re: calculation of charged amino acids
by Anonymous Monk on Jul 23, 2013 at 09:57 UTC

    I have a Perl script which is used to calculate the charged amino acids. how would I modify to calculate numerous protein sequences in a fasta file

    Through SMOP :)

    Hooray, I'm helping :P

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1045813]
Approved by kcott
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having an uproarious good time at the Monastery: (3)
As of 2024-04-26 08:19 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found