Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi dear monkes, i have protein file in fasta format and i ve written folloing code for finding frequency of each amino acid (alphabet), but there is a problem. the result file is very big. i mean i expect a very smaller file. please take a look at my code.

#!/usr/bin/perl -w use bio::seqIO; my $outputfile; #protein file bringing in. $outputfile="countaa"; unless(open(IN,">>$outputfile")) {print" cant open file \"$outputfile\ to write to !!\n\n"; exit;} my $proteinio=Bio::SeqIO->new (-file=>"ec 1.1.1.fasta",-format=>'fasta +'); while(my $seq=$proteinio ->next_seq() ){ foreach(my $protein=$seq->seq){ $count_A=0;$count_C=0;$count_D=0;$count_E=0;$count_F=0;$count_G=0;$cou +nt_H=0;$count_I=0; $count_K=0;$count_L=0;$count_M=0;$count_N=0;$count_P=0;$count_Q=0;$cou +nt_R=0;$count_S=0; $count_T=0;$count_V=0;$count_W=0;$count_Y=0; my @protein=split('',$protein); foreach (@protein){ if (/A/) {++$count_A;} elsif (/C/) {++$count_C;} elsif (/D/) {++$count_D;} elsif (/E/) {++$count_E;} elsif (/F/) {++$count_F;} elsif (/G/) {++$count_G;} elsif (/H/) {++$count_H;} elsif (/I/) {++$count_I;} elsif (/K/) {++$count_K;} elsif (/L/) {++$count_L;} elsif (/M/) {++$count_M;} elsif (/N/) {++$count_N;} elsif (/P/) {++$count_P;} elsif (/Q/) {++$count_Q;} elsif (/R/) {++$count_R;} elsif (/S/) {++$count_S;} elsif (/T/) {++$count_T;} elsif (/V/) {++$count_V;} elsif (/W/) {++$count_W;} elsif (/Y/) {++$count_Y;} print IN "$count_A $count_C $count_D $count_E $count_F $count_G $count +_H $count_I $count_K $count_L $count_M $count_N $count_P $count_Q $count_R $count_S $count_ +T $count_V $count_W $count_Y\n";} }} close IN;

Replies are listed 'Best First'.
Re: alphabet counting
by davido (Cardinal) on Jun 02, 2012 at 15:40 UTC

    It looks to me like you're printing inside of the tightest loop when you should be moving that print statement out to the same scope at which the counter variables are getting reset. You haven't really specified your problem clearly, but printing an accumulator several times before resetting it is sort of a red flag to me. Another red flag is that you're opening your output file in append mode, so each time you run the program you're just making the file bigger. Better to open it in "clobber and output" mode, unless you really have a need to just append.

    I've tried to rework your program in a way that is clearer to read, easier to maintain, and more idiomatic to a Perl programmer. I also changed your output filehandle's name from "IN" to "$OUT". When you open a file for output, you are just asking for confusion by calling its filehandle "IN". Here's some example code.

    use strict; use warnings; use bio::seqIO; my @tracked_proteins = qw( A C D E F G H I K L M N P Q R S T V W Y ); # Protein output file: my $output_file = 'countaa'; open my $OUT, '>', $output_file or die "Can't open file $output_file for output.\n$!"; my $proteinio = Bio::SeqIO->new( -file => "ec 1.1.1.fasta", -format => 'fasta' ); while ( my $seq = $proteinio->next_seq ) { foreach my $protein_seq ( $seq->seq ) { my %counts; @counts{@tracked_proteins} = (); my @protein = split //, $protein_seq; foreach my $protein_alpha (@protein) { next unless exists $counts{$protein_alpha}; $counts{$protein_alpha}++; } } print $OUT "$_ " for @counts{@tracked_proteins}; print $OUT "\n"; } close $OUT or die $!;

    The code could be simplified further if we knew that the only characters found in the input data stream are in the set of [ACDEFGHIKLMNPQRSTUVWY]. Since I know about as much about your problem domain as you do mine, I took the precaution of skipping characters that don't match that specific set.

    It's also possible that your accumulator should be reset at a broader scope, and that your print could be moved out one scope level too, but I made it a point to follow your lead on that one. This code is clearly written enough (in my opinion) that you should be able to change scoping of the accumulator and the print easily enough. You may find, as I do, that whenever a unit of functionality can fit on one screen within your editor, the problem becomes easier to visualize.

    Update: Fixed a print formatting issue... and it helps if a counter script actually increments a counter. ;)


Re: alphabet counting
by sauoq (Abbot) on Jun 02, 2012 at 13:44 UTC

    You are printing inside your loop.

    You want to accumulate your values and then only print them after you are done.

    As an aside, it would be more perlish if you were using a hash to accumulate your values... you could change all your ifs to a single line.

    And it's probably not a good idea to call the filehandle IN when you are using it for output.

    "My two cents aren't worth a dime.";

      i have problem with hash printing !!!

        i have problem with hash printing !!!

        You could guess that any such problems could be easily resolved... did you try something like this:

        print "$k => $h{$k}\n" for my $k (sort keys %h);

        "My two cents aren't worth a dime.";
Re: alphabet counting
by Corion (Pope) on Jun 02, 2012 at 13:21 UTC

    What have you done to get to the cause of your problem?

    Please describe in your own words what your program should do, and how many lines of output you expect. Then also describe in your own words what your program does. As a hint, it likely helps if you properly indent your source code so that for each loop, you put four spaces before all statements contained in the loop body:

    foreach( @protein) { if (...) { ... } elsif (...) { ... } ... };

    That way, it will likely become more clear to you where your error happens.

    Also, using a different data structure, like a hash, will make your program much shorter:

    my %count; foreach( @protein ) { $count{ $_ }++; ... };

      thank you. i have 428 KB fasta file such as text.there are 1000 protein in, i have to have 1000 line in result which give me the ferequency of each amino acid ( alphabet), but my result file is 14 MB which cannot handle with notepad.

        Instead of always testing with the whole set, I recommend testing with only two or three proteins. That way, your tests should run quicker and you should be able to inspect the results with notepad (for example). You can also print the results to the console instead of a file to see them.

Re: alphabet counting
by Cristoforo (Curate) on Jun 03, 2012 at 00:07 UTC
    Here is a solution to counting the letters taken from the BioPerl webpage, the 'HOWTO:Beginners' link on the opening page, then link 17 - (Obtaining basic sequence statistics). That led to the docs on Bio::Tools::SeqStats.

    #!/usr/bin/perl use strict; use warnings; use Bio::SeqIO; use Bio::Tools::SeqStats; my @prot = qw/ A C D E F G H I K L M N P Q R S T V W Y /; my $outputfile = "countaa"; open my $OUT, ">", $outputfile or die "Can't open file \"$outputfile\" to write to $!\n\n"; my $proteinio=Bio::SeqIO->new (-file=>"ec 1.1.1.fasta",-format=>'fasta +'); while(my $seq = $proteinio->next_seq() ) { my $seq_stats = Bio::Tools::SeqStats->new(-seq => $seq); my $count = $seq_stats->count_monomers(); print $OUT join(' ', map {$_ || 0} @$count{ @prot }), "\n"; } close $OUT or die $!;

    Hope this is of some help,