Listing occurence of all residues

sreya has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl experts, I would like to ask for your help. I have a list of FASTA-format sequences, like

>header1

aaaaabbbb

ccccddd

>header2

ggggg

jjj

kkkk

etc... I want to count the frequency of all residues for each protein ID given in the header, i.e. I need to count the occurrences of X (X = "not sure" residues) and of unusual amino acids too. If the occurrence of any residue is zero, I just want a blank. Like

ID A B C D G U X J K

ID_1 5 4 4 3 - - - - -

ID_2 - - - - 5 - - 3 4

...........

I have tried, but my code neither prints in the required format nor counts per sequence; instead it keeps accumulating the count of each residue until the end of the file. I am a biology student and not very good at programming; I am still in the learning phase.

I am giving my code here. Please tell me where my code is wrong and what I need to change to get the exact output. Thank you all for considering and reading my question.

open(FILE1,"e.txt")or die "can't open file for reading\n";

while (<FILE1>)

    {
      chomp;
      next if(/^\s*$/);
    
       my $FastaLine= $_;
  
           if ( $FastaLine =~ /^>sp\|(\w+\S+)\|/ )
                   {
                     $header = $FastaLine;
                   }
           else
                 #storing the sequences and appending the sequence lines that come after each header
                 #and storing the sequence as values of $header
                 {
              $Fasta_split{$header} .= $FastaLine;
                 }




                if ( $header =~ /^>sp\|(\w+\S+)\|/ )
                 {
                    my $name = $1;
                #print "$name\t";
                   }
                   
         }          
                   
                   
              
            while  (($header,$Fasta_split{$header})=each(%Fasta_split) ){
              
               if ( $header =~ /^>sp\|(\w+\S+)\|/ )
                 {
                    my $name = $1;
                    #print "$name.txt\n";
                 
                    
                    #print"$name\n$Fasta_split{$header}\n";
                    
                   my @words= split"", $Fasta_split{$header};
                   
                   
                foreach my $w(@words){
                       $count{$w}++;
                   }
                   
                 while (my($w,$c)=each(%count)){
                  print "$w:$c\t";  
                    }
                    
                print "\n";
                             
               }          
     

}

Replies are listed 'Best First'.
Re: Listing occurence of all residues by Anonymous Monk on Mar 01, 2015 at 13:23 UTC
Welcome! Please see How do I post a question effectively? and kindly provide some short but representative sample input data along with the expected output for that sample input (both formatted inside `<code>` tags), with which one can reproduce the problem.
Re^2: Listing occurence of all residues by sreya (Initiate) on Mar 01, 2015 at 14:50 UTC
Thank you. I have updated the question. I think it is clearer now.
Re^3: Listing occurence of all residues by pme (Monsignor) on Mar 01, 2015 at 15:06 UTC
Hi sreya, could you attach a sample 'e.txt'?
Re^4: Listing occurence of all residues by sreya (Initiate) on Mar 01, 2015 at 17:14 UTC
Re^5: Listing occurence of all residues by pme (Monsignor) on Mar 01, 2015 at 18:57 UTC
Re: Listing occurence of all residues by 2teez (Vicar) on Mar 01, 2015 at 15:07 UTC
Hi sreya,

I would have loved to see the result your code was giving. However, if I may advise, use warnings and strict in your Perl code ALWAYS. There are also several more modern ways of doing what you want done; for your open, for example, the three-argument form is preferred. All that being said, you could get around your code issue like so:

use warnings;
use strict;
use Data::Dumper;

my %data;
my $header;
while (<DATA>) {
    chomp;
    next if /^$/;    # skip blank lines
    if (/^>\D+?(\d+?)$/) {
        $header = $1;
    }
    else {
        $data{$header}{$_}++ for split //, $_;
    }
}
print Dumper \%data;
__DATA__
>header1
aaaaabbbb
ccccddd
>header2
ggggg
jjj
kkkk

OUTPUT:

$VAR1 = {
          '1' => {
                   'a' => 5,
                   'b' => 4,
                   'c' => 4,
                   'd' => 3
                 },
          '2' => {
                   'g' => 5,
                   'j' => 3,
                   'k' => 4
                 }
        };

How to display the output as desired is left for the OP! :)

If you tell me, I'll forget. If you show me, I'll remember. If you involve me, I'll understand. --- Author unknown to me
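As a rough illustration of that last step, here is a minimal sketch of one way to print a nested hash like `%data` above as the table the question asks for. The sub name `format_table`, the `ID_` prefix and the column list are assumptions made for this example, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: render per-header residue counts as a table.
# $data is a hash ref of { id => { residue => count } };
# $columns is an array ref of residue letters, in output order.
sub format_table {
    my ($data, $columns) = @_;
    my @lines = (join ' ', 'ID', map { uc $_ } @$columns);
    for my $id (sort keys %$data) {
        # A missing or zero count is shown as a dash.
        my @cells = map { $data->{$id}{$_} || '-' } @$columns;
        push @lines, join ' ', "ID_$id", @cells;
    }
    return join "\n", @lines;
}

# Same counts as in the Data::Dumper output above.
my %data = (
    1 => { a => 5, b => 4, c => 4, d => 3 },
    2 => { g => 5, j => 3, k => 4 },
);
print format_table(\%data, [qw(a b c d g u x j k)]), "\n";
```

With the sample counts this should print a header row followed by `ID_1 5 4 4 3 - - - - -` and `ID_2 - - - - 5 - - 3 4`. Columns are separated by single spaces here; `printf` with fixed widths would be the next step if counts grow beyond one digit.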
Re: Listing occurence of all residues by Anonymous Monk on Mar 01, 2015 at 15:48 UTC
Thanks for the update; it would be good if you could put your sample input and output inside `<code>` tags as mentioned previously. Note that the regular expression `/^>sp\|(\w+\S+)\|/` does not match the sample data you provided, but I'm guessing that's just because the sample data is oversimplified.

Some general things you should do:

Use strict and warnings (important!)
Use perltidy
Have a look at the Basic debugging checklist

The code still has a couple of smaller issues, but the two things that are preventing it from working correctly are the following.

First, when you write `while ( ($header, $Fasta_split{$header}) = each(%Fasta_split) ) {`, you are assigning to / re-using variables that you shouldn't re-use. You could improve this loop a bit using for, sort and keys: `for my $header (sort keys %Fasta_split) {` (but note you actually shouldn't re-use the variable name `$header` here!)

Second, you are re-using `%count` without clearing it first, which is why the counts keep accumulating across sequences. The easiest way to fix it is to use a `%count` local to the loop, i.e. put `my %count;` inside the second loop.

The output can also be made a bit more organized using the same method as above: `for my $w (sort keys %count) { print "$w:$count{$w}\t"; }`

But of course it's possible to get the output to look even more like what you want. The following code loops over the letters you want to output, and prints either the number, or a dash if that number is zero (or undefined). The list of letters could be simplified with Perl's `qw//`.

for my $w ("a","b","c","d","g","u","x","j","k") {
    print $count{$w}||"-", " ";
}

Which gives:

5 4 4 3 - - - - -
- - - - 5 - - 3 4

Customizing that is left as an exercise to the reader :-) (one tip: see uc and lc)
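To tie those points together, here is a hedged sketch of how the pieces might fit: a three-argument open, fresh counts for every record instead of one shared `%count`, `uc` to fold case, and `qw` for the letter list. The sub name `count_residues` and the generic header pattern `/^>(\S+)/` are assumptions for this example; the real data apparently uses `>sp|...|` headers, so the pattern would need adjusting. The sample is read from an in-memory string rather than `e.txt` so that the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA records from a filehandle and return
# { header => { RESIDUE => count } }, with residues upper-cased via uc.
sub count_residues {
    my ($fh) = @_;
    my (%counts, $id);
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;      # skip blank lines
        if ($line =~ /^>(\S+)/) {      # assumed header format: ">name"
            $id = $1;
            $counts{$id} = {};         # fresh counts for every record
        }
        elsif (defined $id) {
            $counts{$id}{ uc $_ }++ for split //, $line;
        }
    }
    return \%counts;
}

# Three-argument open, here on an in-memory string instead of e.txt.
my $fasta = ">header1\naaaaabbbb\nccccddd\n>header2\nggggg\njjj\nkkkk\n";
open my $fh, '<', \$fasta or die "can't open sample data: $!";
my $counts = count_residues($fh);
close $fh;

for my $id (sort keys %$counts) {
    # A dash stands in for a zero or missing count.
    my @cells = map { $counts->{$id}{$_} || '-' } qw(A B C D G U X J K);
    print join(' ', $id, @cells), "\n";
}
```

Because each record gets its own inner hash, counts from one sequence cannot leak into the next, which is the behaviour the original script was missing.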

