PerlMonks  

Re: How to count the length of a sequence of alphabets and number of occurence of a particular alphabet in the sequence?

by rjt (Curate)
on Oct 07, 2019 at 17:36 UTC (#11107142)


in reply to How to count the length of a sequence of alphabets and number of occurence of a particular alphabet in the sequence?

You can count the number of occurrences of a particular character in a string with tr:

    use 5.010;

    my $str1 = 'MALSSTAATTSSKLKLSNPPSLSHTFTASASASVSNSTSFR';
    my $a_count = $str1 =~ tr/A/A/; # 6

length will give you the length of the entire string.
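As a small sketch of the two ideas together (the sub name residue_counts is made up for illustration, not something from the original post): tr with an empty replacement list counts without changing the string, and a hash can tally every letter in one pass if you later want more than just 'A'.

```perl
use strict;
use warnings;
use 5.010;

# Tally every character of a sequence into a hash: letter => count.
# (Hypothetical helper, named here only for illustration.)
sub residue_counts {
    my ($seq) = @_;
    my %count;
    $count{$_}++ for split //, $seq;
    return \%count;
}

my $str1 = 'MALSSTAATTSSKLKLSNPPSLSHTFTASASASVSNSTSFR';

# An empty replacement list with no /d modifier leaves $str1 unchanged
# and returns the number of characters that matched.
my $a_count = ($str1 =~ tr/A//);
my $counts  = residue_counts($str1);

say "A: $a_count, length: ", length($str1);
say "A via hash: $counts->{A}";
```

The tr form is the cheaper choice when only one or two letters matter; the hash is just a convenience for looking at all of them.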

Now to actually pull out the uppercase sequence from your sample input, are you reading lines from a file? Something like this would probably work:

    #!/usr/bin/env perl
    use 5.010;

    for (<>) {
        chomp; # Strip the newline so length counts only the sequence
        if (/^>/) {
            # Header
        } elsif (/^[A-Z]+$/) {
            # Protein
            my $a = tr/A/A/;
            say "A: $a, length: " . length;
        }
    }

Then simply run it with script.pl < protein.txt. Modify the say ... line to taste, or more likely, replace it with the rest of your logic. You can also choose to parse the header if needed, in the # Header section.

You could of course modify this to actually open the file in your script with open instead, if that is more desirable:

    open my $fh, '<', $filename or die "Couldn't open $filename: $!";
    for (<$fh>) {
        # ... same loop body as above ...
    }
use strict; use warnings; omitted for brevity.
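For a fuller picture of the open variant, here is a minimal sketch. The sub name scan_sequences and the sample data are assumptions for illustration; an in-memory filehandle stands in for a real file so the snippet runs on its own. With a real file you would pass a filename to open instead of a reference to a string.

```perl
use strict;
use warnings;
use 5.010;

# Read lines from an already-open filehandle and return one
# [A-count, length] pair per sequence line. Header lines are skipped.
sub scan_sequences {
    my ($fh) = @_;
    my @results;
    while (my $line = <$fh>) {
        chomp $line;                       # drop the newline before measuring
        next if $line =~ /^>/;             # header
        next unless $line =~ /^[A-Z]+$/;   # ignore anything that is not a sequence
        my $a_count = ($line =~ tr/A//);
        push @results, [ $a_count, length $line ];
    }
    return @results;
}

# Hypothetical sample data. For a real file use something like:
#   open my $fh, '<', $filename or die "Couldn't open $filename: $!";
my $sample = ">seq1 demo header\nMALSSTAATT\n>seq2 demo header\nAAKL\n";
open my $fh, '<', \$sample or die "Couldn't open in-memory file: $!";
my @results = scan_sequences($fh);
close $fh;

say "A: $_->[0], length: $_->[1]" for @results;
```

Passing the handle into a sub keeps the counting logic the same whether the data comes from STDIN, a named file, or a string.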

Replies are listed 'Best First'.
Re^2: How to count the length of a sequence of alphabets and number of occurence of a particular alphabet in the sequence?
by davi54 (Sexton) on Oct 07, 2019 at 18:08 UTC
    Hey, this worked perfect!! I have a follow-up question. Is there a way to get the output in a file instead of terminal? And is there a way to bin the outputs for different string lengths?
    Thanks a ton.

      These are indeed basic questions, as pointed out by stevieb. There are several ways to output to a file. Your operating system itself can probably do it with output redirection, or, from within Perl, the open function can write to files as well as read them. The documentation for each has more information and examples.
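      Both routes can be sketched briefly. From the shell, redirection needs no code change at all, e.g. script.pl < protein.txt > counts.txt (counts.txt is just an example name). From within Perl, a rough sketch might look like the following; the sub name write_report and the sample results are assumptions, and a temporary file stands in for whatever output path you actually want.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Print one "A: n, length: m" line per result to the given filehandle.
# (Hypothetical helper, named here only for illustration.)
sub write_report {
    my ($out, @results) = @_;
    print {$out} "A: $_->[0], length: $_->[1]\n" for @results;
    return;
}

# Made-up results; in the real script these come from the main loop.
my @results = ([3, 10], [2, 4]);

# A temporary file stands in for the real output path.
my (undef, $outfile) = tempfile(UNLINK => 1);

# '>' opens for writing and truncates; '>>' would append instead.
open my $out, '>', $outfile or die "Couldn't open $outfile for writing: $!";
write_report($out, @results);
close $out or die "Couldn't close $outfile: $!";
```

      Checking the return value of close matters more for output than for input, since that is where a full disk or similar write error tends to show up.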

      To put the outputs into bins by string length, a hash is an excellent fit. I would use a hash of array refs. The perldata page is another great documentation resource that will introduce you to the required concepts. The basic algorithm in your case is to do everything you're already doing, but instead of displaying the "A" count and length with say, store them in a hash: the hash key is the length, and the hash value is an array ref holding the A counts seen at that length.

      Untested:

          my $len = length;
          $bins{$len} //= [ ];        # Set to blank array ref if not already set.
          push @{$bins{$len}}, $a;    # Add $a to the array

      When it's all over, %bins (which you will of course need to declare before the main loop) has your A counts, grouped by length. See sort for an explanation of how to sort your data:

          for my $len (sort { $a <=> $b } keys %bins) {
              say "Length $len:";
              say for @{$bins{$len}};
          }

      Some assembly and individual research required. :-)
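      For reference, one way the pieces might fit together, as a rough sketch only: the sub name bin_by_length and the short sample sequences are invented for illustration. It relies on autovivification, so the explicit //= [ ] step is not strictly needed, and it names the count $a_count so it cannot clash with the $a used by sort.

```perl
use strict;
use warnings;
use 5.010;

# Group A-counts by sequence length: length => [ A-count, A-count, ... ]
# (Hypothetical helper, named here only for illustration.)
sub bin_by_length {
    my (@seqs) = @_;
    my %bins;
    for my $seq (@seqs) {
        my $a_count = ($seq =~ tr/A//);
        # push autovivifies the array ref the first time a length is seen.
        push @{ $bins{ length($seq) } }, $a_count;
    }
    return \%bins;
}

# Made-up sequences, just to exercise the code.
my $bins = bin_by_length(qw(MALS AAKL MA ATTA));

# Numeric sort, so that length 10 would come after length 4 rather than before.
for my $len (sort { $a <=> $b } keys %$bins) {
    say "Length $len: @{ $bins->{$len} }";
}
```

      If you want ranges (say, lengths 1-100, 101-200) rather than exact lengths, the same structure works with a computed key such as int(($len - 1) / 100) in place of the raw length.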

      use strict; use warnings; omitted for brevity.

      Welcome davi54!

      This site is about getting help for coding problems. Opening files, which you just asked about, is one of the most trivial and widely used pieces of functionality, and one that any Perl programmer must learn early. Hint: once you get your file handle opened successfully, supply it to the print function before the information you want to print to it: print $fh "what I want to print to file\n";.

      Please do some of your own homework, and give this part a try instead of having others write all of your code for you.

      See open.

      As for the latter question: when you say "bin", do you mean trash bin? I don't quite understand what "bin the outputs" means.

        Hello, I am really sorry for not being specific. I should have displayed what I was trying. So here it is:
            #!/usr/bin/env perl
            use 5.010;

            for (<>) {
                if (/^>/) {
                    # Header
                } elsif (/^[A-Z]+$/) {
                    # Protein
                    my $a = tr/A/A/;
                    say "A: $a, length: " . length;
                }
            }

        There are two issues I am facing right now. First, some of the sequence entries in the input file are long and are continued on the next line (see below for example). But this script reads only the first line (before moving on to the second entry) due to which I'm getting wrong values for the length and number of 'A's that I want. Is there a way to fix this?

        Example sequence:
            >sp|P76347|YEEJ_ECOLI Uncharacterized protein YeeJ OS=Escherichia coli (strain K12) OX=83333 GN=yeeJ PE=3 SV=3
            MATKKRSGEEINDRQILCGMGIKLRRLTAGICLITQLAFPMAAAAQGVVNAATQQPVPAQ
            IAIANANTVPYTLGALESAQSVAERFGISVAELRKLNQFRTFARGFDNVRQGDELDVPAQ
            VSEKKLTPPPGNSSDNLEQQIASTSQQIGSLLAEDMNSEQAANMARGWASSQASGAMTDW
            LSRFGTARITLGVDEDFSLKNSQFDFLHPWYETPDNLFFSQHTLHRTDERTQINNGLGWR
            HFTPTWMSGINFFFDHDLSRYHSRAGIGAEYWRDYLKLSSNGYLRLTNWRSAPELDNDYE
            ARPANGWDVRAESWLPAWPHLGGKLVYEQYYGDEVA
        Second, this script prints its output to the terminal. I want it to write the output to a file instead. How and where do I declare the output file details?
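        One possible approach to the wrapped-sequence problem, offered as a sketch rather than a definitive fix: treat each header as the start of a new record and append every following sequence line to it until the next header appears. The sub name read_records and the sample data are assumptions, and an in-memory filehandle is used so the snippet runs on its own.

```perl
use strict;
use warnings;
use 5.010;

# Read FASTA-style records from a filehandle, joining wrapped sequence
# lines into one string. Returns a list of [header, sequence] pairs.
# (Hypothetical helper, named here only for illustration.)
sub read_records {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>/) {
            push @records, [ $line, '' ];     # a header starts a new record
        }
        elsif (@records && $line =~ /^[A-Z]+$/) {
            $records[-1][1] .= $line;         # continuation of the current sequence
        }
    }
    return @records;
}

# Made-up input with one wrapped sequence and one single-line sequence.
my $sample = <<'END';
>demo1 hypothetical header
MATKKRSGEE
INDRQA
>demo2 hypothetical header
AAKL
END

open my $in, '<', \$sample or die "Couldn't open in-memory file: $!";
my @records = read_records($in);
close $in;

for my $rec (@records) {
    my ($header, $seq) = @$rec;
    my $a_count = ($seq =~ tr/A//);
    say "A: $a_count, length: ", length($seq);
}
```

        Counting only after the whole record has been assembled is what gives the right totals for entries that span several lines.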
