Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Re^2: Count the sequence length of each entry in the file

by davi54 (Sexton)
on Oct 01, 2020 at 20:19 UTC ( [id://11122436]=note: print w/replies, xml ) Need Help??


in reply to Re: Count the sequence length of each entry in the file
in thread Count the sequence length of each entry in the file

Hi, actually, my input file has multiple entries, where each entry looks like:

>sp_0005_SySynthetic ConstructTumor protein p53 N-terminal transcription-activation domain VQLQESGGGLVQAGGSLRLSCAASGRAVSMYNMGWFRQAPGQERELVAAISRGGSIYYA DSVKGRFTISRDNAKNTLYLQMNNLKPEDTGVYQCRQGSTLGQGTQVTVSS

>sp_0017_CaCamelidSorghum bicolor multidrug and toxic compound extrusion sbmate HVQLVESGGGSVQAGGSLRLTCAASGFTFSNYYMSWVRQAPGKGLEWVSSIYSVGSNGYY ADSVKGRSTISRDNAKNTLYLQMNSLKPEDTAVYYCAAEPGGSWWDAYSYWGQGTQVTVS S

Replies are listed 'Best First'.
Re^3: Count the sequence length of each entry in the file
by GrandFather (Saint) on Oct 02, 2020 at 01:19 UTC

    So updating the sample with the example data gives:

    #!/usr/bin/perl use strict; use warnings; $/ = ''; # Set paragraph mode my @count; my %absent; my $name; while ( my $para = <DATA> ) { # Remove fasta header line if ( $para =~ s/^>(.*)//m ){ $name = $1; }; # Remove comment line(s) $para =~ s/^\s*#.*//mg; my %prot; $para =~ s/([ACDEFGHIKLMNPQRSTVWY])/ ++$prot{ $1 } /eg; my $len = length($para); my $num = scalar keys %prot; push @count,[$num,$name]; printf "Counted %d for %s ..\n",$num,substr($name,0,50); print "$name\n"; print join( ' ', map "$_=$prot{$_}", sort keys %prot ), "\n"; printf "Amino acid alphabet = %d\n\n",$num ; print "Sequence length = $len\n"; # count absent for ('A'..'Z'){ ++$absent{$_} unless exists $prot{$_}; }; }; # sort names by count in ascending order to get lowest my @sorted = sort { $a->[0] <=> $b->[0] } @count; my $lowest = $sorted[0]->[0]; # maybe more than 1 lowest printf "Least number of amino acids is %d in these entries\n",$lowest; my @lowest = grep { $_->[0] == $lowest } @sorted; print "$_->[1]\n" for @lowest; # show all results print "\nAll results in ascending count\n"; for (@sorted){ printf "%d %s\n", @$_; }; print "\nExclusion of various amino acids is as follows\n"; for (sort keys %absent){ printf "%s=%d\n",$_,$absent{$_}; }; __DATA__ >sp_0005_SySynthetic ConstructTumor protein p53 N-terminal transcripti +on-activation domain VQLQESGGGLVQAGGSLRLSCAASGRAVSMYNMGWFRQAPGQERELVAAISRGGSIYYA DSVKGRFTISRDNAKNTLYLQMNNLKPEDTGVYQCRQGSTLGQGTQVTVSS >sp_0017_CaCamelidSorghum bicolor multidrug and toxic compound extrusi +on sbmate HVQLVESGGGSVQAGGSLRLTCAASGFTFSNYYMSWVRQAPGKGLEWVSSIYSVGSNGYY ADSVKGRSTISRDNAKNTLYLQMNSLKPEDTAVYYCAAEPGGSWWDAYSYWGQGTQVTVS S

    which prints:

    Counted 19 for sp_0005_SySynthetic ConstructTumor protein p53 N-t .. sp_0005_SySynthetic ConstructTumor protein p53 N-terminal transcriptio +n-activation domain A=9 C=2 D=3 E=4 F=2 G=15 I=3 K=3 L=9 M=3 N=5 P=2 Q=10 R=8 S=12 T=6 V=8 + W=1 Y=5 Amino acid alphabet = 19 Sequence length = 124 Counted 20 for sp_0017_CaCamelidSorghum bicolor multidrug and tox .. sp_0017_CaCamelidSorghum bicolor multidrug and toxic compound extrusio +n sbmate A=10 C=2 D=4 E=4 F=2 G=15 H=1 I=2 K=4 L=7 M=2 N=5 P=3 Q=6 R=4 S=18 T=7 + V=10 W=5 Y=10 Amino acid alphabet = 20 Sequence length = 143 Least number of amino acids is 19 in these entries sp_0005_SySynthetic ConstructTumor protein p53 N-terminal transcriptio +n-activation domain All results in ascending count 19 sp_0005_SySynthetic ConstructTumor protein p53 N-terminal transcri +ption-activation domain 20 sp_0017_CaCamelidSorghum bicolor multidrug and toxic compound extr +usion sbmate Exclusion of various amino acids is as follows B=2 H=1 J=2 O=2 U=2 X=2 Z=2

    including Sequence length = 124 and Sequence length = 143. Is there a problem with your input data such as line endings that aren't as you expect?

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
      It is giving me the output as you mentioned. But the count is incorrect. So, if you look at the first entry, after removing the header, the sequence length (alphabets in uppercase) is 111, whereas the script gives me 115 as the output for sequence length. I don't know what I'm doing wrong, why is the script returning me wrong value?
Re^3: Count the sequence length of each entry in the file (updated)
by AnomalousMonk (Archbishop) on Oct 03, 2020 at 07:28 UTC
    Actually, the input file is ... [update: quoted from this post]

    Rather than saying (update: in a different post!) what your example data actually is, why not post that data as it actually is, in a form immediately usable by other monks (who are trying to help you). That is, use <code> ... </code> tags to post your data, maybe something like:

    <code> >sp_0005_SySynthetic ConstructTumor protein p53 N-terminal transcripti +on-activation domain VQLQESGGGLVQAGGSLRLSCAASGRAVSMYNMGWFRQAPGQERELVAAISRGGSIYYA DSVKGRFTISRDNAKNTLYLQMNNLKPEDTGVYQCRQGSTLGQGTQVTVSS >sp_0017_CaCamelidSorghum bicolor multidrug and toxic compound extrusi +on sbmate HVQLVESGGGSVQAGGSLRLTCAASGFTFSNYYMSWVRQAPGKGLEWVSSIYSVGSNGYY ADSVKGRSTISRDNAKNTLYLQMNSLKPEDTAVYYCAAEPGGSWWDAYSYWGQGTQVTVS S </code>
    In addition to being your exact example data, this data is easily [download]-able by anyone who might want to help with this problem. You might want to add this <code>-tagged example data as an update (see How do I change/delete my post?) to your original post in addition to finally updating this post, or at the very least update this post and add to the OP a note saying that example data can be found there.

    Please see Markup in the Monastery, Writeup Formatting Tips and What shortcuts can I use for linking to other information?, and perhaps some of the articles in the Understanding and Using PerlMonks section of the Monastery's Tutorials.

    Update: Some trivial wording changes; added links to actual "Actually..." post.


    Give a man a fish:  <%-{-{-{-<

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11122436]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having an uproarious good time at the Monastery: (5)
As of 2024-03-28 20:43 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found