Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine

Size of sequences in fastafile

by Sofie (Acolyte)
on Feb 29, 2020 at 13:07 UTC ( #11113575=perlquestion: print w/replies, xml ) Need Help??

Sofie has asked for the wisdom of the Perl Monks concerning the following question:

Hi I have a fasta file containing a number of sequences. I would like to extract only the sequences that are of a specific length (lets's say >10 nt). The sequences are wrapped with new lines, so when trying to read into an array, each line becomes an element. How can I removed these? I have tried join, but it also removes the newline between sequences and puts all of it into single string. How do I separate each sequence into an element or string?

The file looks something like this:

Output should be only the second sequence in this case.
#!/usr/bin/perl -w #open the fastfile Genes.fasta open (GENES, "Genes.fasta") or die "Could not open file"; chomp (@seq = <GENES>); $seq = join ("\n", @seq); $lengthseq = length $seq; #min length of the seq $minlength = 10; #if length is over a certain size, print if ($lengthseq > $minlength){ print $seq; } else {print "No sequence is over $minlength;" }

Replies are listed 'Best First'.
Re: Size of sequences in fastafile
by zubenel0 (Sexton) on Mar 01, 2020 at 07:18 UTC
    Tasks like this can be solved by using BioPerl.
    use strict; use warnings; use Bio::SeqIO; # Reading the fasta file my $seqio_obj = Bio::SeqIO->new(-file => "Genes.fasta", -format => "fasta" ); # Looping through sequences and printing them unless length <= 10 + while ( my $seq_obj = $seqio_obj->next_seq ) { print $seq_obj->seq,"\n" unless $seq_obj->length <= 10; } # The output should be : "GGAGGTCTTTAGCTTTAGGGAAACCC".
Re: Size of sequences in fastafile
by hippo (Bishop) on Feb 29, 2020 at 14:08 UTC

    It is difficult to know quite what data structure and form you want from the description (See I know what I mean. Why don't you?). I've assumed you want an array formed with the even lines appended to the odd lines with a space between. Here's an SSCCE to demonstrate one technique for this.

    use strict; use warnings; use Test::More tests => 1; my @seq; my @want = ( '>NM_001 Homo sapiens ADA2 (CECR1) GATCCAA', '>NM_002 Homo sapiens IKBKG GGAGGTCTTTAGCTTTAGGGAAACCC', ); while (<DATA>) { chomp; if (/^>/) { push @seq, "$_ "; } else { $seq[-1] .= $_; } } is_deeply \@seq, \@want; __DATA__ >NM_001 Homo sapiens ADA2 (CECR1) GATCCAA >NM_002 Homo sapiens IKBKG GGAGGTCTTTAGCTTTAGGGAAACCC

    See also How to ask better questions using Test::More and sample data. HTH.

      Thanks for the reply. But the code was not what I was looking for. Basically I just want to figure out the length of each sequence in a multifasta file. The files contains many sequences, each sequence has new lines inside, and I want to know how many nucleotides each sequence is and then extract only the sequences with more than x number of nt. Not sure if that makes sense? How to figure out the length of a sequence from a multifasta sequence file? Thanks!

        It's still not very clear precisely what you expect but here's my final stab at this.

        use strict; use warnings; use Test::More tests => 1; # Present this SSCCE as a test my @seq; # Sequences longer than $min are stored here my @want = ( 'GGAGGTCTTTAGCTTTAGGGAAACCC', ); # These are the sequeneces we expect for the given data set my $min = 15; # Minimum length of any sequence to be considered my $this = ''; while (<DATA>) { chomp; if (/^>/) { push @seq, $this if $min <= length $this; $this = ''; } else { $this .= $_; } } push @seq, $this if $min <= length $this; is_deeply \@seq, \@want; # Check that our algorithm has worked __DATA__ >NM_001 Homo sapiens ADA2 (CECR1) GATCCAA >NM_002 Homo sapiens IKBKG GGAGGTCTTTAGCTTTAGGGAAACCC
Re: Size of sequences in fastafile
by BillKSmith (Monsignor) on Feb 29, 2020 at 20:23 UTC
    I searched CPAN for 'fasta' and found several modules for extracting sequences. You probably can find one that meets your needs.
Re: Size of sequences in fastafile
by kcott (Archbishop) on Mar 02, 2020 at 08:33 UTC

    G'day Sofie,

    I normally handle fasta files by making the record terminator the '>'; note that the first record is a zero-length string which you'll typically want to discard. In your particular case here, you can split that record on a newline: the sequence will be the second element. Look at the code below and then some explanatory notes at the end.

    #!/usr/bin/env perl use strict; use warnings; use constant MIN_LENGTH => 10; my $long_seqs = 0; { local $/ = '>'; while (<DATA>) { next if $. == 1; my $seq = (split /\n/)[1]; next if MIN_LENGTH > length $seq; ++$long_seqs; print $seq, "\n"; } } if (not $long_seqs) { print 'No sequence is over ', MIN_LENGTH-1, "\n"; } __DATA__ >NM_001 Homo sapiens ADA2 (CECR1) GATCCAA >NM_002 Homo sapiens IKBKG GGAGGTCTTTAGCTTTAGGGAAACCC > sequence length = 9 -- should be discarded ABCDEFGHI > sequence length = 10 -- should be kept ABCDEFGHIJ > sequence length = 11 -- should be kept ABCDEFGHIJK

    Sample run:


    Changing the MIN_LENGTH constant to 100:

    $ ./ No sequence is over 99


    • See "perlvar: Variables related to filehandles" for an explanation of '$/', '$.', and why I've put the local and while in an anonymous block.
    • See "local" for more about that function; that page has links to further information.
    • Purely for demonstration purposes, I've put the data under __DATA__ and read it with the DATA filehandle (see "perldata: Special Literals" for more about those). Your code will have something more like '(<$fh>)'.
    • See "open" for a better way to open your files: using lexical filehandles and the 3-argument form of that command.
    • Your error reporting leaves much to be desired: which file couldn't open? why couldn't it open? It might be fine for a tiny test script like this; however, that inevitably gets expanded and built upon — then you can end up with bugs that are hard to track down (exacerbated by the use of package variables for filehandles). Take a look at the "autodie" pragma: with this you don't have to (remember to) write 'or die "..."' (no point doing extra work when Perl can do it for you) and you also get much better reporting of which file had what problem.
    • Note that I extended your test data — yes, I know those aren't valid sequences but they were handier for quickly checking results than random strings of nucleotides. Off-by-one errors are extremely common and very easy to make: you actually have one in the code you posted ("... No sequence is over ...") — compare with my code to spot the problem. In this case, using test (pseudo-)sequences that are 'MIN_LENGTH-1', 'MIN_LENGTH' and 'MIN_LENGTH+1', you can quickly catch problems like that.
    • Don't use the '-w' switch: see "perlrun: -w" for the reason. Always use the strict and warnings pragmata, as I've shown in my code: see "perlintro: Safety net" for the reason.

    — Ken

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11113575]
Approved by herveus
Front-paged by Corion
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (5)
As of 2023-05-31 11:07 GMT
Find Nodes?
    Voting Booth?

    No recent polls found