PerlMonks  

Bioinformatics: Slow Parsing of a Fasta File

by Anonymous Monk
on Jul 27, 2010 at 18:20 UTC
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am encountering a very irritating situation with a group of DNA strings saved in a special file format with the FASTA extension. While this format is extremely simple and its files can readily be opened in ol' Notepad, I still find that my program takes a very long time to complete.

What I am trying to do is open the FASTA file that holds these DNA sequences, read the descriptor from each sequence's header, and then use that descriptor as a new filename and write each sequence out to its own file accordingly, in preparation for further analyses (N.B. those can be taken care of in due time).

# An example of my data file 'Test.Fasta', which has 3 records starting at '>'
>gi|62750809
TGAGCATGGGAGATCTTTCCATCTTCTGAGGTCTTCTTCAATTTCTTTCCTCAGTGTCTTGAAGTTCTTA
TTGTACAGATCTTTTACTTGCTTGGTTAATGTCACACCGAGGTATTTTATATTATTTGGGTCTATTATGA
>gi|151301097
TTTCTGGCTCCGCGGGCAGCGGGGCCGTGGCGCTCGGACGGTCTGGGATTCGGGCGCCGCCGCGGAACCG
GAATAAGAAGGGAGAGCGCCCGGCTCGGTCCTCGGTCTCCACCGCGGCCCGGAAGGAATCCGGGCAGCCT
>gi|25266387
TGTGTATGTATATTAATTACATTCATATGTATTCACAACACCTGCCTCAAATCAGGCAGAATGGTCCAGG
ATGGAATTAGGGGCAAGCATGAGGTCTTCAGGCTTACTGATTTCTAAGACACAGTAACTTCACTGGTTAG
Of course, each one of these records is so big that it spans many lines, and there are so many records that the entire file is >200,000 KB (roughly 200 MB). I wrote the following program with BioPerl, and just getting it to print the descriptors after '>' takes more than half an hour; even with flushing turned on while writing to a file, the program again takes forever. So, monks, tell me: what could be a potential cause of this, and what measures do you implement when dealing with huge files? I am running on an HP with 2 GB of RAM and a dual-core processor...

Here's my code (scaled down to represent the problem)

#!/usr/local/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

my $file = "Test.fasta";
my $in = Bio::SeqIO->new(
    -file   => "<$file",
    -format => 'fasta',
    -flush  => 0,
);
while ( my $seq = $in->next_seq ) {
    my $desc = $seq->desc;
    print $desc, "\n";
}

Re: Bioinformatics: Slow Parsing of a Fasta File
by almut (Canon) on Jul 27, 2010 at 19:25 UTC

    I've never used any BioPerl modules myself (so I cannot say anything about their performance in general), but half an hour for reading ~200MB sounds extremely long indeed.  That makes me wonder if the file is maybe not stored locally, i.e. you're doing I/O with some remote share...

    Could you check if running something functionally similar, but without Bio::SeqIO — in its most simple case

    while (<>) { print if /^>/; }

    (or  perl -ne "print if /^>/" Test.fasta  as a one-liner)  takes similarly long?

      The code perl -ne "print if /^>/" test.fasta runs quite quickly.

      I am reading this file from the same folder the program is in... While I didn't want to resort to a regex to extract the sequences and their header information, I believe this is going to be my only way through!
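      A plain-Perl splitter along those lines can be sketched as follows. This is only a sketch, not code from the thread: the split_fasta name, the choice to key filenames on the first whitespace-free token of the header, and the '|'-to-'_' munging (since '|' is awkward in filenames) are all assumptions for illustration.

```perl
#!/usr/bin/perl
# Sketch: split a multi-record FASTA file into one file per record,
# naming each file after the first token of its '>' header line.
use strict;
use warnings;

sub split_fasta {
    my ($path) = @_;
    open my $in, '<', $path or die "$path: $!";
    my $out;
    while ( my $line = <$in> ) {
        if ( $line =~ /^>(\S+)/ ) {        # header line: start a new output file
            ( my $name = $1 ) =~ tr/|/_/;  # '|' is not filename-friendly
            open $out, '>', "$name.fasta" or die "$name.fasta: $!";
        }
        print {$out} $line if $out;        # header and sequence lines alike
    }
}

split_fasta( $ARGV[0] ) if @ARGV;
```

      Since it reads line by line and holds only one line in memory at a time, a 200 MB file poses no memory problem.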

Re: Bioinformatics: Slow Parsing of a Fasta File
by erix (Vicar) on Jul 27, 2010 at 21:43 UTC

    Bioperl has considerable overhead, but your numbers are extreme. Maybe you're running an old BioPerl? ( perl -MBio::SeqIO -e 'print $Bio::SeqIO::VERSION, "\n";' )

    I copied your sequences to a 200+ MB file, and tested with and without bioperl, and with a desktop vs server. The slowest I got was still within 5 minutes. The faster machine ran your program in just over 1 minute, the straight perl version in 2s.

    (reader.pl is basically your code above).

    ##############
    #
    # 5-year old desktop machine
    #
    $ ls -lh Test.fasta
    -rw-r--r-- 1 aardvark aardvark 234M Jul 27 23:02 Test.fasta
    $ time perl -ne "print if /^>/" Test.fasta | wc -l
    1572864

    real    0m5.709s
    user    0m4.279s
    sys     0m1.233s
    $ time perl ./reader.pl | wc -l
    1572864

    real    4m26.494s
    user    4m8.936s
    sys     0m1.607s

    ##############
    #
    # server
    #
    $ ls -lh Test.fasta
    -rw-rw-r-- 1 aardvark aardvark 234M Jul 27 22:51 Test.fasta
    $ time perl -ne "print if /^>/" Test.fasta | wc -l
    1572864

    real    0m1.774s
    user    0m1.607s
    sys     0m0.214s
    $ time perl ./reader.pl | wc -l
    1572864

    real    1m9.258s
    user    1m9.012s
    sys     0m0.229s
      The version that I have is 1.006001 -- how old is that? And if it's rusting old, what is the best way to update it?
      I am testing right on the laptop itself, so there's no network hindrance of any sort.
Re: Bioinformatics: Slow Parsing of a Fasta File
by BrowserUk (Pope) on Jul 27, 2010 at 22:10 UTC

    Be aware. Bio::SeqIO is ludicrously slow! And has been known to be so for a long time.

    That's the trouble with O'Woe frameworks. Everything gets buried so deep in a dark, twisty mess of unnecessary subclasses and overzealous overrides, that even when the limitations are obvious and horribly detrimental, no one can see their way through to correcting the problem.

    By way of contrast, run against a 200MB, 1,058,202 140-char sequence fasta file, the following runs in just 11 seconds:

    #! perl -slw
    use strict;

    local $/ = '>';
    (undef) = scalar <>;

    my $start = time;
    while( my $record = <> ) {
        my @lines = split "\n", $record;
        pop @lines if $lines[-1] eq '>';
        my $desc = shift @lines;
        my $seq  = join '', @lines;
        print $desc;
    }
    printf STDERR "Took %d seconds\n", time() - $start;
    __END__
    c:\test>fasta test.fasta >nul
    Took 11 seconds

    c:\test>dir test.fasta
    27/07/2010  22:40       201,116,583 test.fasta

    c:\test>tail test.fasta
    CGCGCCTCAGCGGGGGAGGTCCGTATGACCCCGTCCATTGATTCGAACTGCCTAGTCCCCTGGATGACAA
    >seq1058200: Some other descriptive text here
    CAGGGCGGTTCATTCGCGGACCTATGGCATCCTGGCACTCAACCGGGACTGCGACCAACAATTTTGTCAA
    CGCGCCTCAGCGGGGGAGGTCCGTATGACCCCGTCCATTGATTCGAACTGCCTAGTCCCCTGGATGACAA
    >seq1058201: Some other descriptive text here
    CAGGGCGGTTCATTCGCGGACCTATGGCATCCTGGCACTCAACCGGGACTGCGACCAACAATTTTGTCAA
    CGCGCCTCAGCGGGGGAGGTCCGTATGACCCCGTCCATTGATTCGAACTGCCTAGTCCCCTGGATGACAA
    >seq1058202: Some other descriptive text here
    CAGGGCGGTTCATTCGCGGACCTATGGCATCCTGGCACTCAACCGGGACTGCGACCAACAATTTTGTCAA
    CGCGCCTCAGCGGGGGAGGTCCGTATGACCCCGTCCATTGATTCGAACTGCCTAGTCCCCTGGATGACAA

    As your sequences are much larger, I ran it against the 163MB, 929 sequence (ave:175k/seq) na_clones.dros.RELEASE2.5 file I have kicking around. It took a whole 7 seconds:

    c:\test>fasta \dell\test\fasta\na_clones.dros.RELEASE2.5
    BACH50G05 : AC011761, 108350 bases, from X:19.
    BACH57F14 : AC018478, 103809 bases, from 4:101.
    BACH59K20 : AC010840, 29516 bases, from 4:101.
    BACN19N21 : AC010839, 91789 bases, from 4:101.
    ...
    BACR48O22 : AC104149, 193714 bases, from X:01.
    BACR48O23 : AC009888, 168719 bases, from 3R:99.
    BACR48O24 : AC023722, 191590 bases, from X:01.
    BACR48P17 : AC012165, 176195 bases, from X:18.
    BACR49A05 : AC008194, 181438 bases, from X:18.
    Took 7 seconds

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      What does (undef) = scalar <>; do?

      It is pretty clear that this program can be a kick-start, though I wanted to extract $seq->id and $seq->desc and then work on them a little to create a filename for files that will each contain one of these sequences.

      Do you believe that the sequence length can have a performance-compromising effect on the way Bio::SeqIO does its job?

      While not wanting to minimize the potential for the Out of Memory! error, I am still thinking of using a hash whose keys are $seq->id and whose values are the sequence data itself, and then dumping each one of these into its corresponding folder.

        Throws away the first line read from magic null filehandle
        What does (undef) = scalar <>; do?

        With $/ = '>' set, the first read will return just the very first '>' in the file--i.e. the first character of the first line--which isn't useful, so the above simply discards it.
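        A tiny self-contained demonstration of that behaviour (the in-memory filehandle is an illustration device, not part of the original code):

```perl
# Demonstrate what reading with $/ = '>' actually returns: the first
# "record" is just the leading '>' itself, which is why one read is
# discarded before the main loop.
use strict;
use warnings;

my $fasta = ">id1 first\nACGT\n>id2 second\nTTTT\n";
open my $fh, '<', \$fasta or die $!;   # in-memory filehandle for the demo

local $/ = '>';
my @records = <$fh>;
# @records now holds: '>', "id1 first\nACGT\n>", "id2 second\nTTTT\n"
```

        Every subsequent chunk ends with the '>' that begins the next record, which is why the parser pops a trailing '>' line off each record.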

        It is pretty clear that this program can be a kick-start though I wanted to extract the $seq->id and $seq->desc and then work on them a little bit to create a filename for files that each will contain one of these sequences

        It's all there available for whatever you want to do.

        This, which has a couple of minor changes from the code I benchmarked above, might fulfill your requirements. Though the filenames might be iffy, depending upon what's in the descriptions:

        #! perl -slw
        use strict;

        local $/ = '>';
        (undef) = scalar <>;

        my $start = time;
        while( my $record = <> ) {
            my @lines = split "\n", $record;
            pop @lines if $lines[-1] eq '>';
            my $desc = shift @lines;        ## This is the description
            my $seq  = join "\n", @lines;   ## This is the sequence.

            open my $out, '>', "$desc.fasta"
                or warn "$desc.fasta : $!" and next;
            print $out ">$desc\n", $seq;
        }
        printf STDERR "Took %d seconds\n", time() - $start;
        Do you believe that the sequence length can have a performance compromising effect on the the way the Bio::SeqIO does its job?

        Honestly, I could never work it out. The whole thing is so overcomplicated--from memory it inherits from three (mostly unrelated) base classes, and then returns an object handle from a fourth class that might be any of a dozen other classes--that it is nigh impossible to trace statically. The only way to know what code is actually invoked would be to trace it through at runtime. No wonder no one dares try to fix it.

        My best guess is that the problems stem from two sources:

        • Every method call traverses half-a-dozen super-classes that do nothing but laboriously and redundantly check and re-check the same parameter values at each level on the way in; and do the same thing for the return values on the way out.
        • I don't know for sure, as I never managed to get it to install here so I could trace it through at runtime, but the symptoms of the problems that I read are consistent with it creating and retaining (possibly multiple) copies of every sequence in memory.

          The code above only ever has one description and one sequence in memory at a time, so memory usage will never be a problem.

          Unless you have a single sequence that is bigger than your virtual memory, in which case you'd be stuffed anyway.

        While not wanting to minimize the potential for the Out of Memory! error, I am still thinking of using a hash whose keys are $seq->id and whose values are the sequence data itself, and then dumping each one of these into its corresponding folder.

        Presumably the "not" above is a typo :)

        If all you want is to split the file into lots of smaller files, there is no need to store everything in memory before writing it out again. And by doing so, you simply create a problem for the future when your next FASTA file is the full 3GB of the HG.

        For those occasions when you might want to revisit earlier sequences; or correlate between sequences; or process the sequences in some order other than that in which they appear in the file; then I have a simple tied hash implementation that retains just the offset/length pairs of the sequences read, so that it can quickly re-read individual sequences on demand without filling memory.
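        The offset/length idea can be sketched without the tie, as a plain hash of [offset, length] pairs plus seek/read. This is a sketch, not the tied implementation mentioned above: the index_fasta and fetch_record names are made up here, and the byte arithmetic assumes no CRLF translation layer is rewriting line endings.

```perl
# Sketch: scan a FASTA file once, recording where each record starts and
# how many bytes it spans, then fetch any record on demand via seek,
# without ever holding all sequences in memory.
use strict;
use warnings;

sub index_fasta {
    my ($fh) = @_;
    my ( %index, $id );
    while ( my $line = <$fh> ) {
        next unless $line =~ /^>(\S+)/;
        my $here = tell($fh) - length $line;                     # offset of this header
        $index{$id}[1] = $here - $index{$id}[0] if defined $id;  # close previous record
        $id = $1;
        $index{$id} = [$here];
    }
    $index{$id}[1] = tell($fh) - $index{$id}[0] if defined $id;  # close last record
    return \%index;
}

sub fetch_record {
    my ( $fh, $index, $id ) = @_;
    my ( $start, $len ) = @{ $index->{$id} };
    seek $fh, $start, 0 or die "seek: $!";
    read $fh, my $record, $len;
    return $record;
}
```

        Only the index lives in memory, so even a multi-gigabyte file costs just a few dozen bytes per record.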


      ">" characters are allowed in fasta description header lines. If there are ">" in the description, they will cause errors for the fasta entry.

        I admit, I didn't think that that was legal. But anyway, the fix is quite trivial and has no effect on performance:

        #! perl -slw
        use strict;
        use Data::Dump qw[ pp ];

        my %sequences;

        local $/ = '>';
        (undef) = scalar <DATA>;    ## Discard first delimiter

        local $/ = "\n>";
        while( my $record = <DATA> ) {
            my @lines = split "\n", $record;
            pop @lines if $lines[-1] eq '>';
            my $id = shift @lines;
            $sequences{ $id } = join '', @lines;
        }

        pp \%sequences;

        __DATA__
        >uc002yje.1 > chr21:13973492-13974491
        cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga
        acgcgccctacactctggcatgggggaacccggccccgcagagccctgga
        CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG
        >uc002yje.2 > chr21:13974492-13975432
        cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga
        acgcgccctacactctggcatgggggaaaaaacccggccccgcagagccctgga
        CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG
        >uc002yje.3 > chr21:13975431-13976330
        cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga
        acgcgccctacactctggcatgggggaacccggccccgcagagggccctgga
        CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG

        Produces:

        C:\test>fasta
        {
          "uc002yje.1 > chr21:13973492-13974491" => "cccctgccccaccgcaccctggattactgcacgccaagaccctcacctgaacgcgccctacactctggcatgggggaacccggccccgcagagccctggaCTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG",
          "uc002yje.2 > chr21:13974492-13975432" => "cccctgccccaccgcaccctggattactgcacgccaagaccctcacctgaacgcgccctacactctggcatgggggaaaaaacccggccccgcagagccctggaCTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG",
          "uc002yje.3 > chr21:13975431-13976330" => "cccctgccccaccgcaccctggattactgcacgccaagaccctcacctgaacgcgccctacactctggcatgggggaacccggccccgcagagggccctggaCTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG",
        }


        Link above now corrected. Instead of 604932 I apparently typed 64932.



Node Type: perlquestion [id://851580]
Approved by AnomalousMonk
Front-paged by biohisham