What does (undef) = scalar <>; do?

With $/ = '>'; set, the first read returns just the very first '>' in the file (i.e. the first character of the first line), which isn't useful, so the statement above simply discards it.
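To see that behaviour in isolation, here is a minimal sketch that reads a made-up two-record sample through an in-memory filehandle. The sample data and the variable names are only assumptions for illustration:

```perl
use strict;
use warnings;

# Hypothetical two-record FASTA sample, read through an in-memory filehandle.
my $fasta = ">id1 first\nACGT\n>id2 second\nGGCC\n";
open my $in, '<', \$fasta or die "Cannot open in-memory file: $!";

my @reads;
{
    local $/ = '>';     # every read now ends at (and includes) the next '>'
    @reads = <$in>;
}
close $in;

# Show each read with its newlines made visible.
for my $i ( 0 .. $#reads ) {
    ( my $shown = $reads[$i] ) =~ s/\n/\\n/g;
    print "$i: '$shown'\n";
}
```

Read 0 is the lone '>' that gets thrown away. Every later read starts with a description line and, except for the last one, ends with the '>' that belongs to the following record.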

It is pretty clear that this program can be a kick-start though I wanted to extract the $seq->id and $seq->desc and then work on them a little bit to create a filename for files that each will contain one of these sequences

It's all there available for whatever you want to do.

This, which has a couple of minor changes from the code I benchmarked above, might fulfill your requirements. Though the filenames might be iffy, depending upon what's in the descriptions:

#! perl -slw
use strict;
use Data::Dumper;

local $/ = '>';

my @sequences;

(undef) = scalar <>;

my $start = time;
while( my $record = <> ) {
    my @lines = split "\n", $record;
    pop @lines if $lines[-1] eq '>';
    my $desc  = shift @lines; ## This is the description
    my $seq = join "\n", @lines; ## This is the sequence.

    open my $out, '>', $desc . '.fasta'
        or warn "$desc.fasta : $!" and next;
    print $out ">$desc\n", $seq;
}
printf STDERR "Took %d seconds\n", time() - $start;
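On the iffy filenames: description lines often contain characters such as '|', '/' or spaces that are awkward or unsafe in a filename. One possible approach, sketched below, is to keep only the first whitespace-delimited token (the id) and replace everything outside a conservative character set. The safe_filename helper and the sample descriptions are hypothetical, and the sketch does not deal with two ids that collapse to the same name:

```perl
use strict;
use warnings;

# Hypothetical helper: derive a conservative filename from a description line.
sub safe_filename {
    my ($desc) = @_;
    my ($id) = $desc =~ /^\s*(\S+)/;    # first token, i.e. the sequence id
    $id = '' unless defined $id;
    $id =~ s/[^A-Za-z0-9._-]+/_/g;      # anything outside the safe set becomes '_'
    $id =~ s/^[_.]+|_+$//g;             # no leading dots/underscores, no trailing '_'
    $id = 'unnamed' if $id eq '';       # fallback for empty descriptions
    return "$id.fasta";
}

# Made-up description lines, for illustration only.
for my $desc ( 'gi|12345|ref|NM_000001.1| Homo sapiens', '../../etc/passwd', '' ) {
    printf "%-40s => %s\n", "'$desc'", safe_filename($desc);
}
```

In the loop above, the open call would then become open my $out, '>', safe_filename($desc), with the full description still written inside the file.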

Do you believe that the sequence length can have a performance-compromising effect on the way Bio::SeqIO does its job?

Honestly, I could never work it out. The whole thing is so overcomplicated--from memory it inherits from three (mostly unrelated) base classes, and then returns an object handle from a fourth class that might be any of a dozen other classes--that it is nigh impossible to trace statically. The only way to know what code is actually invoked would be to trace it through at runtime. No wonder no one dares try to fix it.

My best guess is that the problems stem from two sources:

1. Every method call traverses half-a-dozen superclasses that do nothing but laboriously and redundantly check and re-check the same parameter values at each level on the way in, and then do the same for the return values on the way out.
2. I don't know for sure, as I never managed to get it to install here so I could trace it through at runtime, but the symptoms of the problems I have read about are consistent with it creating and retaining (possibly multiple) copies of every sequence in memory.
   The code above only ever has one description and one sequence in memory at a time, so memory usage will never be a problem.
   Unless you have a single sequence that is bigger than your virtual memory, in which case you'd be stuffed anyway.

While not wanting to minimize the potential for the Out of Memory! error, I am still thinking of using a hash whose keys are the $seq->id values and whose values are the sequence data itself, and then dumping each of these into its corresponding folder.

Presumably the "not" above is a typo :)

If all you want is to split the file into lots of smaller files, there is no need to store everything in memory before writing it out again. And by doing so, you simply create a problem for the future when your next FASTA file is the full 3GB of the HG.

For those occasions when you might want to revisit earlier sequences; or correlate between sequences; or process the sequences in some order other than that in which they appear in the file; then I have a simple tied hash implementation that retains just the offset/length pairs of the sequences read, so that it can quickly re-read individual sequences on demand without filling memory.
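As a rough illustration of the offset/length idea (this is not the implementation mentioned above, just a minimal read-only sketch under my own assumptions), a tied hash can scan the file once, remember where each record starts and how long it is, and seek back to it on demand. The package name FastaIndex, the choice of the first word of the description as the key, and the sample data are all hypothetical:

```perl
use strict;
use warnings;

package FastaIndex;    # hypothetical name, for illustration only

# Scan the file once, keeping only [offset, length] per record,
# keyed by the first word of the description line.
sub TIEHASH {
    my ( $class, $path ) = @_;
    open my $fh, '<', $path or die "$path: $!";
    binmode $fh;                       # byte offsets must match byte lengths
    my %index;
    local $/ = '>';
    (undef) = scalar <$fh>;            # discard the leading '>'
    while (1) {
        my $start = tell $fh;
        defined( my $record = <$fh> ) or last;
        chomp $record;                 # drops the trailing '>' if present
        my ($id) = $record =~ /^(\S+)/ or next;
        $index{$id} = [ $start, length $record ];
    }
    return bless { fh => $fh, index => \%index }, $class;
}

# Re-read one record from disk and return its sequence without line breaks.
sub FETCH {
    my ( $self, $id ) = @_;
    my $entry = $self->{index}{$id} or return;
    my ( $start, $length ) = @$entry;
    seek $self->{fh}, $start, 0 or die "seek failed: $!";
    my $got = read $self->{fh}, my $record, $length;
    die "short read for $id" unless defined $got && $got == $length;
    my ( $desc, $seq ) = split /\n/, $record, 2;
    $seq = '' unless defined $seq;
    $seq =~ tr/\r\n//d;
    return $seq;
}

sub EXISTS   { exists $_[0]{index}{ $_[1] } }
sub FIRSTKEY { my $reset = keys %{ $_[0]{index} }; each %{ $_[0]{index} } }
sub NEXTKEY  { each %{ $_[0]{index} } }

package main;
use File::Temp qw(tempfile);

# Made-up sample file so the sketch runs on its own.
my ( $tmp, $path ) = tempfile( UNLINK => 1 );
print {$tmp} ">seq1 first\nACGT\nTTAA\n>seq2 second\nGGCC\n";
close $tmp;

tie my %fasta, 'FastaIndex', $path;
print "$_ => $fasta{$_}\n" for sort keys %fasta;
```

Only the index lives in memory, so the cost per sequence is a key plus two integers, whatever the sequence length. Storing, deleting and duplicate ids are deliberately left out of this sketch.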

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

RIP an inspiration; A true Folk's Guy

In reply to Re^3: Bioinformatics: Slow Parsing of a Fasta File by BrowserUk
in thread Bioinformatics: Slow Parsing of a Fasta File by Anonymous Monk