PerlMonks
"What does (undef) = scalar <>; do?"
With $/ = '>'; set, the first read will get just the very first '>' in the file--i.e. the first character of the first line--which isn't useful, so the above just discards that.
"It is pretty clear that this program can be a kick-start though I wanted to extract the $seq->id and $seq->desc and then work on them a little bit to create a filename for files that each will contain one of these sequences"
It's all there, available for whatever you want to do. This, which has a couple of minor changes from the code I benchmarked above, might fulfill your requirements. Though the filenames might be iffy, depending upon what's in the descriptions:
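A minimal sketch of that approach (not the benchmarked code itself): read '>'-delimited records, discard the first read, and write each sequence to its own file. The names split_fasta and make_filename, the ".fasta" suffix, and the rule of replacing anything outside A-Z, a-z, 0-9, dot, underscore and hyphen with "_" are assumptions made for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build a file name from the id and description.
sub make_filename {
    my( $id, $desc ) = @_;
    my $name = defined $desc && length $desc ? "$id $desc" : $id;
    $name =~ s/[^A-Za-z0-9._-]+/_/g;    # replace anything risky in a file name
    $name =~ s/^_+|_+$//g;              # trim leading/trailing underscores
    return substr( $name, 0, 100 ) . '.fasta';
}

# Hypothetical helper: write every sequence in $in_path to its own file
# under $out_dir, holding only one record in memory at a time.
sub split_fasta {
    my( $in_path, $out_dir ) = @_;
    open my $in, '<', $in_path or die "Cannot open $in_path: $!";
    local $/ = '>';                     # a record now ends at the next '>'
    ( undef ) = scalar <$in>;           # discard the lone leading '>'
    my $count = 0;
    while( my $record = <$in> ) {
        chomp $record;                  # strips the trailing '>' (current $/)
        my( $header, $seq ) = split /\n/, $record, 2;
        next unless defined $header;
        my( $id, $desc ) = split ' ', $header, 2;
        next unless defined $id;        # skip records with an empty header
        $seq = '' unless defined $seq;
        my $out_path = "$out_dir/" . make_filename( $id, $desc );
        open my $out, '>', $out_path or die "Cannot open $out_path: $!";
        print {$out} ">$header\n$seq";
        close $out or die "Cannot close $out_path: $!";
        ++$count;
    }
    close $in;
    return $count;
}

# usage: script.pl input.fasta output_dir
split_fasta( @ARGV ) if @ARGV == 2;
```

Two caveats with this sketch: sequences whose id and description map to the same file name will overwrite one another, and a '>' inside a header line would be treated as a record boundary.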
"Do you believe that the sequence length can have a performance compromising effect on the way the Bio::SeqIO does its job?"
Honestly, I could never work it out. The whole thing is so overcomplicated--from memory, it inherits from three (mostly unrelated) base classes, and then returns an object handle from a fourth class that might be any of a dozen other classes--that it is nigh impossible to trace statically. The only way to know what code is actually invoked would be to trace it through at runtime. No wonder no one dares try to fix it.
My best guess is that the problems stem from two sources:
"While not wanting to minimize the potential for the Out of Memory! error I still think of using a hash whose keys is $seq->id and whose values are the sequences data itself and then dumping each one of these into its corresponding folder."
Presumably the "not" above is a typo :)
If all you want is to split the file into lots of smaller files, there is no need to store everything in memory before writing it out again. And by doing so, you simply create a problem for the future, when your next FASTA file is the full 3GB of the human genome.
For those occasions when you might want to revisit earlier sequences, correlate between sequences, or process the sequences in some order other than that in which they appear in the file, I have a simple tied hash implementation that retains just the offset/length pairs of the sequences read, so that it can quickly re-read individual sequences on demand without filling memory.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
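The offset/length idea mentioned above can be sketched roughly as follows. This is not the implementation referred to in the post, only an illustration of the technique: the package name FastaIndex is hypothetical, the hash is read-only, and it assumes ids are unique and that '>' appears only at the start of each record.

```perl
package FastaIndex;    # hypothetical name, for illustration only
use strict;
use warnings;

# tie %h, 'FastaIndex', $path: scan the file once, remembering only
# where each record starts and how many bytes long it is.
sub TIEHASH {
    my( $class, $path ) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my %index;
    {
        local $/ = '>';
        ( undef ) = scalar <$fh>;           # skip the lone leading '>'
        while( 1 ) {
            my $start  = tell $fh;          # byte offset of this record
            my $record = <$fh>;
            last unless defined $record;
            chomp $record;                  # drop the trailing '>'
            next unless $record =~ /^(\S+)/;
            $index{ $1 } = [ $start, length $record ];
        }
    }
    return bless { fh => $fh, index => \%index }, $class;
}

# $h{ $id }: seek back to the stored offset and re-read just that record.
sub FETCH {
    my( $self, $id ) = @_;
    my $entry = $self->{index}{ $id } or return;
    my( $start, $len ) = @$entry;
    seek( $self->{fh}, $start, 0 ) or die "seek failed: $!";
    my $got = read( $self->{fh}, my $record, $len );
    die "short read for $id" unless defined $got && $got == $len;
    return $record;                         # header line plus sequence lines
}

sub EXISTS   { exists $_[0]{index}{ $_[1] } }
sub FIRSTKEY { my $reset = keys %{ $_[0]{index} }; each %{ $_[0]{index} } }
sub NEXTKEY  { each %{ $_[0]{index} } }

package main;

if( @ARGV == 2 ) {                          # usage: script.pl file.fasta some_id
    tie my %seqs, 'FastaIndex', $ARGV[0];
    print '>', $seqs{ $ARGV[1] } if exists $seqs{ $ARGV[1] };
}
```

Only the id and two integers are kept per sequence, so memory use grows with the number of sequences rather than with their total length; each lookup costs one seek and one read.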
In reply to Re^3: Bioinformatics: Slow Parsing of a Fasta File
by BrowserUk