Re: Parsing .2bit DNA files
by BrowserUk (Pope) on Mar 06, 2008 at 04:37 UTC
The first thing I noticed is that you are detecting the byte order for the header, but then ignoring it and using the platform-specific 'l' template from then on. That's wrong in two ways:
The slowest bit of the process seems likely to be
You should be able to save a bit of time by building a larger lookup table:
Now you can convert each byte (4 bases) in the packed DNA to its ASCII with a single array lookup rather than 4 hash lookups:
Also, build your lookups once at compile time, not over and over at runtime as now.
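Something along these lines is what I have in mind. It is only a sketch: the T=00, C=01, A=10, G=11 mapping with the first base in the high bits is my assumption about the format, and `@BYTE2DNA` and `unpack_dna` are names I made up for illustration.

```perl
use strict;
use warnings;

# ASSUMPTION: 2 bits per base, T=00 C=01 A=10 G=11, first base in the
# most significant bits of each byte. Check this against the format spec.
my @BASES = qw( T C A G );

# 256-entry table: byte value -> its 4-base string. Built once, up front.
my @BYTE2DNA = map {
    my $byte = $_;
    join '', map { $BASES[ ( $byte >> $_ ) & 3 ] } 6, 4, 2, 0;
} 0 .. 255;

# Hypothetical helper: expand packed bytes, then trim the padding bases
# in the last byte so that exactly $n_bases are returned.
sub unpack_dna {
    my ( $packed, $n_bases ) = @_;
    my $dna = join '', @BYTE2DNA[ unpack 'C*', $packed ];
    return substr $dna, 0, $n_bases;
}

print unpack_dna( "\x1B\x1B", 6 ), "\n";    # TCAGTC
```

The array slice does all the lookups in one go, so there is no explicit per-base loop left in Perl code.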
Of course, you can take that idea a little further and do two bytes at a time:
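A rough sketch of the two-bytes-at-a-time version follows. Same caveats as before: the base encoding is assumed, and `@WORD2DNA` and `unpack_dna16` are hypothetical names. The table is indexed to match the 'v' template, so the low byte of each 16-bit value is the byte that comes first in the stream.

```perl
use strict;
use warnings;

# ASSUMPTION: T=00 C=01 A=10 G=11, first base in the high bits of a byte.
my @BASES    = qw( T C A G );
my @BYTE2DNA = map {
    my $byte = $_;
    join '', map { $BASES[ ( $byte >> $_ ) & 3 ] } 6, 4, 2, 0;
} 0 .. 255;

# 65536-entry table for values read with 'v' (little-endian 16-bit):
# the low byte was first in the stream, so its bases come first.
my @WORD2DNA = map { $BYTE2DNA[ $_ & 0xFF ] . $BYTE2DNA[ $_ >> 8 ] } 0 .. 65535;

# Hypothetical helper: 8 bases per lookup, plus one single-byte lookup
# when the packed string has an odd length.
sub unpack_dna16 {
    my ( $packed, $n_bases ) = @_;
    my $even = length($packed) - length($packed) % 2;
    my $dna  = join '', @WORD2DNA[ unpack 'v*', substr( $packed, 0, $even ) ];
    $dna .= $BYTE2DNA[ ord substr( $packed, -1 ) ] if length($packed) % 2;
    return substr $dna, 0, $n_bases;
}

print unpack_dna16( "\x1B\xE4\x00", 12 ), "\n";    # TCAGGACTTTTT
```

The table costs 65536 eight-character strings, which is a one-off price for halving the number of lookups again.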
That ought to be close to an order of magnitude faster than your current method: one array lookup versus eight hash lookups, and roughly an eighth of the loop overhead.
Of course, the 'v' template is byte-order specific. But if, when you determine the byte order from reading the signature, you store a template of 'N' or 'V' for your 32-bit field processing, then you can just lc that template to obtain the matching unsigned short template ('n' or 'v').
Also, how sure are you of your conversion table? Are you certain that you should be using 'B' and not 'b'?
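For instance, the template choice could be sketched like this. The signature value 0x1A412743 is what I understand the .2bit format description to specify, so treat it as an assumption, and `templates_for` is a hypothetical helper name.

```perl
use strict;
use warnings;

# ASSUMPTION: the .2bit signature is 0x1A412743, written in the byte
# order of the machine that created the file.
use constant SIGNATURE => 0x1A412743;

# Hypothetical helper: given the first four bytes of the file, return the
# 32-bit template ('N' or 'V') and the matching 16-bit one ('n' or 'v').
sub templates_for {
    my ($sig_bytes) = @_;
    die "Need 4 signature bytes\n" unless length($sig_bytes) == 4;
    for my $t32 (qw( N V )) {
        return ( $t32, lc $t32 ) if unpack( $t32, $sig_bytes ) == SIGNATURE;
    }
    die "Not a .2bit file: bad signature\n";
}

# Simulate a little-endian file header.
my ( $t32, $t16 ) = templates_for( pack 'V', SIGNATURE );
print "$t32 $t16\n";    # V v
```

Every later header field can then be read with `$t32` instead of 'l', which also makes the values unsigned.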
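A quick way to see the difference between the two templates, using an arbitrary test byte of my own choosing:

```perl
use strict;
use warnings;

my $byte = chr 0x1B;    # binary 00011011

# 'B' emits bits most significant first; 'b' emits least significant first.
my $msb_first = unpack 'B8', $byte;
my $lsb_first = unpack 'b8', $byte;

print "$msb_first\n";    # 00011011
print "$lsb_first\n";    # 11011000
```

If the bases come out reversed within each group of four, the wrong one of the two is in use.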
I might have tested some of this, but it would take me the best part of two days to download the "sample" .2bit file you linked and a quick google didn't locate any others.
Final thought. If you built an ordered array of offsets, as well as the named index, when processing the toc, you could provide access by position as well as access by name. (Your ordered hash module might work for this also :)
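A minimal sketch of that idea, assuming the toc has already been read into a list of name/offset pairs in file order. The sequence names, offsets, and variable names here are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical toc entries as [ name, offset ] pairs, in file order.
my @toc = ( [ chr1 => 1024 ], [ chr2 => 90_000 ], [ chrX => 250_000 ] );

my ( %offset_by_name, @offset_by_pos, @name_by_pos );
for my $entry (@toc) {
    my ( $name, $offset ) = @$entry;
    $offset_by_name{$name} = $offset;    # access by name
    push @offset_by_pos, $offset;        # access by position
    push @name_by_pos,   $name;          # position -> name
}

print "$offset_by_name{chr2}\n";                    # 90000
print "$name_by_pos[2] at $offset_by_pos[2]\n";     # chrX at 250000
```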
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.