http://www.perlmonks.org?node_id=672356


in reply to Parsing .2bit DNA files

The first thing I noticed is that you are detecting the byte order for the header, but then ignoring it and using the platform specific 'l' template then on. That's wrong in two ways:

  1. Using 'l' template, which is for signed 32-bit numbers could cause problems.

    You should probably at least use 'L'. But I think you ought to be using 'N' or 'V'.

  2. My interpretation on the spec you linked is that the entire file may need byte order reversal?

    If you've already successfully parsed a 2bit file, then you may have just lucked out on using the same platform/byteordering as it was created on.

    My instinct would be to check the sig and save a template char of 'N' or 'V' as appropriate and use that for all future unpacking. Has a knock-on advantage also. See later.

The slowest bit of the process seems likely to be

$dna = join '', map $conv{$_}, unpack('(A2)*', unpack("B" . $bytes * BITS_PER_BYTE , $raw) +);

You should be able to save a bit of time by building a larger lookup table:

my %bitMap = ('00' => 'T', '01' => 'C', '10' => 'A', '11' => 'G'); my @byteMap = map{ join '', map $bitMap{ $_ }, unpack '(A2)4', unpack 'B8', chr } 0 .. 255;

Now you can convert each byte (4 bases) in the packed DNA to it's ascii with a single array lookup rather than 4 hash loopkups:

## Omit the braces add a comma for a negligable further improvement my $DNA = join '', map{ $byteMap[ $_ ] } unpack 'C*', $raw;

Also, build your lookups at compile time not over and over at runtime as now.

Of course, you can take that idea a little further and do two bytes at a time:

my %bitMap = ('00' => 'T', '01' => 'C', '10' => 'A', '11' => 'G'); my @byteMap = map{ join '', map $bitMap{ $_ }, unpack '(A2)4', unpack 'B8', chr } 0 .. 255; my @wordMap = map { $byteMap[ $_ >> 8 ] . $byteMap[ $_ | 0xff ] } 0 .. 65535; ... my $DNA = join '', map{ $wordMap[ $_ ] } unpack 'n*', $raw;

Which ought to be close to an order of magnitude faster than your current method. 1 array lookup -v- 8 hash lookups; 8 times less lower loop overhead.

Of course, the 'v' template will be byte-order specific. But, if when you determine the byte order from reading the signature, you store a template of 'N' or 'V' for your 32-bit field processing, then you can just lc that template to obtain the unsigned short template.

Also, how sure are you of your conversion table? Are you certain that you should be using 'B' and not 'b'.

I might have tested some of this, but it would take me the best part of two days to download the "sample" .2bit file you linked and a quick google didn't locate any others.

Final thought. If you built an ordered array of offsets, as well as the named index, when processing the toc, you could provide access by position as well as access by name. (Your ordered hash module might work for this also :)


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^2: Parsing .2bit DNA files
by Khisanth (Novice) on Mar 07, 2008 at 04:07 UTC
    ftp://hgdownload.cse.ucsc.edu/gbdb/ce4/ce4.2bit is much smaller, 24.6MB