Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery

Re: Parsing .2bit DNA files

by BrowserUk (Patriarch)
on Mar 06, 2008 at 04:37 UTC ( [id://672356]=note: print w/replies, xml ) Need Help??

in reply to Parsing .2bit DNA files

The first thing I noticed is that you are detecting the byte order for the header, but then ignoring it and using the platform specific 'l' template then on. That's wrong in two ways:

  1. Using 'l' template, which is for signed 32-bit numbers could cause problems.

    You should probably at least use 'L'. But I think you ought to be using 'N' or 'V'.

  2. My interpretation on the spec you linked is that the entire file may need byte order reversal?

    If you've already successfully parsed a 2bit file, then you may have just lucked out on using the same platform/byteordering as it was created on.

    My instinct would be to check the sig and save a template char of 'N' or 'V' as appropriate and use that for all future unpacking. Has a knock-on advantage also. See later.

The slowest bit of the process seems likely to be

$dna = join '', map $conv{$_}, unpack('(A2)*', unpack("B" . $bytes * BITS_PER_BYTE , $raw) +);

You should be able to save a bit of time by building a larger lookup table:

my %bitMap = ('00' => 'T', '01' => 'C', '10' => 'A', '11' => 'G'); my @byteMap = map{ join '', map $bitMap{ $_ }, unpack '(A2)4', unpack 'B8', chr } 0 .. 255;

Now you can convert each byte (4 bases) in the packed DNA to it's ascii with a single array lookup rather than 4 hash loopkups:

## Omit the braces add a comma for a negligable further improvement my $DNA = join '', map{ $byteMap[ $_ ] } unpack 'C*', $raw;

Also, build your lookups at compile time not over and over at runtime as now.

Of course, you can take that idea a little further and do two bytes at a time:

my %bitMap = ('00' => 'T', '01' => 'C', '10' => 'A', '11' => 'G'); my @byteMap = map{ join '', map $bitMap{ $_ }, unpack '(A2)4', unpack 'B8', chr } 0 .. 255; my @wordMap = map { $byteMap[ $_ >> 8 ] . $byteMap[ $_ | 0xff ] } 0 .. 65535; ... my $DNA = join '', map{ $wordMap[ $_ ] } unpack 'n*', $raw;

Which ought to be close to an order of magnitude faster than your current method. 1 array lookup -v- 8 hash lookups; 8 times less lower loop overhead.

Of course, the 'v' template will be byte-order specific. But, if when you determine the byte order from reading the signature, you store a template of 'N' or 'V' for your 32-bit field processing, then you can just lc that template to obtain the unsigned short template.

Also, how sure are you of your conversion table? Are you certain that you should be using 'B' and not 'b'.

I might have tested some of this, but it would take me the best part of two days to download the "sample" .2bit file you linked and a quick google didn't locate any others.

Final thought. If you built an ordered array of offsets, as well as the named index, when processing the toc, you could provide access by position as well as access by name. (Your ordered hash module might work for this also :)

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^2: Parsing .2bit DNA files
by Khisanth (Novice) on Mar 07, 2008 at 04:07 UTC is much smaller, 24.6MB

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://672356]
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others admiring the Monastery: (3)
As of 2024-06-18 07:05 GMT
Find Nodes?
    Voting Booth?

    No recent polls found

    erzuuli‥ 🛈The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.