Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW

Comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

The first thing I noticed is that you are detecting the byte order for the header, but then ignoring it and using the platform specific 'l' template then on. That's wrong in two ways:

  1. Using 'l' template, which is for signed 32-bit numbers could cause problems.

    You should probably at least use 'L'. But I think you ought to be using 'N' or 'V'.

  2. My interpretation on the spec you linked is that the entire file may need byte order reversal?

    If you've already successfully parsed a 2bit file, then you may have just lucked out on using the same platform/byteordering as it was created on.

    My instinct would be to check the sig and save a template char of 'N' or 'V' as appropriate and use that for all future unpacking. Has a knock-on advantage also. See later.

The slowest bit of the process seems likely to be

$dna = join '', map $conv{$_}, unpack('(A2)*', unpack("B" . $bytes * BITS_PER_BYTE , $raw) +);

You should be able to save a bit of time by building a larger lookup table:

my %bitMap = ('00' => 'T', '01' => 'C', '10' => 'A', '11' => 'G'); my @byteMap = map{ join '', map $bitMap{ $_ }, unpack '(A2)4', unpack 'B8', chr } 0 .. 255;

Now you can convert each byte (4 bases) in the packed DNA to it's ascii with a single array lookup rather than 4 hash loopkups:

## Omit the braces add a comma for a negligable further improvement my $DNA = join '', map{ $byteMap[ $_ ] } unpack 'C*', $raw;

Also, build your lookups at compile time not over and over at runtime as now.

Of course, you can take that idea a little further and do two bytes at a time:

my %bitMap = ('00' => 'T', '01' => 'C', '10' => 'A', '11' => 'G'); my @byteMap = map{ join '', map $bitMap{ $_ }, unpack '(A2)4', unpack 'B8', chr } 0 .. 255; my @wordMap = map { $byteMap[ $_ >> 8 ] . $byteMap[ $_ | 0xff ] } 0 .. 65535; ... my $DNA = join '', map{ $wordMap[ $_ ] } unpack 'n*', $raw;

Which ought to be close to an order of magnitude faster than your current method. 1 array lookup -v- 8 hash lookups; 8 times less lower loop overhead.

Of course, the 'v' template will be byte-order specific. But, if when you determine the byte order from reading the signature, you store a template of 'N' or 'V' for your 32-bit field processing, then you can just lc that template to obtain the unsigned short template.

Also, how sure are you of your conversion table? Are you certain that you should be using 'B' and not 'b'.

I might have tested some of this, but it would take me the best part of two days to download the "sample" .2bit file you linked and a quick google didn't locate any others.

Final thought. If you built an ordered array of offsets, as well as the named index, when processing the toc, you could provide access by position as well as access by name. (Your ordered hash module might work for this also :)

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re: Parsing .2bit DNA files by BrowserUk
in thread Parsing .2bit DNA files by Limbic~Region

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and all is quiet...

    How do I use this? | Other CB clients
    Other Users?
    Others having an uproarious good time at the Monastery: (3)
    As of 2018-04-27 05:25 GMT
    Find Nodes?
      Voting Booth?