Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

The first thing I noticed is that you are detecting the byte order for the header, but then ignoring it and using the platform specific 'l' template then on. That's wrong in two ways:

  1. Using 'l' template, which is for signed 32-bit numbers could cause problems.

    You should probably at least use 'L'. But I think you ought to be using 'N' or 'V'.

  2. My interpretation on the spec you linked is that the entire file may need byte order reversal?

    If you've already successfully parsed a 2bit file, then you may have just lucked out on using the same platform/byteordering as it was created on.

    My instinct would be to check the sig and save a template char of 'N' or 'V' as appropriate and use that for all future unpacking. Has a knock-on advantage also. See later.

The slowest bit of the process seems likely to be

$dna = join '', map $conv{$_}, unpack('(A2)*', unpack("B" . $bytes * BITS_PER_BYTE , $raw) +);

You should be able to save a bit of time by building a larger lookup table:

my %bitMap = ('00' => 'T', '01' => 'C', '10' => 'A', '11' => 'G'); my @byteMap = map{ join '', map $bitMap{ $_ }, unpack '(A2)4', unpack 'B8', chr } 0 .. 255;

Now you can convert each byte (4 bases) in the packed DNA to it's ascii with a single array lookup rather than 4 hash loopkups:

## Omit the braces add a comma for a negligable further improvement my $DNA = join '', map{ $byteMap[ $_ ] } unpack 'C*', $raw;

Also, build your lookups at compile time not over and over at runtime as now.

Of course, you can take that idea a little further and do two bytes at a time:

my %bitMap = ('00' => 'T', '01' => 'C', '10' => 'A', '11' => 'G'); my @byteMap = map{ join '', map $bitMap{ $_ }, unpack '(A2)4', unpack 'B8', chr } 0 .. 255; my @wordMap = map { $byteMap[ $_ >> 8 ] . $byteMap[ $_ | 0xff ] } 0 .. 65535; ... my $DNA = join '', map{ $wordMap[ $_ ] } unpack 'n*', $raw;

Which ought to be close to an order of magnitude faster than your current method. 1 array lookup -v- 8 hash lookups; 8 times less lower loop overhead.

Of course, the 'v' template will be byte-order specific. But, if when you determine the byte order from reading the signature, you store a template of 'N' or 'V' for your 32-bit field processing, then you can just lc that template to obtain the unsigned short template.

Also, how sure are you of your conversion table? Are you certain that you should be using 'B' and not 'b'.

I might have tested some of this, but it would take me the best part of two days to download the "sample" .2bit file you linked and a quick google didn't locate any others.

Final thought. If you built an ordered array of offsets, as well as the named index, when processing the toc, you could provide access by position as well as access by name. (Your ordered hash module might work for this also :)


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re: Parsing .2bit DNA files by BrowserUk
in thread Parsing .2bit DNA files by Limbic~Region

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chilling in the Monastery: (4)
As of 2024-06-18 17:49 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?
    erzuuli‥ 🛈The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.