Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options

Comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
Last night on #perl IRC (freenode), someone asked for help parsing .2bit files. The usual responses followed:

The response was that CPAN hadn't turned up anything, they weren't interested in the monolith that is BioPerl, and they were not that comfortable with unpack. I glanced at the spec and said that I would be happy to provide a full working solution as soon as my daughter went to bed. This turned out to be more challenging than I expected due to lack of clarity in the spec.

The issue is in the section covering sequence records. One field indicates the number of "N blocks" in the sequence, the next field is the "starting positions of the N blocks" and the next field is the "lengths of each N block". Right, and how is a single 32bit integer supposed to describe multiple items. The solution is that, assuming the number of N blocks is X, the next X fields will be starting positions and the X fields after that are the lengths. If X is 0, the next fields are omitted entirely.

Fortunately, the requester was able to provide some sample files and I put together the following code:

#!/usr/bin/perl use constant ONE_BYTE => 1; use constant FOUR_BYTE => 4; use constant BITS_PER_BYTE => 8; use constant BASES_PER_FOUR_BYTE => 16; use strict; use warnings; my $file = $ARGV[0] or die "Usage: $0 <input>"; open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $ +!"; my $header = parse_header($fh); my $count = $header->{CNT}; my %toc; populate_toc($fh, $count, \%toc); for my $name (keys %toc) { my $offset = $toc{$name}; my $dna = fetch_record($fh, $offset); #print length($dna), "\n"; print "$dna\n"; } sub parse_header { my ($fh) = @_; # Read header my $raw = ''; sysread($fh, $raw, FOUR_BYTE * 4); # Parse header my ($sig, $ver, $cnt, $reserved) = unpack('l4', $raw); # TODO: validate (signature, reverse byte order, version) return {SIG => $sig, VER => $ver, CNT => $cnt, RSV => $reserved}; } sub populate_toc { my ($fh, $count, $toc) = @_; my ($raw, $size, $name) = ('', '', ''); for (1 .. $count) { # Read size of record name sysread($fh, $raw, ONE_BYTE); $size = unpack('C', $raw); # Read name of reacord sysread($fh, $name, $size); # Read and store offset sysread($fh, $raw, FOUR_BYTE); $toc->{$name} = unpack('l', $raw); } } sub fetch_record { my ($fh, $offset) = @_; my ($raw, $dna, $size, $cnt) = ('', '', '', ''); my (@start, @len, %nblock, %mblock); # Seek to the record location sysseek($fh, $offset, 0); # Establish the conversion table my %conv = ('00' => 'T', '01' => 'C', '10' => 'A', '11' => 'G'); # Fetch the DNA size sysread($fh, $raw, FOUR_BYTE); $size = unpack('l', $raw); # Handle the n block and m blocks for my $block (\%nblock, \%mblock) { # Fetch the n block count sysread($fh, $raw, FOUR_BYTE); $cnt = unpack('l', $raw); if ($cnt) { sysread($fh, $raw, FOUR_BYTE * $cnt); @start = unpack("l$cnt", $raw); sysread($fh, $raw, FOUR_BYTE * $cnt); @len = unpack("l$cnt", $raw); @{$block}{@start} = @len; } } # throw away reserved field sysread($fh, $raw, FOUR_BYTE); # Fetch DNA - TODO: read in configurable size chunks my $bytes = ((int($size / BASES_PER_FOUR_BYTE)) + 1) * FOUR_BYTE; sysread($fh, $raw, $bytes); $dna = join '', map $conv{$_}, unpack('(A2)*', unpack("B" . $bytes * BITS_PER_BYTE , $raw) +); # Fix N blocks substr($dna, $_, $nblock{$_}, 'N' x $nblock{$_}) for keys %nblock; # Fix M blocks substr($dna, $_, $mblock{$_}, lc(substr($dna, $_, $mblock{$_}))) f +or keys %mblock; return substr($dna, 0, $size); } __END__ See See also human genome as .2bit TODO 1. Documentation, error handling, and tests 2. Profiling and benchmarking 3. Handle big/little endian 4. Dynamically determine if record count justifies SQLite versus in-m +emory hash 5. Allow the reading of DNA sequence in configurable size chunks 6. Add an API to treat the sequence like an iterator

This leaves me with the following questions:

  • Is there an existing non-bundled module that does this?
  • Given that speed is a requirement, how would you improve it?
  • Is this worth polishing to put on CPAN?
  • If the goal of .2bit files is maximal compression, do you have any idea why N blocks aren't inserted rather than replaced?

Update: Added a few todo items in case this does make it to CPAN

Cheers - L~R

In reply to Parsing .2bit DNA files by Limbic~Region

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and all is quiet...

    How do I use this? | Other CB clients
    Other Users?
    Others imbibing at the Monastery: (4)
    As of 2018-04-21 22:15 GMT
    Find Nodes?
      Voting Booth?