Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
All,
Last night on #perl IRC (freenode), someone asked for help parsing .2bit files. The usual responses followed:

The response was that CPAN hadn't turned up anything, they weren't interested in the monolith that is BioPerl, and they were not that comfortable with unpack. I glanced at the spec and said that I would be happy to provide a full working solution as soon as my daughter went to bed. This turned out to be more challenging than I expected due to lack of clarity in the spec.

The issue is in the section covering sequence records. One field indicates the number of "N blocks" in the sequence, the next field is the "starting positions of the N blocks" and the next field is the "lengths of each N block". Right, and how is a single 32bit integer supposed to describe multiple items. The solution is that, assuming the number of N blocks is X, the next X fields will be starting positions and the X fields after that are the lengths. If X is 0, the next fields are omitted entirely.

Fortunately, the requester was able to provide some sample files and I put together the following code:

#!/usr/bin/perl use constant ONE_BYTE => 1; use constant FOUR_BYTE => 4; use constant BITS_PER_BYTE => 8; use constant BASES_PER_FOUR_BYTE => 16; use strict; use warnings; my $file = $ARGV[0] or die "Usage: $0 <input>"; open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $ +!"; my $header = parse_header($fh); my $count = $header->{CNT}; my %toc; populate_toc($fh, $count, \%toc); for my $name (keys %toc) { my $offset = $toc{$name}; my $dna = fetch_record($fh, $offset); #print length($dna), "\n"; print "$dna\n"; } sub parse_header { my ($fh) = @_; # Read header my $raw = ''; sysread($fh, $raw, FOUR_BYTE * 4); # Parse header my ($sig, $ver, $cnt, $reserved) = unpack('l4', $raw); # TODO: validate (signature, reverse byte order, version) return {SIG => $sig, VER => $ver, CNT => $cnt, RSV => $reserved}; } sub populate_toc { my ($fh, $count, $toc) = @_; my ($raw, $size, $name) = ('', '', ''); for (1 .. $count) { # Read size of record name sysread($fh, $raw, ONE_BYTE); $size = unpack('C', $raw); # Read name of reacord sysread($fh, $name, $size); # Read and store offset sysread($fh, $raw, FOUR_BYTE); $toc->{$name} = unpack('l', $raw); } } sub fetch_record { my ($fh, $offset) = @_; my ($raw, $dna, $size, $cnt) = ('', '', '', ''); my (@start, @len, %nblock, %mblock); # Seek to the record location sysseek($fh, $offset, 0); # Establish the conversion table my %conv = ('00' => 'T', '01' => 'C', '10' => 'A', '11' => 'G'); # Fetch the DNA size sysread($fh, $raw, FOUR_BYTE); $size = unpack('l', $raw); # Handle the n block and m blocks for my $block (\%nblock, \%mblock) { # Fetch the n block count sysread($fh, $raw, FOUR_BYTE); $cnt = unpack('l', $raw); if ($cnt) { sysread($fh, $raw, FOUR_BYTE * $cnt); @start = unpack("l$cnt", $raw); sysread($fh, $raw, FOUR_BYTE * $cnt); @len = unpack("l$cnt", $raw); @{$block}{@start} = @len; } } # throw away reserved field sysread($fh, $raw, FOUR_BYTE); # Fetch DNA - TODO: read in configurable size chunks my $bytes = ((int($size / BASES_PER_FOUR_BYTE)) + 1) * FOUR_BYTE; sysread($fh, $raw, $bytes); $dna = join '', map $conv{$_}, unpack('(A2)*', unpack("B" . $bytes * BITS_PER_BYTE , $raw) +); # Fix N blocks substr($dna, $_, $nblock{$_}, 'N' x $nblock{$_}) for keys %nblock; # Fix M blocks substr($dna, $_, $mblock{$_}, lc(substr($dna, $_, $mblock{$_}))) f +or keys %mblock; return substr($dna, 0, $size); } __END__ See http://genome.ucsc.edu/FAQ/FAQformat#format7 See also human genome as .2bit ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/hg18.2bit TODO 1. Documentation, error handling, and tests 2. Profiling and benchmarking 3. Handle big/little endian 4. Dynamically determine if record count justifies SQLite versus in-m +emory hash 5. Allow the reading of DNA sequence in configurable size chunks 6. Add an API to treat the sequence like an iterator

This leaves me with the following questions:

  • Is there an existing non-bundled module that does this?
  • Given that speed is a requirement, how would you improve it?
  • Is this worth polishing to put on CPAN?
  • If the goal of .2bit files is maximal compression, do you have any idea why N blocks aren't inserted rather than replaced?

Update: Added a few todo items in case this does make it to CPAN

Cheers - L~R


In reply to Parsing .2bit DNA files by Limbic~Region

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others about the Monastery: (6)
As of 2024-04-24 08:51 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found