Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

Last night on #perl IRC (freenode), someone asked for help parsing .2bit files. The usual responses followed:

The response was that CPAN hadn't turned up anything, they weren't interested in the monolith that is BioPerl, and they were not that comfortable with unpack. I glanced at the spec and said that I would be happy to provide a full working solution as soon as my daughter went to bed. This turned out to be more challenging than I expected due to lack of clarity in the spec.

The issue is in the section covering sequence records. One field indicates the number of "N blocks" in the sequence, the next field is the "starting positions of the N blocks" and the next field is the "lengths of each N block". Right, and how is a single 32bit integer supposed to describe multiple items. The solution is that, assuming the number of N blocks is X, the next X fields will be starting positions and the X fields after that are the lengths. If X is 0, the next fields are omitted entirely.

Fortunately, the requester was able to provide some sample files and I put together the following code:

#!/usr/bin/perl use constant ONE_BYTE => 1; use constant FOUR_BYTE => 4; use constant BITS_PER_BYTE => 8; use constant BASES_PER_FOUR_BYTE => 16; use strict; use warnings; my $file = $ARGV[0] or die "Usage: $0 <input>"; open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $ +!"; my $header = parse_header($fh); my $count = $header->{CNT}; my %toc; populate_toc($fh, $count, \%toc); for my $name (keys %toc) { my $offset = $toc{$name}; my $dna = fetch_record($fh, $offset); #print length($dna), "\n"; print "$dna\n"; } sub parse_header { my ($fh) = @_; # Read header my $raw = ''; sysread($fh, $raw, FOUR_BYTE * 4); # Parse header my ($sig, $ver, $cnt, $reserved) = unpack('l4', $raw); # TODO: validate (signature, reverse byte order, version) return {SIG => $sig, VER => $ver, CNT => $cnt, RSV => $reserved}; } sub populate_toc { my ($fh, $count, $toc) = @_; my ($raw, $size, $name) = ('', '', ''); for (1 .. $count) { # Read size of record name sysread($fh, $raw, ONE_BYTE); $size = unpack('C', $raw); # Read name of reacord sysread($fh, $name, $size); # Read and store offset sysread($fh, $raw, FOUR_BYTE); $toc->{$name} = unpack('l', $raw); } } sub fetch_record { my ($fh, $offset) = @_; my ($raw, $dna, $size, $cnt) = ('', '', '', ''); my (@start, @len, %nblock, %mblock); # Seek to the record location sysseek($fh, $offset, 0); # Establish the conversion table my %conv = ('00' => 'T', '01' => 'C', '10' => 'A', '11' => 'G'); # Fetch the DNA size sysread($fh, $raw, FOUR_BYTE); $size = unpack('l', $raw); # Handle the n block and m blocks for my $block (\%nblock, \%mblock) { # Fetch the n block count sysread($fh, $raw, FOUR_BYTE); $cnt = unpack('l', $raw); if ($cnt) { sysread($fh, $raw, FOUR_BYTE * $cnt); @start = unpack("l$cnt", $raw); sysread($fh, $raw, FOUR_BYTE * $cnt); @len = unpack("l$cnt", $raw); @{$block}{@start} = @len; } } # throw away reserved field sysread($fh, $raw, FOUR_BYTE); # Fetch DNA - TODO: read in configurable size chunks my $bytes = ((int($size / BASES_PER_FOUR_BYTE)) + 1) * FOUR_BYTE; sysread($fh, $raw, $bytes); $dna = join '', map $conv{$_}, unpack('(A2)*', unpack("B" . $bytes * BITS_PER_BYTE , $raw) +); # Fix N blocks substr($dna, $_, $nblock{$_}, 'N' x $nblock{$_}) for keys %nblock; # Fix M blocks substr($dna, $_, $mblock{$_}, lc(substr($dna, $_, $mblock{$_}))) f +or keys %mblock; return substr($dna, 0, $size); } __END__ See See also human genome as .2bit TODO 1. Documentation, error handling, and tests 2. Profiling and benchmarking 3. Handle big/little endian 4. Dynamically determine if record count justifies SQLite versus in-m +emory hash 5. Allow the reading of DNA sequence in configurable size chunks 6. Add an API to treat the sequence like an iterator

This leaves me with the following questions:

Update: Added a few todo items in case this does make it to CPAN

Cheers - L~R