comment on

All,
Last night on #perl IRC (freenode), someone asked for help parsing .2bit files. The usual responses followed:

Have you searched CPAN?
Have you checked BioPerl?
Can't you just use unpack?

The response was that CPAN hadn't turned up anything, they weren't interested in the monolith that is BioPerl, and they were not that comfortable with unpack. I glanced at the spec and said that I would be happy to provide a full working solution as soon as my daughter went to bed. This turned out to be more challenging than I expected due to lack of clarity in the spec.

The issue is in the section covering sequence records. One field indicates the number of "N blocks" in the sequence, the next field is the "starting positions of the N blocks" and the next field is the "lengths of each N block". Right, and how is a single 32bit integer supposed to describe multiple items. The solution is that, assuming the number of N blocks is X, the next X fields will be starting positions and the X fields after that are the lengths. If X is 0, the next fields are omitted entirely.

Fortunately, the requester was able to provide some sample files and I put together the following code:

#!/usr/bin/perl
use constant            ONE_BYTE => 1;
use constant           FOUR_BYTE => 4;
use constant       BITS_PER_BYTE => 8;
use constant BASES_PER_FOUR_BYTE => 16;
use strict;
use warnings;

my $file = $ARGV[0] or die "Usage: $0 <input>";
open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $
+!";

my $header = parse_header($fh);
my $count  = $header->{CNT};

my %toc;
populate_toc($fh, $count, \%toc);

for my $name (keys %toc) {
    my $offset = $toc{$name};
    my $dna = fetch_record($fh, $offset);
    #print length($dna), "\n";
    print "$dna\n";
}

sub parse_header {
    my ($fh) = @_;

    # Read header
    my $raw = '';
    sysread($fh, $raw, FOUR_BYTE * 4);

    # Parse header
    my ($sig, $ver, $cnt, $reserved) = unpack('l4', $raw);

    # TODO: validate (signature, reverse byte order, version)
    return {SIG => $sig, VER => $ver, CNT => $cnt, RSV => $reserved};
}

sub populate_toc {
    my ($fh, $count, $toc) = @_;

    my ($raw, $size, $name) = ('', '', '');
    for (1 .. $count) {

        # Read size of record name
        sysread($fh, $raw, ONE_BYTE);
        $size = unpack('C', $raw);

        # Read name of reacord
        sysread($fh, $name, $size);

        # Read and store offset
        sysread($fh, $raw, FOUR_BYTE);
        $toc->{$name} = unpack('l', $raw);
    }
}

sub fetch_record {
    my ($fh, $offset) = @_;

    my ($raw, $dna, $size, $cnt) = ('', '', '', '');
    my (@start, @len, %nblock, %mblock);

    # Seek to the record location
    sysseek($fh, $offset, 0);

    # Establish the conversion table
    my %conv = ('00' => 'T', '01' => 'C', '10' => 'A', '11' => 'G');

    # Fetch the DNA size
    sysread($fh, $raw, FOUR_BYTE);
    $size = unpack('l', $raw);

    # Handle the n block and m blocks
    for my $block (\%nblock, \%mblock) {

        # Fetch the n block count
        sysread($fh, $raw, FOUR_BYTE);
        $cnt = unpack('l', $raw);

        if ($cnt) {
            sysread($fh, $raw, FOUR_BYTE * $cnt);
            @start = unpack("l$cnt", $raw);
            sysread($fh, $raw, FOUR_BYTE * $cnt);
            @len = unpack("l$cnt", $raw);
            @{$block}{@start} = @len;
        }
    }

    # throw away reserved field
    sysread($fh, $raw, FOUR_BYTE);

    # Fetch DNA - TODO:  read in configurable size chunks
    my $bytes = ((int($size / BASES_PER_FOUR_BYTE)) + 1) * FOUR_BYTE;
    sysread($fh, $raw, $bytes);
    $dna = join '', 
           map $conv{$_},
           unpack('(A2)*', unpack("B" . $bytes * BITS_PER_BYTE , $raw)
+);

    # Fix N blocks
    substr($dna, $_, $nblock{$_}, 'N' x $nblock{$_}) for keys %nblock;

    # Fix M blocks
    substr($dna, $_, $mblock{$_}, lc(substr($dna, $_, $mblock{$_}))) f
+or keys %mblock;

    return substr($dna, 0, $size);
}
__END__
See http://genome.ucsc.edu/FAQ/FAQformat#format7
See also human genome as .2bit
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/hg18.2bit

TODO
1.  Documentation, error handling, and tests
2.  Profiling and benchmarking
3.  Handle big/little endian
4.  Dynamically determine if record count justifies SQLite versus in-m
+emory hash
5.  Allow the reading of DNA sequence in configurable size chunks
6.  Add an API to treat the sequence like an iterator
[download]

This leaves me with the following questions:

Is there an existing non-bundled module that does this?
Given that speed is a requirement, how would you improve it?
Is this worth polishing to put on CPAN?
If the goal of .2bit files is maximal compression, do you have any idea why N blocks aren't inserted rather than replaced?

Update: Added a few todo items in case this does make it to CPAN

Cheers - L~R

In reply to Parsing .2bit DNA files by Limbic~Region

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


XP is just a number
	PerlMonks