<?xml version="1.0" encoding="windows-1252"?>
<node id="672251" title="Parsing .2bit DNA files" created="2008-03-05 14:13:41" updated="2008-03-05 09:13:41">
<type id="115">
perlquestion</type>
<author id="180961">
Limbic~Region</author>
<data>
<field name="doctext">
All,
&lt;br /&gt;
Last night on #perl IRC (freenode), someone asked for help parsing [http://genome.ucsc.edu/FAQ/FAQformat#format7|.2bit files].  The usual responses followed:
&lt;ul&gt;
&lt;li&gt;Have you searched [http://search.cpan.org|CPAN]?&lt;/li&gt;
&lt;li&gt;Have you checked [http://bioperl.org|BioPerl]?&lt;/li&gt;
&lt;li&gt;Can't you just use [doc://unpack]?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;
The response was that [http://search.cpan.org|CPAN] hadn't turned up anything, they weren't interested in the monolith that is [http://bioperl.org|BioPerl], and they were not that comfortable with [doc://unpack].  I glanced at the spec and said that I would be happy to provide a full working solution as soon as my daughter went to bed.  This turned out to be more challenging than I expected due to lack of clarity in the [http://genome.ucsc.edu/FAQ/FAQformat#format7|spec].
&lt;/p&gt;
&lt;p&gt;
The issue is in the section covering sequence records.  One field indicates the number of "N blocks" in the sequence, the next field is the "starting positions of the N blocks" and the next field is the "lengths of each N block".  Right, and how is a single 32bit integer supposed to describe multiple items.  The solution is that, assuming the number of N blocks is X, the next X fields will be starting positions and the X fields after that are the lengths.  If X is 0, the next fields are omitted entirely.
&lt;/p&gt;
&lt;p&gt;
Fortunately, the requester was able to provide some sample files and I put together the following code:
&lt;/p&gt;
&lt;readmore&gt;
&lt;c&gt;
#!/usr/bin/perl
use constant            ONE_BYTE =&gt; 1;
use constant           FOUR_BYTE =&gt; 4;
use constant       BITS_PER_BYTE =&gt; 8;
use constant BASES_PER_FOUR_BYTE =&gt; 16;
use strict;
use warnings;

my $file = $ARGV[0] or die "Usage: $0 &lt;input&gt;";
open(my $fh, '&lt;', $file) or die "Unable to open '$file' for reading: $!";

my $header = parse_header($fh);
my $count  = $header-&gt;{CNT};

my %toc;
populate_toc($fh, $count, \%toc);

for my $name (keys %toc) {
    my $offset = $toc{$name};
    my $dna = fetch_record($fh, $offset);
    #print length($dna), "\n";
    print "$dna\n";
}

sub parse_header {
    my ($fh) = @_;

    # Read header
    my $raw = '';
    sysread($fh, $raw, FOUR_BYTE * 4);

    # Parse header
    my ($sig, $ver, $cnt, $reserved) = unpack('l4', $raw);

    # TODO: validate (signature, reverse byte order, version)
    return {SIG =&gt; $sig, VER =&gt; $ver, CNT =&gt; $cnt, RSV =&gt; $reserved};
}

sub populate_toc {
    my ($fh, $count, $toc) = @_;

    my ($raw, $size, $name) = ('', '', '');
    for (1 .. $count) {

        # Read size of record name
        sysread($fh, $raw, ONE_BYTE);
        $size = unpack('C', $raw);

        # Read name of reacord
        sysread($fh, $name, $size);

        # Read and store offset
        sysread($fh, $raw, FOUR_BYTE);
        $toc-&gt;{$name} = unpack('l', $raw);
    }
}

sub fetch_record {
    my ($fh, $offset) = @_;

    my ($raw, $dna, $size, $cnt) = ('', '', '', '');
    my (@start, @len, %nblock, %mblock);

    # Seek to the record location
    sysseek($fh, $offset, 0);

    # Establish the conversion table
    my %conv = ('00' =&gt; 'T', '01' =&gt; 'C', '10' =&gt; 'A', '11' =&gt; 'G');

    # Fetch the DNA size
    sysread($fh, $raw, FOUR_BYTE);
    $size = unpack('l', $raw);

    # Handle the n block and m blocks
    for my $block (\%nblock, \%mblock) {

        # Fetch the n block count
        sysread($fh, $raw, FOUR_BYTE);
        $cnt = unpack('l', $raw);

        if ($cnt) {
            sysread($fh, $raw, FOUR_BYTE * $cnt);
            @start = unpack("l$cnt", $raw);
            sysread($fh, $raw, FOUR_BYTE * $cnt);
            @len = unpack("l$cnt", $raw);
            @{$block}{@start} = @len;
        }
    }

    # throw away reserved field
    sysread($fh, $raw, FOUR_BYTE);

    # Fetch DNA - TODO:  read in configurable size chunks
    my $bytes = ((int($size / BASES_PER_FOUR_BYTE)) + 1) * FOUR_BYTE;
    sysread($fh, $raw, $bytes);
    $dna = join '', 
           map $conv{$_},
           unpack('(A2)*', unpack("B" . $bytes * BITS_PER_BYTE , $raw));

    # Fix N blocks
    substr($dna, $_, $nblock{$_}, 'N' x $nblock{$_}) for keys %nblock;

    # Fix M blocks
    substr($dna, $_, $mblock{$_}, lc(substr($dna, $_, $mblock{$_}))) for keys %mblock;

    return substr($dna, 0, $size);
}
__END__
See http://genome.ucsc.edu/FAQ/FAQformat#format7
See also human genome as .2bit
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/hg18.2bit

TODO
1.  Documentation, error handling, and tests
2.  Profiling and benchmarking
3.  Handle big/little endian
4.  Dynamically determine if record count justifies SQLite versus in-memory hash
5.  Allow the reading of DNA sequence in configurable size chunks
6.  Add an API to treat the sequence like an iterator
&lt;/c&gt;
&lt;/readmore&gt;
&lt;p&gt;
This leaves me with the following questions:
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Is there an existing non-bundled module that does this?&lt;/li&gt;
&lt;li&gt;Given that speed is a requirement, how would you improve it?&lt;/li&gt;
&lt;li&gt;Is this worth polishing to put on CPAN?&lt;/li&gt;
&lt;li&gt;If the goal of .2bit files is maximal compression, do you have any idea why N blocks aren't inserted rather than replaced?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;
&lt;b&gt;Update:&lt;/b&gt; Added a few todo items in case this does make it to CPAN
&lt;/p&gt;
&lt;div class="pmsig"&gt;&lt;div class="pmsig-180961"&gt;
&lt;p&gt;
Cheers - [Limbic~Region|L~R]
&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;</field>
</data>
</node>
