Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:
Last night on #perl IRC (freenode), someone asked for help parsing .2bit files. The usual responses followed:
The response was that CPAN hadn't turned up anything, they weren't interested in the monolith that is BioPerl, and they were not that comfortable with unpack. I glanced at the spec and said that I would be happy to provide a full working solution as soon as my daughter went to bed. This turned out to be more challenging than I expected due to lack of clarity in the spec.
The issue is in the section covering sequence records. One field indicates the number of "N blocks" in the sequence, the next field is the "starting positions of the N blocks" and the next field is the "lengths of each N block". Right, and how is a single 32bit integer supposed to describe multiple items. The solution is that, assuming the number of N blocks is X, the next X fields will be starting positions and the X fields after that are the lengths. If X is 0, the next fields are omitted entirely.
Fortunately, the requester was able to provide some sample files and I put together the following code:
#!/usr/bin/perl use constant ONE_BYTE => 1; use constant FOUR_BYTE => 4; use constant BITS_PER_BYTE => 8; use constant BASES_PER_FOUR_BYTE => 16; use strict; use warnings; my $file = $ARGV[0] or die "Usage: $0 <input>"; open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $ +!"; my $header = parse_header($fh); my $count = $header->{CNT}; my %toc; populate_toc($fh, $count, \%toc); for my $name (keys %toc) { my $offset = $toc{$name}; my $dna = fetch_record($fh, $offset); #print length($dna), "\n"; print "$dna\n"; } sub parse_header { my ($fh) = @_; # Read header my $raw = ''; sysread($fh, $raw, FOUR_BYTE * 4); # Parse header my ($sig, $ver, $cnt, $reserved) = unpack('l4', $raw); # TODO: validate (signature, reverse byte order, version) return {SIG => $sig, VER => $ver, CNT => $cnt, RSV => $reserved}; } sub populate_toc { my ($fh, $count, $toc) = @_; my ($raw, $size, $name) = ('', '', ''); for (1 .. $count) { # Read size of record name sysread($fh, $raw, ONE_BYTE); $size = unpack('C', $raw); # Read name of reacord sysread($fh, $name, $size); # Read and store offset sysread($fh, $raw, FOUR_BYTE); $toc->{$name} = unpack('l', $raw); } } sub fetch_record { my ($fh, $offset) = @_; my ($raw, $dna, $size, $cnt) = ('', '', '', ''); my (@start, @len, %nblock, %mblock); # Seek to the record location sysseek($fh, $offset, 0); # Establish the conversion table my %conv = ('00' => 'T', '01' => 'C', '10' => 'A', '11' => 'G'); # Fetch the DNA size sysread($fh, $raw, FOUR_BYTE); $size = unpack('l', $raw); # Handle the n block and m blocks for my $block (\%nblock, \%mblock) { # Fetch the n block count sysread($fh, $raw, FOUR_BYTE); $cnt = unpack('l', $raw); if ($cnt) { sysread($fh, $raw, FOUR_BYTE * $cnt); @start = unpack("l$cnt", $raw); sysread($fh, $raw, FOUR_BYTE * $cnt); @len = unpack("l$cnt", $raw); @{$block}{@start} = @len; } } # throw away reserved field sysread($fh, $raw, FOUR_BYTE); # Fetch DNA - TODO: read in configurable size chunks my $bytes = ((int($size / BASES_PER_FOUR_BYTE)) + 1) * FOUR_BYTE; sysread($fh, $raw, $bytes); $dna = join '', map $conv{$_}, unpack('(A2)*', unpack("B" . $bytes * BITS_PER_BYTE , $raw) +); # Fix N blocks substr($dna, $_, $nblock{$_}, 'N' x $nblock{$_}) for keys %nblock; # Fix M blocks substr($dna, $_, $mblock{$_}, lc(substr($dna, $_, $mblock{$_}))) f +or keys %mblock; return substr($dna, 0, $size); } __END__ See http://genome.ucsc.edu/FAQ/FAQformat#format7 See also human genome as .2bit ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/hg18.2bit TODO 1. Documentation, error handling, and tests 2. Profiling and benchmarking 3. Handle big/little endian 4. Dynamically determine if record count justifies SQLite versus in-m +emory hash 5. Allow the reading of DNA sequence in configurable size chunks 6. Add an API to treat the sequence like an iterator
This leaves me with the following questions:
- Is there an existing non-bundled module that does this?
- Given that speed is a requirement, how would you improve it?
- Is this worth polishing to put on CPAN?
- If the goal of .2bit files is maximal compression, do you have any idea why N blocks aren't inserted rather than replaced?
Update: Added a few todo items in case this does make it to CPAN
Cheers - L~R
|
---|
Replies are listed 'Best First'. | |
---|---|
Re: Parsing .2bit DNA files
by pc88mxer (Vicar) on Mar 05, 2008 at 21:10 UTC | |
by Limbic~Region (Chancellor) on Mar 05, 2008 at 23:23 UTC | |
Re: Parsing .2bit DNA files
by blokhead (Monsignor) on Mar 06, 2008 at 05:49 UTC | |
by bart (Canon) on Mar 06, 2008 at 11:48 UTC | |
Re: Parsing .2bit DNA files
by BrowserUk (Patriarch) on Mar 06, 2008 at 04:37 UTC | |
by Khisanth (Novice) on Mar 07, 2008 at 04:07 UTC | |
Re: Parsing .2bit DNA files
by bobf (Monsignor) on Mar 06, 2008 at 04:08 UTC | |
Re: Parsing .2bit DNA files
by bart (Canon) on Mar 06, 2008 at 21:39 UTC | |
Re: Parsing .2bit DNA files
by ikegami (Patriarch) on Mar 07, 2008 at 06:41 UTC | |
Re: Parsing .2bit DNA files
by Anonymous Monk on Oct 03, 2011 at 09:21 UTC | |
by Limbic~Region (Chancellor) on Oct 03, 2011 at 11:50 UTC |