Re: fast+generic interface for range-based indexed retrieval
by MadraghRua (Vicar)
on Dec 11, 2008 at 18:18 UTC
I have similar issues in tackling next generation sequencing technology outputs. Typically we're looking at short sequence reads of 35 or so characters. Depending upon the technology, they are either predominantly letter based or number based. As a learning project I've been looking at repeat sequences from the human genome and trying to come up with an indexed set of them - the idea being that I can simply remove these from the original reads and concentrate on working with the non-repeat sequence reads.
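To make the "remove the repeats" idea concrete, here is a minimal sketch. It is in Python for brevity, but it maps directly onto a Perl hash used as a set. The function name and the toy reads are made up for illustration, and it only handles exact matches.

```python
def strip_repeats(reads, repeats):
    """Return the reads that are not exact matches to a known repeat."""
    # A hash-based set gives average constant-time membership tests,
    # which is the same role a Perl hash keyed on the repeat would play.
    repeat_index = set(repeats)
    return [read for read in reads if read not in repeat_index]


# Toy data, invented for this example.
reads = ["ACGTACGT", "TTTTTTTT", "GATTACAG", "ACACACAC"]
repeats = ["TTTTTTTT", "ACACACAC"]

kept = strip_repeats(reads, repeats)
print(kept)
```

Real repeat masking would also need to cope with mismatches and reverse complements, which this sketch ignores.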
To date the best thing I've found is breaking the reads down based upon sequence complexity and producing several sorted indices, using BerkeleyDB, DB_File or my own creations. You can also hold the keys in Perl's hashes, which can be a useful tool, though a hash has no defined key order, so you have to sort the keys explicitly. So far I've been finding that preindexing is key, though I've yet to find a really satisfactory way to do it more efficiently in Perl.
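Roughly what I mean by several sorted indices, as a sketch only. The complexity measure here (number of distinct symbols in the read) is a stand-in I picked for the example, not a recommendation, and the in-memory lists stand in for whatever on-disk store you use.

```python
from bisect import bisect_left
from collections import defaultdict


def complexity(read):
    # Hypothetical complexity measure: count of distinct symbols in the read.
    return len(set(read))


def build_indices(reads):
    """Bucket reads by complexity, then sort each bucket once up front."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[complexity(read)].append(read)
    return {level: sorted(group) for level, group in buckets.items()}


def contains(indices, read):
    """Binary-search only the bucket that could hold this read."""
    group = indices.get(complexity(read), [])
    pos = bisect_left(group, read)
    return pos < len(group) and group[pos] == read


indices = build_indices(["AAAA", "ACAC", "ACGT", "CACA", "TTTT", "GATC"])
print(contains(indices, "CACA"), contains(indices, "AGAG"))
```

The point of the bucketing is that each lookup touches one small sorted list rather than the whole collection. The sort cost is paid once, at preindexing time.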
You might want to look into a new tool called Bowtie. It builds a Burrows-Wheeler index of the reference genome and uses it for fast lookup when aligning short reads. It is reported to have really fast alignment times for genomic data.
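For a feel of how a Burrows-Wheeler index answers exact-match queries, here is a toy sketch of the textbook backward search. This is not Bowtie's code: the suffix array is built naively (fine for a few hundred characters, hopeless for a genome), the occurrence table is stored in full, and the function names are my own.

```python
def build_fm_index(text):
    """Build a toy FM-index: suffix array, first-column offsets, occurrence table."""
    text += "$"  # sentinel; sorts before A, C, G and T
    # Naive suffix array: sort suffix start positions by the suffix itself.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT is the character preceding each sorted suffix (wraps to "$" at 0).
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # first[c] = number of characters in the text strictly smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, first, occ


def find(pattern, sa, first, occ):
    """Return sorted start positions of exact matches, via backward search."""
    lo, hi = 0, len(sa)  # half-open range of matching suffix-array rows
    for c in reversed(pattern):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, first, occ = build_fm_index("GATTACAGATTACA")
print(find("ATTA", sa, first, occ))
```

Each pattern character costs two table lookups regardless of genome size, which is where the speed comes from. Production tools add compressed tables and mismatch handling on top of this.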
Another alternative is to look at Genomatix GmbH - a bioinformatics company based in Germany. They also have a proprietary indexing scheme that permits fast sequence alignment. Unfortunately the algorithm is not published for this one, but their approach is to preprocess the genome of interest into kmers ranging from 8mers to million mers and provide their indexes with their software.
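Since the Genomatix algorithm is unpublished, I can't show what they do. The sketch below is only the generic k-mer preprocessing idea: record where every k-mer of the genome starts, use the first k-mer of a read as a seed, then verify the rest of the read. The names and the toy genome are invented for the example.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the list of its start positions."""
    index = defaultdict(list)
    for start in range(len(genome) - k + 1):
        index[genome[start:start + k]].append(start)
    return index


def locate(read, genome, index, k):
    """Seed with the read's first k-mer, then verify the full read in place."""
    candidates = index.get(read[:k], [])
    return [p for p in candidates if genome[p:p + len(read)] == read]


genome = "GATTACAGATTACAT"  # toy sequence, not real data
k = 4
index = build_kmer_index(genome, k)
print(locate("ATTACAT", genome, index, k))
```

A larger k gives fewer false seeds but a bigger index, which is presumably why one would precompute a range of k-mer sizes.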
A final suggestion is to check out Ewan Birney's Dynamite.
I've been finding that for smaller genome projects (< 5 reference sequence tags, each 35 bases in length) hashes in Perl work quite well. Bear in mind that Perl does not keep hash keys in any particular order, so sort them yourself if you need ordered access. If you have access to a server farm with clustering software like Gluster and load sharing, you could simply distribute the analysis over many nodes and run it in parallel.
So sorry - no specific module recommendations but perhaps looking at Bowtie or Genomatix might spark some ideas for you.