in reply to Re^3: when to c, when to perl
in thread when to c, when to perl
Now take a set of 140 million kmers from a next-generation sequencing platform - the population covers both strands of the DNA. The first question is how to quickly identify how many times each kmer from the reference genome was covered by a kmer from the next-gen sequencing data. Are all the reference kmers represented, or are some of them over- or under-represented?
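To make the counting step concrete, here is a minimal sketch, assuming the kmers have already been extracted to flat files. The file names and the in-memory hash are illustrative only - at 140 million kmers you would want something disk-backed, which is where the database question below comes in:

    use strict;
    use warnings;

    # Reverse complement, so kmers from either strand count against the same reference kmer.
    sub revcomp {
        my $s = reverse shift;
        $s =~ tr/ACGTacgt/TGCAtgca/;
        return $s;
    }

    # Load the reference kmers, initialising every count to zero.
    my %coverage;
    open my $ref, '<', 'reference_kmers.txt' or die "reference_kmers.txt: $!";
    while (my $kmer = <$ref>) {
        chomp $kmer;
        $coverage{$kmer} = 0;
    }
    close $ref;

    # Count how often each reference kmer is seen in the sequencing data;
    # kmers matching neither strand are set aside for the next step.
    my @unmatched;
    open my $seq, '<', 'sequencing_kmers.txt' or die "sequencing_kmers.txt: $!";
    while (my $kmer = <$seq>) {
        chomp $kmer;
        my $key = exists $coverage{$kmer}             ? $kmer
                : exists $coverage{ revcomp($kmer) }  ? revcomp($kmer)
                :                                       undef;
        if (defined $key) { $coverage{$key}++ }
        else              { push @unmatched, $kmer }
    }
    close $seq;

    # Zero counts are unrepresented reference kmers; unusually high or low
    # counts flag over- and under-represented regions.
    my $missing = grep { $coverage{$_} == 0 } keys %coverage;
    print "$missing reference kmers were never covered\n";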
Next we look for differences in the remaining kmers - do these represent base changes, deletions or insertions compared to the reference genome? Again, you're doing a lot of hashing, counting and inferring on this data.
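For the substitution case, a naive sketch of the comparison - in a real run you would index the reference kmers or use a dedicated aligner rather than compare all pairs:

    # Does $kmer differ from $ref_kmer by exactly one base?
    # String XOR leaves "\0" wherever the characters match, so counting the
    # non-null bytes counts the mismatched positions.
    sub single_base_change {
        my ($kmer, $ref_kmer) = @_;
        return 0 unless length($kmer) == length($ref_kmer);
        my $mismatches = ($kmer ^ $ref_kmer) =~ tr/\0//c;
        return $mismatches == 1;
    }

Insertions and deletions don't show up as a single mismatched kmer; one common way to infer them is from runs of consecutive reference kmers that suddenly lose coverage.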
Finally you get to write this information out in standard file formats for display in a series of genome browsers.
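For the display step, one sketch of such an output: a variableStep wiggle (WIG) track of per-position coverage, which the common genome browsers will load. The chromosome name, file name and the %coverage_by_pos hash are placeholders here:

    # Write per-base coverage as a WIG track for loading into a genome browser.
    my %coverage_by_pos;   # 1-based reference position => kmer coverage count, filled in from the counting step

    open my $wig, '>', 'coverage.wig' or die "coverage.wig: $!";
    print {$wig} qq{track type=wiggle_0 name="kmer coverage"\n};
    print {$wig} "variableStep chrom=chr1\n";
    for my $pos (sort { $a <=> $b } keys %coverage_by_pos) {
        printf {$wig} "%d %d\n", $pos, $coverage_by_pos{$pos};
    }
    close $wig;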
My original thought had been that BerkeleyDB would be more robust for this type of large-scale data processing project. Can you provide more information on why DB_File is more effective than BerkeleyDB for this kind of approach?
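For concreteness, the DB_File usage I have in mind is the tied-hash interface, where the counting hash above is simply backed by a file on disk (the file name here is arbitrary):

    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_RDWR O_CREAT);

    # Tie the count hash to a Berkeley DB file on disk so the full set of
    # kmers never has to fit in memory; the rest of the code is unchanged.
    my %coverage;
    tie %coverage, 'DB_File', 'kmer_counts.db', O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "Cannot tie kmer_counts.db: $!";

    $coverage{'ACGTACGTAC'}++;    # counting works exactly as with an in-memory hash

    untie %coverage;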
MadraghRua
yet another biologist hacking perl....