http://www.perlmonks.org?node_id=700220


in reply to Re^3: when to c, when to perl
in thread when to c, when to perl

Could you please elaborate on why adding berkelyDB to the mix would be worse than Dbfile? Let me give a for instance. I have a bacterial genome of 5 million bases. I want to break this up into kmers of various sizes. I need to pay attention to both DNA strands, so I record the orientation in which I see the kmer. For each kmer I want to see if it has already been seen. If so, increment the number of times the kmer was seen, record where it was seen, record the orientation of the kmer. Now go through the kmers and record which ones have low sequence comlexity - lots of repeats or other characterisitcs that might make identifying overlapping kmers difficult, for instance. So now I have a number of different hashes, usually keyed to the kmer sequence which I will use in the next part of my project. I will probabaly also sort all the kmers in my hash/database to speed up the search process in the next steps. I amy even precompute a series of such kmer databases ahead of time for different sizes, simply to help with processing the data.

Now take a set of 140 milliion kmers from a next generation sequencing platform - the population covers both strands of the DNA. First question is how to quickly identify how many times each kmer from the reference genome was covered with a kmer from the next gen sequencing data. Are all the reference kmers represented or are some of them over or under represented?

Now we look for differences in the remaining kmers - do these represent base changes, base deletions or base insertions as compared to the reference genome. Again, you're doing a lot of hashing, counting and inferring based on this data.

Finally you get to create standardized files that will allow you to represent this information in a standard file format for display in a series of genome browsers.

My original thought had been that berkelyDB would be more robust for this type of large scale data processing project. Can you can provide more information on why DBfile is more effective in this approach than berkeleyDB?

MadraghRua
yet another biologist hacking perl....