http://www.perlmonks.org?node_id=700220


in reply to Re^3: when to c, when to perl
in thread when to c, when to perl

Could you please elaborate on why adding BerkeleyDB to the mix would be worse than a dbfile? Let me give an example. I have a bacterial genome of 5 million bases. I want to break this up into kmers of various sizes. I need to pay attention to both DNA strands, so I record the orientation in which I see each kmer. For each kmer I check whether it has already been seen; if so, I increment the number of times the kmer was seen and record where and in which orientation it was seen. Then I go through the kmers and record which ones have low sequence complexity - lots of repeats or other characteristics that might make identifying overlapping kmers difficult, for instance. So now I have a number of different hashes, usually keyed on the kmer sequence, which I will use in the next part of my project. I will probably also sort all the kmers in my hash/database to speed up the search process in the next steps. I may even precompute a series of such kmer databases ahead of time for different sizes, simply to help with processing the data.
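A rough sketch of that first counting step, assuming DB_File for the on-disk hash (the file name, kmer size, and read_fasta helper are placeholders for this example), with orientation handled by storing the lexicographically smaller of each kmer and its reverse complement:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    # Tie the counts to an on-disk hash so the table can grow past RAM.
    my $k = 21;
    tie my %count, 'DB_File', 'kmer_counts.db', O_RDWR|O_CREAT, 0644, $DB_HASH
        or die "Cannot tie kmer_counts.db: $!";

    my $genome = read_fasta('reference.fa');   # placeholder file name

    for my $pos ( 0 .. length($genome) - $k ) {
        my $kmer = substr $genome, $pos, $k;
        next if $kmer =~ /[^ACGT]/;            # skip ambiguous bases

        # Fold both strands together: keep the canonical (smaller) of
        # the kmer and its reverse complement.
        ( my $rc = reverse $kmer ) =~ tr/ACGT/TGCA/;
        my $canon = $kmer lt $rc ? $kmer : $rc;

        $count{$canon}++;
    }
    untie %count;

    sub read_fasta {
        my ($file) = @_;
        open my $fh, '<', $file or die "Cannot open $file: $!";
        my $seq = '';
        while (<$fh>) {
            next if /^>/;
            chomp;
            $seq .= uc $_;
        }
        return $seq;
    }

Positions and strand could be tracked the same way with a second tied hash keyed on the kmer, but the counting loop above is the core of it.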

Now take a set of 140 million kmers from a next generation sequencing platform - the population covers both strands of the DNA. The first question is how to quickly identify how many times each kmer from the reference genome was covered by a kmer from the next gen sequencing data. Are all the reference kmers represented, or are some of them over- or under-represented?
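One way to get those coverage numbers, sticking with the tied-hash approach (the read file name is made up, and the reference counts are assumed to be in the kmer_counts.db file from the sketch above):

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    my $k = 21;

    # Reference kmer counts built earlier; open read-only.
    tie my %ref_count, 'DB_File', 'kmer_counts.db', O_RDONLY, 0644, $DB_HASH
        or die "Cannot tie kmer_counts.db: $!";

    my %coverage;    # reference kmer => times seen in the read data

    open my $reads, '<', 'reads.txt' or die "Cannot open reads.txt: $!";
    while ( my $read = <$reads> ) {
        chomp $read;
        for my $pos ( 0 .. length($read) - $k ) {
            my $kmer = substr $read, $pos, $k;
            ( my $rc = reverse $kmer ) =~ tr/ACGT/TGCA/;
            my $canon = $kmer lt $rc ? $kmer : $rc;
            $coverage{$canon}++ if exists $ref_count{$canon};
        }
    }
    close $reads;

    # Reference kmers missing from %coverage are unrepresented; unusually
    # high counts flag over-represented regions.
    my @missing = grep { !exists $coverage{$_} } keys %ref_count;
    printf "%d reference kmers unrepresented\n", scalar @missing;

    untie %ref_count;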

Now we look for differences in the remaining kmers - do these represent base changes, deletions or insertions compared to the reference genome? Again, you're doing a lot of hashing, counting and inferring based on this data.

Finally, you create files in a standard format so that this information can be displayed in a range of genome browsers.

My original thought had been that BerkeleyDB would be more robust for this type of large scale data processing project. Can you provide more information on why a dbfile is more effective for this kind of work than BerkeleyDB?

MadraghRua
yet another biologist hacking perl....

Re^5: when to c, when to perl
by TGI (Parson) on Jul 26, 2008 at 01:19 UTC

    tilly is correct. For a case where the main issue is simply small RAM, adding another library will just use more memory and exacerbate the problem.

    Your situation is different. BerkeleyDB is certainly up to the task. I used the term dbfile as a generic term for databases like BerkeleyDB and GDBM. I'm sorry if this sloppy usage was confusing.

    I don't work much with large scale data processing tasks like this, but it looks like you are on the right track with your plans.

    Using BerkeleyDB is a way to offload memory-intensive and speed-critical operations to a C library through XS - exactly what has been widely advocated in this thread. The best thing is that someone else has already written and carefully optimized this code. What could be better than that?
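    For instance, with the BerkeleyDB module the tie interface keeps ordinary hash syntax while the storage and lookups run in the C library (the file name here is arbitrary):

        use strict;
        use warnings;
        use BerkeleyDB;

        tie my %kmers, 'BerkeleyDB::Hash',
            -Filename => 'kmers.bdb',
            -Flags    => DB_CREATE
            or die "Cannot open kmers.bdb: $BerkeleyDB::Error";

        $kmers{'ACGTACGT'}++;   # plain hash syntax; the C library does the work
        untie %kmers;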


    TGI says moo

Re^5: when to c, when to perl
by tilly (Archbishop) on Jul 25, 2008 at 21:58 UTC
    You misread that. He wasn't saying that dbfiles are better than BerkeleyDB. It would make little sense to say that, since BerkeleyDB is nothing more or less than a specific kind of dbfile.

    Instead he said that in the case he described, adding BerkeleyDB would be a bad choice, while in other cases a dbfile would be a good choice. So it all comes down to why different cases call for different approaches.

    For your problem a dbfile is a reasonable choice. But I'll note that, if you can, you really want to pre-sort your data and then store it in BTrees as much as possible. That will massively improve your locality of reference, which will reduce disk seeks. And I guarantee that with a problem like that you're being killed by disk seeks.
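    A sketch of that, using DB_File's BTree mode and inserting keys in sorted order so neighbouring kmers land on neighbouring pages (toy data here; in practice the counts would come from the counting pass):

        use strict;
        use warnings;
        use DB_File;
        use Fcntl;

        # Counts accumulated earlier (toy data for this example).
        my %count = (
            AAACGTACGTACGTACGTACG => 3,
            TTTCGTACGTACGTACGTACG => 1,
        );

        tie my %btree, 'DB_File', 'kmers.btree', O_RDWR|O_CREAT, 0644, $DB_BTREE
            or die "Cannot tie kmers.btree: $!";

        # Pre-sorted insertion: adjacent keys end up in adjacent BTree
        # pages, so later range scans touch far fewer disk pages.
        for my $kmer ( sort keys %count ) {
            $btree{$kmer} = $count{$kmer};
        }
        untie %btree;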