PerlMonks  

Re^4: when to c, when to perl

by MadraghRua (Vicar)
on Jul 25, 2008 at 20:51 UTC (#700220)


in reply to Re^3: when to c, when to perl
in thread when to c, when to perl

Could you please elaborate on why adding BerkeleyDB to the mix would be worse than a dbfile? Let me give a for instance. I have a bacterial genome of 5 million bases. I want to break this up into kmers of various sizes. I need to pay attention to both DNA strands, so I record the orientation in which I see each kmer. For each kmer I want to see if it has already been seen. If so, I increment the number of times the kmer was seen, record where it was seen, and record its orientation. Then I go through the kmers and record which ones have low sequence complexity: lots of repeats or other characteristics that might make identifying overlapping kmers difficult, for instance. So now I have a number of different hashes, usually keyed on the kmer sequence, which I will use in the next part of my project. I will probably also sort all the kmers in my hash/database to speed up the search process in the next steps. I may even precompute a series of such kmer databases ahead of time for different sizes, simply to help with processing the data.
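To make the indexing step above concrete, here is a minimal in-memory sketch in Python (chosen only for illustration; the same shape works with a Perl hash tied to a database file). The names `index_kmers` and `is_low_complexity` are hypothetical, and the "one base dominates" rule is an assumed stand-in for a real low-complexity test. Each kmer is stored under the alphabetically smaller of itself and its reverse complement, so both strands share one key.

```python
from collections import defaultdict

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def index_kmers(genome, k):
    """Map each canonical kmer to a list of (position, strand) hits.

    The count for a kmer is simply the length of its hit list.
    """
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        kmer = genome[pos:pos + k]
        rc = reverse_complement(kmer)
        if kmer <= rc:
            index[kmer].append((pos, "+"))   # seen as written
        else:
            index[rc].append((pos, "-"))     # seen on the opposite strand
    return index

def is_low_complexity(kmer, max_fraction=0.75):
    """Crude, assumed filter: flag kmers dominated by a single base."""
    most_common = max(kmer.count(base) for base in "ACGT")
    return most_common / len(kmer) > max_fraction
```

For a 5 million base genome this fits in memory for a single kmer size; the on-disk question only starts to matter once several sizes and the read data are added.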

Now take a set of 140 million kmers from a next generation sequencing platform. The population covers both strands of the DNA. The first question is how to quickly identify how many times each kmer from the reference genome was covered by a kmer from the next gen sequencing data. Are all the reference kmers represented, or are some of them over or under represented?
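The coverage question above can be sketched as a single pass over the read kmers. This is an illustrative in-memory Python version, not a tuned implementation; `canonical` and `coverage` are hypothetical names, and with 140 million reads the counts table (and certainly the unmatched list) would more likely live in an on-disk store.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the smaller of a kmer and its reverse complement."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)

def coverage(reference_kmers, read_kmers):
    """Count how often each reference kmer is hit by a read kmer.

    Returns (counts, unmatched): counts maps every canonical reference
    kmer to its hit count (0 if never seen); unmatched lists the read
    kmers that match no reference kmer on either strand.
    """
    counts = {canonical(k): 0 for k in reference_kmers}
    unmatched = []
    for kmer in read_kmers:
        key = canonical(kmer)
        if key in counts:
            counts[key] += 1
        else:
            unmatched.append(kmer)
    return counts, unmatched
```

Reference kmers whose count stays at zero are the unrepresented ones, and the unmatched reads are the candidates for the difference analysis in the next step.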

Now we look for differences in the remaining kmers: do these represent base changes, base deletions or base insertions as compared to the reference genome? Again, you're doing a lot of hashing, counting and inferring based on this data.

Finally you get to write this information out in a standard file format for display in a series of genome browsers.

My original thought had been that BerkeleyDB would be more robust for this type of large scale data processing project. Can you provide more information on why a dbfile is more effective in this approach than BerkeleyDB?

MadraghRua
yet another biologist hacking perl....


Re^5: when to c, when to perl
by tilly (Archbishop) on Jul 25, 2008 at 21:58 UTC
    You misread that. He wasn't saying that dbfiles are better than BerkeleyDB. It would make little sense to say that, since BerkeleyDB is nothing more or less than a specific kind of dbfile.

    Instead he said that in the case he described, BerkeleyDB would be a bad choice. In other cases a dbfile would be a good choice. So it is all about why different cases call for different choices.

    For your problem a dbfile is a reasonable choice. But I'll note that, if you can, you really want to pre-sort your data and then store it in BTrees as much as possible. That will massively improve your locality of reference, which will reduce disk seeks. And I guarantee that with that problem you're being killed on disk seeks.
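The pre-sorting advice above can be illustrated without any database at all. The Python sketch below is not BerkeleyDB's API; it is an assumed, simplified merge pass (the name `merge_counts` is hypothetical) showing why sorted input helps: both sequences are read strictly front to back, so nothing ever jumps to a random location the way an unsorted lookup would.

```python
def merge_counts(sorted_ref, sorted_reads):
    """Count occurrences of each reference key among the reads.

    Assumes sorted_ref holds unique keys in ascending order and
    sorted_reads is in ascending order (duplicates allowed).
    Both inputs are consumed in one sequential pass.
    """
    counts = {}
    reads = iter(sorted_reads)
    current = next(reads, None)
    for key in sorted_ref:
        hits = 0
        # Skip reads that sort before this reference key.
        while current is not None and current < key:
            current = next(reads, None)
        # Consume every read equal to this reference key.
        while current is not None and current == key:
            hits += 1
            current = next(reads, None)
        counts[key] = hits
    return counts
```

Inserting keys into a BTree in sorted order gets a similar benefit, since consecutive keys land on the same or neighboring pages.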

Re^5: when to c, when to perl
by TGI (Vicar) on Jul 26, 2008 at 01:19 UTC

    tilly is correct. For a case where the main issue is simply small RAM, adding another library will just use more memory and exacerbate the problem.

    Your situation is different. BerkeleyDB is certainly up to the task. I used the term dbfile as a generic term for databases like Berkeley and GDBM. I'm sorry if this sloppy usage was confusing.

    I don't work much with large scale data processing tasks like this, but it looks like you are on the right track with your plans.

    Using BerkeleyDB is a way to offload memory intensive and speed critical operations to a C library through XS. That is exactly what has been widely advocated in this thread. The best thing is that someone else has already written and carefully optimized this code. What could be better than that?


    TGI says moo
