Re: Performance quandary
by Anonymous Monk on Feb 24, 2002 at 10:44 UTC
I won't comment on the Berkeley DB optimisations; others have commented enough on this that I suspect you'll reach your 15/sec mark without issue.
However, it is important to note that the greatest performance advances are always acquired by looking at the whole problem, rather than any particular small part of it.
In this case, optimising the Berkeley DB is a lot of effort expended on what is really not the problem. A Berkeley DB is designed to store data against a string-based key of unknown randomness and unknown size.
In your favor, you have an excellent knowledge of the operating conditions: how many entries you need, etc. You also have an excellent key algorithm for free: md5 has excellent spread and can be used as the hash itself, rather than forcing the Berkeley DB to re-hash the hash, as it were.
Thus my suggestion, if you really require performance, is not to re-write in C/C++, which will not resolve the problem, although it might get you the 15 hits/sec you're looking for. I would instead re-write my database backend, in Perl, to use something like a fixed-size disk file indexed by the first n characters of the md5 (whatever suits), with start-time configurable parameters for the number of buckets per entry and overflow so that you can tweak those. Then just seek into the thing using the hash and get the data you need.
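To make that concrete, here is a rough sketch of such a store, not a tuned implementation. Every name in it (open_store, store_put, store_get, and the prefix_len, slots, overflow and value_len options) is made up for illustration. It assumes the key is already a 32-character md5 hex string, that values are short and contain no NUL bytes, and that nothing is ever deleted. Overflow is handled the simplest way I could think of: if the home bucket is full, spill into the next few buckets.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Open (or create) a fixed-size bucket file. All option names and defaults
# are assumptions made for this sketch.
sub open_store {
    my (%opt) = @_;
    my %s = (prefix_len => 4, slots => 4, overflow => 2, value_len => 32, %opt);
    $s{slot_len}   = 1 + 16 + $s{value_len};    # used flag + raw md5 + value
    $s{bucket_len} = $s{slots} * $s{slot_len};
    $s{buckets}    = 16 ** $s{prefix_len};      # one bucket per hex prefix
    my $is_new = !-e $s{path};
    open(my $fh, $is_new ? '+>' : '+<', $s{path}) or die "open $s{path}: $!";
    binmode $fh;
    if ($is_new) {
        # Writing one byte at the end sizes the file; unwritten bytes read as zero.
        seek($fh, $s{buckets} * $s{bucket_len} - 1, SEEK_SET) or die "seek: $!";
        print {$fh} "\0" or die "write: $!";
    }
    $s{fh} = $fh;
    return \%s;
}

# Scan the home bucket, then up to 'overflow' following buckets.
# Returns (position, found, value), or an empty list if every slot is taken.
sub _locate {
    my ($s, $md5hex) = @_;
    my $digest = pack('H32', $md5hex);
    my $home   = hex(substr($md5hex, 0, $s->{prefix_len}));
    for my $step (0 .. $s->{overflow}) {
        my $base = (($home + $step) % $s->{buckets}) * $s->{bucket_len};
        seek($s->{fh}, $base, SEEK_SET) or die "seek: $!";
        my $got = read($s->{fh}, my $buf, $s->{bucket_len});
        die "short read at $base" unless defined $got && $got == $s->{bucket_len};
        for my $i (0 .. $s->{slots} - 1) {
            my $off  = $i * $s->{slot_len};
            my $slot = substr($buf, $off, $s->{slot_len});
            # An empty slot ends the search: with no deletes, nothing lies beyond it.
            return ($base + $off, 0) if substr($slot, 0, 1) eq "\0";
            return ($base + $off, 1, unpack('Z*', substr($slot, 17)))
                if substr($slot, 1, 16) eq $digest;
        }
    }
    return;
}

# Insert or update. Returns 1 on success, 0 if home and overflow buckets are full.
sub store_put {
    my ($s, $md5hex, $value) = @_;
    die "value longer than $s->{value_len} bytes" if length($value) > $s->{value_len};
    my ($pos) = _locate($s, $md5hex);
    return 0 unless defined $pos;
    seek($s->{fh}, $pos, SEEK_SET) or die "seek: $!";
    print { $s->{fh} } "\1", pack('H32', $md5hex), pack("a$s->{value_len}", $value)
        or die "write: $!";
    return 1;
}

# Returns the stored value, or undef if the key is not present.
sub store_get {
    my ($s, $md5hex) = @_;
    my ($pos, $found, $value) = _locate($s, $md5hex);
    return $found ? $value : undef;
}
```

You would call open_store(path => 'pages.dat') once at start-up, then store_put($s, $md5, $data) and store_get($s, $md5) per request. A lookup costs one seek and one read in the common case, and a store_put that returns 0 is the signal to raise slots or overflow (or lengthen the prefix) and rebuild the file.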
Yes, it's more work; specialisation always is. But it will give you the fastest performance you're likely to get without buying better hardware (on that note, just getting some faster hardware is always a good option :). If you've got the time, do some reading on high-performance data structures and see what you can find. Remember to boost kernel and on-disk caches for added performance. Also remember to dump out lots of stats on disk/cache hits etc., so you can work out whether you need to implement an intermediate in-memory application-specific cache, or use a different portion of the md5 string because it's better distributed (unlikely :).
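For the stats and the in-memory cache, something along these lines would do as a starting point. It is only a sketch: make_cached_lookup and report are hypothetical names, $fetch stands for whatever code ref does your real disk lookup, and the eviction is plain first-in-first-out rather than anything clever.

```perl
use strict;
use warnings;

# Wrap a disk lookup in a small in-memory cache and count what happens.
# $fetch is a code ref taking a key and returning the value or undef.
sub make_cached_lookup {
    my ($fetch, $max_entries) = @_;
    my (%cache, @order);
    my %stats = (cache_hits => 0, disk_reads => 0, misses => 0);
    my $lookup = sub {
        my ($key) = @_;
        if (exists $cache{$key}) {
            $stats{cache_hits}++;
            return $cache{$key};
        }
        $stats{disk_reads}++;                 # had to go to disk
        my $value = $fetch->($key);
        if (!defined $value) {
            $stats{misses}++;                 # key is not stored at all
            return undef;
        }
        $cache{$key} = $value;
        push @order, $key;
        # First-in-first-out eviction keeps the cache at a fixed size.
        delete $cache{ shift @order } while @order > $max_entries;
        return $value;
    };
    return ($lookup, \%stats);
}

# One-line summary suitable for dumping to a log at intervals.
sub report {
    my ($stats) = @_;
    my $total = $stats->{cache_hits} + $stats->{disk_reads};
    my $ratio = $total ? $stats->{cache_hits} / $total : 0;
    return sprintf("lookups=%d cache_hits=%d disk_reads=%d misses=%d hit_ratio=%.2f",
        $total, @{$stats}{qw(cache_hits disk_reads misses)}, $ratio);
}
```

If the hit ratio stays low even with a generous cache size, the application-level cache is not earning its keep and you are better off leaving the job to the kernel's own caching.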
All these things will buy you serious performance boosts, not little fraction-of-a-second increments, at the cost of really having to understand what you are trying to achieve.
Moving to different languages etc., except in rare circumstances where a given language is actually a big part of the optimal solution, will never replace designing to perform.