Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW

Re: Performance quandary

by Anonymous Monk
on Feb 24, 2002 at 10:44 UTC ( #147166=note: print w/replies, xml ) Need Help??

in reply to Performance quandary

I won't comment on the Berkley DB optimisations, others have commented enough on this that i suspect you'll reach your 15/sec mark without issue.

However it is important to note that the greatest performance advances are always aquired by looking at the whole problem, rather than any particular small part of it.

In this case, optimising the Berkley DB is a lot of effort expended on what is really not the problem. A Berkley DB is designed to store data against a string-based key of unknown randomness, and unknown size.

In your favor, you have an excellent knowledge of the operating conditions, how many entries you need etc. You also have an excellent key algorithm for free, md5 has excellent spread and can be used as a hashing entity itself, rather than forcing the Berkley db to re-hash the hash as it were.

Thus my suggestion, if you really require performance, is not to re-write in C/C++, which will not resolve the problem, although it might get you the 15hits/sec you're looking for. I would instead re-write my database backend, in perl, to use something like a fixed-size disk file indexed by the first n characters of the md5 (whatever suits), with start-time configurable parameters for the number of buckets per entry and overflow so that you can tweak those. Then just seek into the thing using the hash and get the data you need.

Yes, its more work, specialisation always is, but it will give you the fastest performance you're likely to get without buying better hardware (on that note, just getting some faster hardware is always a good option :). If you've got the time, do some reading on high-performance data structures and see what you can find. Remember to boost kernel and on-disk caches for added performacne, remember to dump out lots of stats on disk/cache hits etc so you can work out whether you need to implement an intermediate in-memory application-specific cache, or use a different portion of the md5 string because its better distributed (unlikely :)

All these things will buy you serious performance boosts, not little fraction-of-a-second increments, at the cost of really having to understand what you are trying to achieve.

Moving to different languages etc, except in rare circumstances where a given language is actually a big part of the optimal solution, will never replace designing to perform.

Replies are listed 'Best First'.
Re: Re: Performance quandary
by SwellJoe (Scribe) on Feb 26, 2002 at 13:49 UTC
    This is an interesting thought, Anonymous.

    I'm cautious of reinventing the wheel here, because I have very little wheel building experience. The Berkeley DB ought to be able to do what I want--assuming I use it correctly, and after tilly's pointers and tips, I suspect I'm using it a bit incorrectly (or just expecting magic where there is none). At this point my plan is to rethink the way I interact with the database to reduce the size of the objects stored and avoid "exists?read->parse->modify->join->write" for every parent entry. I believe most of my problem is with the parenthandling, in that each parent can potentially grow rather large and gets modified a lot.

    Using tilly's ideas (and a couple that were pointed out to me in a chatterbox session, also with tilly) I'm going to use a database structure something like this:

    $parentmd5 == $url $exists $number_of_children

    $parentmd5:childnumber == $md5-pointer

    $childmd5 == $url $exists $number_of_children

    With this structure I will still have to update the parent entry to increment the number_of_children variable, but the entry will keep a fixed size (I suspect constant resizing is a hindrence) and an increment is cheaper than checking to see if the child is already in the list and appending it to a list of entries. I still have to address the check to see if the child already is in the list--which is where all of my troubles are coming in, as I don't know how to check to see if the child is already among the 'number_of_children' without polling through every child!

    So, it all comes down to this: My biggest quandary is how to handle the parent->child relationship efficiently....

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://147166]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others studying the Monastery: (4)
As of 2017-06-25 01:59 GMT
Find Nodes?
    Voting Booth?
    How many monitors do you use while coding?

    Results (564 votes). Check out past polls.