Re (tilly) 1: Performance quandary

in reply to Performance quandary

Are you sure that the database is as optimized as it can be? Based on what is pointed out here I would suggest trying to make it a BTREE instead of a hash.

I also note that your performance figures strike me as very odd. As you say, your figures shouldn't scale that badly, and dbms haven't when I have tried them. Based on that, I would look for things you might be doing that result in very large values that you keep fetching back and rewriting (so the dbm may be finding them very efficiently, but you would be writing them very slowly). And what I see is that you are keeping lots of data in the values, and that data keeps growing as you add kids.

Right there I see evidence of a bad algorithm. Rather than storing key/value pairs with large amounts of information in the values, which you keep fetching and manipulating, you want a more complex data structure that you can add to with less work. Do this and you should have no scalability problems at all. Conversely, if you try to write it in C and continue to use this data structure, you will hit the same performance problem.
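
To put rough numbers on that, here is a back-of-the-envelope sketch. The kid count and record size are made up for illustration and are not taken from the original poster's data; the point is only the shape of the growth when a value that holds every kid is rewritten on each insert.

```perl
use strict;
use warnings;

# Hypothetical sizes, chosen only for illustration.
my $kids      = 1000;   # kids added to a single parent
my $kid_bytes = 100;    # serialized size of one kid record

# Scheme A: every kid lives inside the parent's value, so adding kid
# number $n writes back a value that now holds $n kids.
my $rewrite_total = 0;
$rewrite_total += $_ * $kid_bytes for 1 .. $kids;

# Scheme B: each kid is its own small entry, written once. (A small,
# constant-size bookkeeping write per insert is ignored here.)
my $append_total = $kids * $kid_bytes;

printf "rewrite whole value: %d bytes written\n", $rewrite_total;
printf "one entry per kid:   %d bytes written\n", $append_total;
# prints:
# rewrite whole value: 50050000 bytes written
# one entry per kid:   100000 bytes written
```

The first total grows with the square of the number of kids, the second only linearly, and that holds whichever language does the writing.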

As for how you want to implement your data structure, that is up to you. I would consider this to be a good "proof of concept" project for DBI with DBD::SQLite.

Alternatively, you can sit down with a good data structures and algorithms book and try to roll your own data structure that scales better to large numbers of kids. For instance, you might store in the parent an entry with some structured information, one piece of which is how many kids it has. To add a kid, pull this out, parse it, increment the number of kids, add an entry with a key like "$parent_mdb|$kid_no" in which you store the kid, and store the parent again. Sure, you have to edit two entries, but both are small, so you keep the performance you had when you had few entries and never degrade.
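
A minimal sketch of that scheme, under some loudly stated assumptions: %db is a plain in-memory hash standing in for the dbm (with DB_File it would be tied to a file instead), the "field=value;field=value" record format is invented for this example, and the helper names load_parent, save_parent, add_kid and kids_of are hypothetical. Only the "$parent_mdb|$kid_no" key layout comes from the description above.

```perl
use strict;
use warnings;

# In-memory stand-in for the dbm; the code only needs a hash interface.
my %db;

# Parse a parent's small structured record. The "k=v;k=v" format is a
# toy invented for this sketch, so fields must not contain ";".
sub load_parent {
    my ($parent_mdb) = @_;
    my %rec = (kids => 0);
    if (defined(my $raw = $db{$parent_mdb})) {
        for my $pair (split /;/, $raw) {
            my ($k, $v) = split /=/, $pair, 2;
            $rec{$k} = $v;
        }
    }
    return \%rec;
}

# Serialize the parent's record and store it again.
sub save_parent {
    my ($parent_mdb, $rec) = @_;
    $db{$parent_mdb} = join ';', map { $_ . '=' . $rec->{$_} } sort keys %$rec;
}

# Add one kid: two small writes, no matter how many kids already exist.
sub add_kid {
    my ($parent_mdb, $kid) = @_;
    my $rec    = load_parent($parent_mdb);
    my $kid_no = ++$rec->{kids};
    $db{"$parent_mdb|$kid_no"} = $kid;    # small write 1: the kid itself
    save_parent($parent_mdb, $rec);       # small write 2: the counter
    return $kid_no;
}

# Walk all kids of a parent by counting from 1 up to the stored total.
sub kids_of {
    my ($parent_mdb) = @_;
    my $n = load_parent($parent_mdb)->{kids};
    return map { $db{"$parent_mdb|$_"} } 1 .. $n;
}

add_kid('parent-key', 'first kid');
add_kid('parent-key', 'second kid');
print join(', ', kids_of('parent-key')), "\n";   # prints: first kid, second kid
```

Finding every kid of a parent stays easy (count from 1 to the stored total), and the parent's value never grows beyond a few bytes.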

Replies are listed 'Best First'.
Re: Re (tilly) 1: Performance quandary
by SwellJoe (Scribe) on Feb 24, 2002 at 04:12 UTC

Thanks for your thoughts, Tilly. BTree gave me about 5% (already tried it a couple of times with both DB_File and BerkeleyDB). The current system is using BTree (I should update the previous database post to show the most recent numbers and specific configuration choices).

I think you might have tapped into something with the notion of a very simple write (except a lot more of them) rather than a pull->parse->add->write on the parent each time.

My reason for choosing the data structure I have is that from a single parent I must be able to quickly poll through all of its children and subchildren. The key requirement for the parent->child relationship is that from any parent, all of its children can be found. The child doesn't need to store its parent, because that can be generated from what we know of the child (the URL--find_parent already does this in a ~two line function).

That said, I think you're probably right about removing the requirement for pulling and pushing large objects, though the objects don't grow as much as the real-world behavior suggests they do. Anyway, I won't know until I try it, so I'm going to try to figure out a database structure that will permit this kind of relationship without requiring the parent to store everything about its immediate kids.

It seems I'm going to need two entries per object to account for the 'any child can be a parent to other objects' paradigm I'm dealing with. So $parent_mdb|$kid_no will store the object info, while $kid_mdb will store its child info, plus the parent key so the first entry can be removed when this one is. I think this is necessary since we need to be able to seek to any object...I suppose I could, in the seek code, use find_parent to seek up the tree until the parent is located and then poll back down to find the object. More efficient to have two entries, I presume?

I guess I'll just go try it both ways and see which one makes me wait the longest. I'll give DBI and DBD::SQLite a perusal as well. Will be interesting to see what works best. Results to follow...
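
For what it's worth, the two-entries-per-object layout described above might look something like the sketch below. Everything beyond the two key shapes is an assumption: %db is an in-memory stand-in for the tied dbm hash, the "k=v;k=v" record format is invented, and read_rec, write_rec, add_object, remove_object and kids_of are hypothetical helper names rather than anything from the real code.

```perl
use strict;
use warnings;

my %db;   # in-memory stand-in for the tied dbm hash

# Read an object's own record: its kid counter plus the key of the slot
# it occupies under its parent. Toy "k=v;k=v" format, no ";" in fields.
sub read_rec {
    my ($key) = @_;
    my %rec = (kids => 0, slot => '');
    for my $pair (split /;/, $db{$key} // '') {
        my ($k, $v) = split /=/, $pair, 2;
        $rec{$k} = $v;
    }
    return \%rec;
}

sub write_rec {
    my ($key, $rec) = @_;
    $db{$key} = join ';', map { $_ . '=' . $rec->{$_} } sort keys %$rec;
}

# Adding an object touches three small entries: the slot under the
# parent, the parent's counter, and the object's own record.
sub add_object {
    my ($parent_mdb, $kid_mdb, $info) = @_;
    my $parent = read_rec($parent_mdb);
    my $slot   = $parent_mdb . '|' . ++$parent->{kids};
    $db{$slot} = $info;                  # entry 1: the object info
    write_rec($parent_mdb, $parent);     # bump the parent's kid counter
    my $own = read_rec($kid_mdb);
    $own->{slot} = $slot;                # back-pointer to entry 1
    write_rec($kid_mdb, $own);           # entry 2: child info + slot key
    return $slot;
}

# Removing an object follows the back-pointer so both entries go away.
sub remove_object {
    my ($kid_mdb) = @_;
    my $own = read_rec($kid_mdb);
    delete $db{ $own->{slot} } if length $own->{slot};
    delete $db{$kid_mdb};
}

# Slots of removed objects leave holes in the numbering, so skip them.
sub kids_of {
    my ($parent_mdb) = @_;
    my $n = read_rec($parent_mdb)->{kids};
    return grep { defined } map { $db{"$parent_mdb|$_"} } 1 .. $n;
}

add_object('root', 'page-a', 'info for a');
add_object('root', 'page-b', 'info for b');
remove_object('page-a');
print join(', ', kids_of('root')), "\n";   # prints: info for b
```

Two things this sketch deliberately leaves open: removal does not renumber the parent's remaining slots (hence the hole-skipping in kids_of), and it does not recurse into the removed object's own children.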

In Section Seekers of Perl Wisdom