Berkeley DB performance, profiling, and degradation... by SwellJoe (Scribe)
on Feb 19, 2002 at 14:13 UTC
SwellJoe has asked for the wisdom of the Perl Monks concerning the following question:
Hello Monks,
I've been working for a couple of weeks on a utility to index up to a couple million web objects (just the meta-data, not the actual objects) into a Berkeley DB. In this program I've used a tied hash for storage of the data and transparently utilizing the db. No explicit file locking is taking place at this point. Each object entry requires at least one FETCH (to see if the object already exists), and one STORE (potentially more, as a parent tree is built for every object--even if the parents don't actually exist in the cache, though they are marked as non-existent). Each entry has a URL, an existence indicator, and a list of child objects. It is working correctly, but is nowhere near fast enough for the task at hand (which is to follow a running Squid and index the object store). It is quite zoomy up to a few thousand objects, handling over 200 entries per second for the first few thousand. But then performance degrades rapidly, eventually climbing to more than a second per entry. I was considering a rewrite in C, but before embarking on such a drastic and time-consuming journey, I decided to do some profiling. What I found surprised me...so much so that I think I must be misinterpreting the results or (more likely) using the Berkeley DB_File module in ways that are inefficient. Here's the result of a profile of a ~30000 object run:
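To make the shape of the code concrete, here's a stripped-down sketch of the arrangement (this is illustrative only, not my actual code; $db_path, add_entry, and parent_of are stand-in names):

    #!/usr/bin/perl -w
    use strict;
    use DB_File;
    use Fcntl qw(O_CREAT O_RDWR);

    my $db_path = '/tmp/squid-meta.db';   # illustrative path
    tie my %cache, 'DB_File', $db_path, O_CREAT | O_RDWR, 0644, $DB_HASH
        or die "Cannot tie $db_path: $!";

    # Record format: existence flag, then tab-separated child URLs.
    sub add_entry {
        my ($url, $exists) = @_;
        return if exists $cache{$url};    # at least one lookup per entry
        $cache{$url} = $exists;           # one STORE
        my $parent = parent_of($url);
        if (defined $parent) {
            add_entry($parent, 0);        # parent marked non-existent
            $cache{$parent} .= "\t$url";  # FETCH + STORE to append child
        }
    }

    # Hypothetical helper: strip the last path component, if any.
    sub parent_of {
        my ($url) = @_;
        return $url =~ m{^(.+)/[^/]+/?$} ? $1 : undef;
    }

    add_entry('http://example.com/images/logo.gif', 1);

The point is just that every entry costs at least one lookup and one store against the tied hash, and parent maintenance adds a FETCH/STORE pair on top of that.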
From this it is clear that my add_entry routine is taking most of the time, with the logging functions following close behind. I've already stripped the routine to the bone, and runtime dropped to a quarter of its original. I've also moved the database (which never grows larger than 100MB) onto a RAM disk to remove disk latency, which brought the runtime down further. Woohoo!

So I've done all I can think of to make things go fast, but it is still too slow for real-world use, as shown by this profile of a run of about 60,000 objects, twice the size of the previous run:

Everything dealing with the DB is several times slower! We're still able to maintain reasonable speeds at this point, but the slowdown continues as more objects are entered into the database. Yes, I know it has to get slower as the db has to seek more for each FETCH and STORE, but isn't it the job of the Berkeley DB to manage the data efficiently enough that a retrieval or store doesn't require seconds, which is where we end up by the time the db has a million entries? Squid itself stores and retrieves entire objects, plus this same tiny amount of meta-data, several orders of magnitude faster than I'm able to coax from the BerkeleyDB module.

What is interesting, to me, is that my add_entry routine isn't doing any more work for an entry at 500,000 than it is for the very first entry. It pulls its data from a single file of the exact same format every time, parses it, and feeds it into the db-tied hash. The only difference is the size of the database, so as far as I can tell the speed reduction isn't in my Perl code but in the Berkeley DB.

So I'm posting in hopes that someone here knows how to make the Berkeley DB perform reasonably with a 50-100MB database (I know, it's not a very big DB; I thought performance would be automatic).

One possible way to proceed would be to use the database only for downtime storage and use an in-memory hash with no ties during normal operation (see the sketch at the end of this post). This is not an elegant solution, as I would then have to integrate the program that uses this database into this one, which means multiplexing two disparate functions into a single daemon process so they can share the data efficiently. I don't know whether this would be significantly faster, if at all. I do have plenty of memory to work with (virtually unlimited in this scenario: the total meta-index only requires 50-100MB, so my 2GB boxes won't even notice if I keep it in core), so I can trade memory for speed.

I've tried increasing the $DB_HASH->{cachesize} parameter to 8 and 16MB with literally no measurable result, which is not too surprising since I'm already operating only on RAM. Switching to a $DB_BTREE database made no difference either.

Another thought was to move the DB writes into their own function and perform 'batch' entries by calling that function with a reference to a hash of db entries. However, I can't locate any documentation on how to do a batch STORE, and the current code is moderately recursive, which makes batches difficult: each batch would be 256 objects, and some of those objects may need to update others in the same batch, breaking out of batch mode, so I would need to treat the temporary storage the same way I currently treat the persistent storage. Complexity would go way up, so I'm loath to even start down that path if there is any alternative.
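For reference, the two tuning knobs I tried are spelled like this in DB_File; the path and the 16MB figure just mirror the values above:

    use strict;
    use DB_File;
    use Fcntl qw(O_CREAT O_RDWR);

    my $db_path = '/ramdisk/squid-meta.db';   # illustrative path

    # Raise the Berkeley DB cache on the info object *before* the tie;
    # $DB_HASH is the DB_HASH info object that DB_File exports.
    $DB_HASH->{'cachesize'} = 16 * 1024 * 1024;   # 16MB cache
    tie my %cache, 'DB_File', $db_path, O_CREAT | O_RDWR, 0644, $DB_HASH
        or die "Cannot tie $db_path: $!";

    # Or the same database as a btree instead of a hash:
    # $DB_BTREE->{'cachesize'} = 16 * 1024 * 1024;
    # tie my %cache, 'DB_File', $db_path, O_CREAT | O_RDWR, 0644, $DB_BTREE;

Note that cachesize is only consulted at tie time, so setting it on an already-tied hash has no effect.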
Anyway, if someone has experience with DB_File indexing ~1 million objects (tiny objects--maximum of a couple hundred bytes each), with good performance, I'd be much obliged if you'd tell me how you did it. Thanks for any pointers anyone can provide.
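P.S. For completeness, the "in-memory hash, database only for downtime storage" idea mentioned above would look roughly like this; load_index, save_index, and the bulk copies are untested sketch code, not something I've benchmarked:

    use strict;
    use DB_File;
    use Fcntl qw(O_CREAT O_RDWR);

    my %index;   # plain hash: no tie, no per-entry disk I/O

    # Slurp the whole db into core at startup (~50-100MB).
    sub load_index {
        my ($db_path) = @_;
        tie my %db, 'DB_File', $db_path, O_CREAT | O_RDWR, 0644, $DB_HASH
            or die "Cannot tie $db_path: $!";
        %index = %db;
        untie %db;
    }

    # One bulk write back to the db at shutdown.
    sub save_index {
        my ($db_path) = @_;
        tie my %db, 'DB_File', $db_path, O_CREAT | O_RDWR, 0644, $DB_HASH
            or die "Cannot tie $db_path: $!";
        %db = %index;
        untie %db;
    }

    load_index('/tmp/squid-meta.db');
    $index{'http://example.com/'} = '1';   # normal operation: memory speed
    save_index('/tmp/squid-meta.db');

Normal operation would then touch only %index, a plain hash; the tied db would be used only at startup and shutdown.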