comment on

Hello monks,

In a recent post Berkeley DB performance, profiling, and degradation..., I posed a query about database performance with the Berkeley DB and DB_File (and by the end of the thread, BerkeleyDB and MLDBM::Sync). I think performance of the database portion of my code is as fast it is going to get. But I'm still shy of the mark that my code absolutely must hit to work. So, out of a stubborn belief that this sort of code shouldn't have to be written in C/C++ to work, I'm going to keep on beating on it, I hope with a little help.

The most expensive, and third most repeated call in my program is called add_entry. In profile output, 105,000 entries takes roughly 571 seconds of exclusive time and 1450 cumulative seconds--obviously it isn't the only time user, but it is the most guilty party now that I've tuned the db access as much as seems possible. I've already optimized for the most used path, and eliminated all of the extra database fetches and unnecessary variable creations that I could find. So now I can't spot anything left to eliminate or tune, but I know there are a thousand or more monks here who know more than I about these things.

So, without further ado, I give you the code:

sub add_entry {
    $log->print("Entering add_entry.", 3);
    my ($md5, $method, $key, $exists, $children) = @_;
    my ($parent, $parentmd5, $parentchildren);

    unless ( $children ) { $children = ''; }

    $urldb->db_put($md5, (join(' ', $key, $exists, $children)));

    # It's an old entry--it already has parents.  Can leave.
    if ( $children =~ ':' ) {
        return;
    }

    $parent = find_parent($key);

    # No more parents, or can't parent ourselves...
    if ( $parent eq "http:/" or $parent eq $key ) {
        return;
    }

    $parentmd5 = uc(md5_hex($method, $parent));

    if ( $urldb->db_get($parentmd5, $object)==0 ) {
        $parentchildren = (split (' ', $object))[2];
    }
    else {
        $parentchildren = '';
    }

    unless ( $md5 ) { $md5 = ''; $log->print("What, I say, what?!", 3)
+; }
    # Is this child already listed?
    unless ( $parentchildren =~ $md5 ) {
        if ( $parentchildren ) {
            $log->print("Inserting parent: $parent with children: $par
+entchildren:$md5", 3);
            add_entry($parentmd5, $method, $parent, 0, "$parentchildre
+n:$md5");
        }
        else {
            $log->print("Inserting parent: $parent with child: $md5", 
+3);
            add_entry($parentmd5, $method, $parent, 0, $md5);
        }
    }
    return;
}
[download]

The common case is the first one, where an object is updated with new children (this happens roughly twice as often as a new entry in my test data set). The we do a short circuit check to see if we were called without children, meaning that the object doesn't exist already--so create a new object and add an entry to the parent to point to it.

After the children check, we proceed with figuring out whether we can add a parent (some objects can't have a parent because they are already the root object, for example). If a parent can happen we call add_entry again with the parent data. I was using a join on the first parent add_entry call to begin with (joining the $md5 with the old $parentchildren with a colon), but I kind of suspect a fixed string template version is faster.

So, am I missing something in this routine that needs tweaking? Can I reduce, refactor, rewrite any of this routine to make it more efficient? Memory efficiency is literally of zero value here--we have tons of memory, but very little time in which to handle a lot of data (15 entries per second is the absolute minimum, while sharing the CPU/disk with an extremely power hungry daemon). Right now, by the time the index reaches a half million entries, every new entry costs more than an eighth of a second (sometimes a lot more). So I've got to shave about a quarter second off of each entry, or rewrite this thing in C/C++. Am I doomed to writing and maintaining a low level language app?

Thanks for any pointers anyone can offer up.

In reply to Performance quandary by SwellJoe

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


Syntactic Confectionery Delight
	PerlMonks