Hello monks,

In a recent post, Berkeley DB performance, profiling, and degradation..., I posed a query about database performance with the Berkeley DB and DB_File (and by the end of the thread, BerkeleyDB and MLDBM::Sync). I think performance of the database portion of my code is as fast as it is going to get. But I'm still shy of the mark that my code absolutely must hit to work. So, out of a stubborn belief that this sort of code shouldn't have to be written in C/C++ to work, I'm going to keep on beating on it, I hope with a little help.

The most expensive, and third most repeated, call in my program is add_entry. In profile output, 105,000 entries take roughly 571 seconds of exclusive time and 1450 cumulative seconds. Obviously it isn't the only consumer of time, but it is the most guilty party now that I've tuned the db access as much as seems possible. I've already optimized for the most used path, and eliminated all of the extra database fetches and unnecessary variable creations that I could find. So now I can't spot anything left to eliminate or tune, but I know there are a thousand or more monks here who know more than I about these things.

So, without further ado, I give you the code:

    sub add_entry {
        $log->print("Entering add_entry.", 3);
        my ($md5, $method, $key, $exists, $children) = @_;
        my ($parent, $parentmd5, $parentchildren);

        unless ( $children ) { $children = ''; }

        $urldb->db_put($md5, (join(' ', $key, $exists, $children)));

        # It's an old entry--it already has parents. Can leave.
        if ( $children =~ ':' ) { return; }

        $parent = find_parent($key);

        # No more parents, or can't parent ourselves...
        if ( $parent eq "http:/" or $parent eq $key ) { return; }

        $parentmd5 = uc(md5_hex($method, $parent));

        if ( $urldb->db_get($parentmd5, $object)==0 ) {
            $parentchildren = (split (' ', $object))[2];
        }
        else { $parentchildren = ''; }

        unless ( $md5 ) {
            $md5 = '';
            $log->print("What, I say, what?!", 3);
        }

        # Is this child already listed?
        unless ( $parentchildren =~ $md5 ) {
            if ( $parentchildren ) {
                $log->print("Inserting parent: $parent with children: $parentchildren:$md5", 3);
                add_entry($parentmd5, $method, $parent, 0, "$parentchildren:$md5");
            }
            else {
                $log->print("Inserting parent: $parent with child: $md5", 3);
                add_entry($parentmd5, $method, $parent, 0, $md5);
            }
        }
        return;
    }

The common case is the first one, where an object is updated with new children (this happens roughly twice as often as a new entry in my test data set). Then we do a short circuit check to see if we were called without children, meaning that the object doesn't exist already--so create a new object and add an entry to the parent to point to it.
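For anyone who wants to poke at the record layout in isolation, here is a minimal sketch of it as the routine above uses it: key, exists flag, and a colon-separated child list, joined by single spaces. The helper names pack_record and unpack_record and the sample URL are hypothetical (they are not in my program), and the sketch assumes keys never contain whitespace.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of the original program.
# Build the stored value the same way add_entry does before db_put.
sub pack_record {
    my ($key, $exists, $children) = @_;
    $children = '' unless defined $children;
    return join(' ', $key, $exists, $children);
}

# Split a stored value back into its three fields.
# With an empty child list, split ' ' returns only two fields,
# so the third one is undef and has to be defaulted.
sub unpack_record {
    my ($record) = @_;
    my ($key, $exists, $children) = split ' ', $record;
    $children = '' unless defined $children;
    return ($key, $exists, [ split /:/, $children ]);
}

# Made-up sample data, for illustration only.
my $record = pack_record('http://example.com/a/', 0, 'AAAA:BBBB');
my ($key, $exists, $kids) = unpack_record($record);
print "$key has ", scalar(@$kids), " children\n";
```

The undef case is the reason add_entry sets $parentchildren from (split ...)[2] and then tests it for truth before deciding which branch to take.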

After the children check, we proceed with figuring out whether we can add a parent (some objects can't have a parent because they are already the root object, for example). If the object can have a parent, we call add_entry again with the parent data. I was using a join on the first parent add_entry call to begin with (joining the $md5 with the old $parentchildren with a colon), but I kind of suspect a fixed string template version is faster.
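The join-versus-template suspicion is easy to test rather than guess at. Below is a rough sketch using the core Benchmark module; the sample child list and the sub names with_join and with_interp are made up for illustration, and real results will depend on how long the child lists actually get.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample data: a parent with 20 children, plus one new child.
# Each value is a 32-character uppercase hex string, like uc(md5_hex(...)).
my $parentchildren = join ':', map { sprintf '%032X', $_ } 1 .. 20;
my $md5            = sprintf '%032X', 21;

# Hypothetical names for the two variants being compared.
sub with_join   { return join(':', $parentchildren, $md5) }
sub with_interp { return "$parentchildren:$md5" }

# Sanity check: both variants must build the same string.
die "variants disagree" unless with_join() eq with_interp();

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    'join'   => \&with_join,
    'interp' => \&with_interp,
});
```

Whatever the table says, this line runs once per parent insert, so any difference here is likely to be small next to the db_put and db_get calls.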

So, am I missing something in this routine that needs tweaking? Can I reduce, refactor, or rewrite any of this routine to make it more efficient? Memory efficiency is literally of zero value here--we have tons of memory, but very little time in which to handle a lot of data (15 entries per second is the absolute minimum, while sharing the CPU/disk with an extremely power-hungry daemon). Right now, by the time the index reaches a half million entries, every new entry costs more than an eighth of a second (sometimes a lot more). So I've got to shave about a quarter second off of each entry, or rewrite this thing in C/C++. Am I doomed to writing and maintaining a low-level language app?

Thanks for any pointers anyone can offer up.

In reply to Performance quandary by SwellJoe
