in reply to Performance quandary

I'm a little surprised that nobody has mentioned that you're dug in up to the axle with recursive calls to add_entry. What's more, the way it happens suggests that your database is expected to have an expandable number of columns, since you appear to be splitting out each directory as a $key.

As I understand it, you want to index a collection of URIs by an md5 digest. It looks like both the location and the content are digested together. I don't grok exactly how much of the paths you need to digest. If you need to track moved files, perhaps you should digest files and locations separately.
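As a sketch of what I mean (Python for brevity; the key names and the two-keys-per-object scheme are my own invention, not your schema), digesting location and content separately lets a moved file keep its content key:

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Hex md5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()

# Hypothetical object: digest the location and the content separately,
# so a file that moves keeps the same content key.
location = "http://example.com/images/logo.png"
content = b"...the file's bytes..."

location_key = md5_hex(location.encode("utf-8"))
content_key = md5_hex(content)

# Index the object under both keys; a rename changes only location_key.
index = {location_key: content_key}
```

A rename then means updating one index entry rather than re-digesting and re-inserting the whole object.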

I think your scaling problems will disappear if you replace the pseudocode:

    sub yours (@) {
        peel_off_an_extra_arg_with_sides(@_) or return;
        ...
    }

with a refactoring along these lines:

    use URI;

    sub ThisAddUri {
        my @rows = map { md5Nstuff( URI->new($_) ) } @_;
        updateDB(@rows);
    }

or, maybe better, make the db part just an error-resistant wrapper, and keep the data-array prep a separate call.

You will come out way ahead if you get rid of the recursive calls, however you do it.

After Compline,

Replies are listed 'Best First'.
Re: Re: Performance quandary
by SwellJoe (Scribe) on Feb 24, 2002 at 09:23 UTC
    Actually, the recursion is not an issue at all, as far as I can see. The depth to which any recursive call will go is extremely small, except the first time a parent tree is built. After that, recursion only happens once per entry (and that recursion would simply have to be another function call to a routine doing almost exactly what add_entry does). I was curious about the recursion as well, suspecting a bug leading to full-depth recursion on every request, but profiler output confirms that call depth is not an issue here. I experimented some with taking out the escape clauses, and recursion depth then shows up very clearly in the dprofpp -T output--with the routine as it exists here, the average depth does not grow measurably deeper as the program progresses. So we're definitely not 'in up to the axles'; that would show up as an ever-increasing depth when profiling.

    In reply to "As I understand it, you want to index a collection of URI's by an md5 digest": we're indexing objects that already have an md5 digest (we get it from Squid, which indexes its object store by md5). This is a requirement of the database--it can't work without being indexed by md5s that match the Squid md5 index exactly, because otherwise we wouldn't be able to remove an item from our database when Squid RELEASES it. The digest I'm generating is for a parent object, which may or may not exist in the Squid object store--but we have to have it, because we need to be able to pull a complete object tree by requesting, for example "". Our DB has to be able to find all of the child objects of the requested URL.

    Now I don't actually understand your pseudo-code example (what is make_updateVectors?), but I suspect you're missing the critical purpose of the recursive call in this function: We need a complete tree in the database, all the way up to the root node of this URL. We must be able to make a request of this database for an object and all of its children. However, when building the database and when adding new objects when in daemon mode, we have no guarantee that all of the parent objects will exist in the database.

    So, perhaps I'm wrong, but it looks like you've created a function to simply add a single entry (plus possibly a single parent?) to the db. Now it may be entirely true that indexing both the object and its relationship to other objects in the tree (or its child objects anyway) is not the most efficient use of the database, but I don't see that using recursion is fundamentally flawed here or that an iterative solution would make a major performance difference. The usual recursive depth is one level (for the parent entry update)...that isn't significant overhead, is it? I don't see how it could be, unless Perl has a tremendous recursion overhead, which I have never seen mentioned anywhere.

    Am I missing something from this picture that is misleading me into thinking an iterative solution would alter the runtime very little, if at all? We still have to construct the parent tree, whether we do it recursively or iteratively--the number of function calls and db writes doesn't change one way or the other. Recursion seems the right way to do it to me, since we've already got a routine that does everything we need for every object up the tree; an iterative solution would require repeating most of the same code from add_entry within the block that iterates (I actually started writing it that way, but soon realized I needed all of the same actions for the parent entries that I need for the current entry).
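    For what it's worth, the shape of the routine we're debating can be sketched like this (Python for brevity; the helper names are invented, and a dict stands in for the real md5-keyed database). The escape clause is what keeps the usual recursion depth at one level once the parent tree exists:

```python
import hashlib

db = {}  # stands in for the real md5-keyed object database

def key_for(url: str) -> str:
    """md5 key for a URL, analogous to the Squid md5 index."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()

def parent_of(url: str):
    """'http://h/a/b' -> 'http://h/a'; None once we reach the host root."""
    scheme, sep, rest = url.partition("://")
    if "/" not in rest:
        return None
    return scheme + sep + rest.rsplit("/", 1)[0]

def add_entry(url: str):
    """Insert url, recursively ensuring its whole parent chain exists."""
    k = key_for(url)
    if k in db:
        return  # escape clause: the tree above here is already built
    parent = parent_of(url)
    if parent is not None:
        add_entry(parent)  # usually returns immediately at the escape clause
        db[key_for(parent)]["children"].append(k)
    db[k] = {"url": url, "children": []}
```

    Once "http://h/a" is in the db, adding "http://h/a/c" recurses exactly once and bails out at the escape clause, which matches the shallow average depth the profiler reports.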