
Performance quandary

by SwellJoe (Scribe)
on Feb 23, 2002 at 07:05 UTC ( #147051=perlquestion )
SwellJoe has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

In a recent post Berkeley DB performance, profiling, and degradation..., I posed a query about database performance with the Berkeley DB and DB_File (and by the end of the thread, BerkeleyDB and MLDBM::Sync). I think performance of the database portion of my code is as fast as it is going to get. But I'm still shy of the mark that my code absolutely must hit to work. So, out of a stubborn belief that this sort of code shouldn't have to be written in C/C++ to work, I'm going to keep on beating on it, I hope with a little help.

The most expensive, and third most repeated call in my program is called add_entry. In profile output, 105,000 entries takes roughly 571 seconds of exclusive time and 1450 cumulative seconds--obviously it isn't the only time user, but it is the most guilty party now that I've tuned the db access as much as seems possible. I've already optimized for the most used path, and eliminated all of the extra database fetches and unnecessary variable creations that I could find. So now I can't spot anything left to eliminate or tune, but I know there are a thousand or more monks here who know more than I about these things.

So, without further ado, I give you the code:

sub add_entry {
    $log->print("Entering add_entry.", 3);
    my ($md5, $method, $key, $exists, $children) = @_;
    my ($parent, $parentmd5, $parentchildren);
    unless ( $children ) { $children = ''; }
    $urldb->db_put($md5, (join(' ', $key, $exists, $children)));
    # It's an old entry--it already has parents.  Can leave.
    if ( $children =~ ':' ) { return; }
    $parent = find_parent($key);
    # No more parents, or can't parent ourselves...
    if ( $parent eq "http:/" or $parent eq $key ) { return; }
    $parentmd5 = uc(md5_hex($method, $parent));
    if ( $urldb->db_get($parentmd5, $object)==0 ) {
        $parentchildren = (split (' ', $object))[2];
    }
    else { $parentchildren = ''; }
    unless ( $md5 ) { $md5 = ''; $log->print("What, I say, what?!", 3); }
    # Is this child already listed?
    unless ( $parentchildren =~ $md5 ) {
        if ( $parentchildren ) {
            $log->print("Inserting parent: $parent with children: $parentchildren:$md5", 3);
            add_entry($parentmd5, $method, $parent, 0, "$parentchildren:$md5");
        }
        else {
            $log->print("Inserting parent: $parent with child: $md5", 3);
            add_entry($parentmd5, $method, $parent, 0, $md5);
        }
    }
    return;
}
The common case is the first one, where an object is updated with new children (this happens roughly twice as often as a new entry in my test data set). Then we do a short-circuit check to see if we were called without children, meaning that the object doesn't exist already--so we create a new object and add an entry to the parent to point to it.

After the children check, we proceed with figuring out whether we can add a parent (some objects can't have a parent because they are already the root object, for example). If a parent can happen we call add_entry again with the parent data. I was using a join on the first parent add_entry call to begin with (joining the $md5 with the old $parentchildren with a colon), but I kind of suspect a fixed string template version is faster.
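That hunch about join versus a fixed string template is cheap to test directly with the core Benchmark module. A minimal sketch, with made-up stand-ins for $parentchildren and $md5 (the relative timings will vary by perl build, so treat the numbers as suggestive only):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Fake stand-ins for the variables used in add_entry.
my $parentchildren = join ':', ('A' x 32) x 10;   # a ten-child list
my $md5            = 'B' x 32;

# Sanity check: both spellings build the identical string.
die "mismatch" unless join(':', $parentchildren, $md5) eq "$parentchildren:$md5";

cmpthese(200_000, {
    join_call   => sub { my $s = join ':', $parentchildren, $md5 },
    interpolate => sub { my $s = "$parentchildren:$md5" },
});
```

Either way the difference is small per call; it only matters here because add_entry runs a hundred thousand times.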

So, am I missing something in this routine that needs tweaking? Can I reduce, refactor, rewrite any of this routine to make it more efficient? Memory efficiency is literally of zero value here--we have tons of memory, but very little time in which to handle a lot of data (15 entries per second is the absolute minimum, while sharing the CPU/disk with an extremely power hungry daemon). Right now, by the time the index reaches a half million entries, every new entry costs more than an eighth of a second (sometimes a lot more). So I've got to shave about a quarter second off of each entry, or rewrite this thing in C/C++. Am I doomed to writing and maintaining a low level language app?

Thanks for any pointers anyone can offer up.

(crazyinsomniac) Re: Performance quandary
by crazyinsomniac (Prior) on Feb 23, 2002 at 08:48 UTC
    I have at least 2 things to tell you about.
    My comments start with ##
    sub add_entry {
        $log->print("Entering add_entry.", 3);
        my ($md5, $method, $key, $exists, $children) = @_;
        my ($parent, $parentmd5, $parentchildren);
        unless ( $children ) { $children = ''; }
        ## useless use of join
        ## "$key $exists $children" better used here
        $urldb->db_put($md5, (join(' ', $key, $exists, $children)));
        # It's an old entry--it already has parents.  Can leave.
        ## useless use of a regular expression
        ## if( index($children,':') != -1) { # better used here
        if ( $children =~ ':' ) { return; }
        $parent = find_parent($key);
        # No more parents, or can't parent ourselves...
        if ( $parent eq "http:/" or $parent eq $key ) { return; }
        $parentmd5 = uc(md5_hex($method, $parent));
        if ( $urldb->db_get($parentmd5, $object)==0 ) {
            ## inefficient use of split here
            ## see perlfunc split for more details (try the 3 arg)
            ## split /PATTERN/,EXPR,LIMIT
            ## so
            ## $parentchildren = (split(' ',$object,3))[2];
            $parentchildren = (split (' ', $object))[2];
        }
        else { $parentchildren = ''; }
        unless ( $md5 ) { $md5 = ''; $log->print("What, I say, what?!", 3); }
        # Is this child already listed?
        ## style point, do you know what $foo =~ $bar is?
        ## are you going to remember?
        ## is $md5 a regex pattern?
        ## always be explicit (add //)
        ## useless use of a regex here as well
        ### unless(index($parentchildren,$md5) != -1) { # is better
        unless ( $parentchildren =~ $md5 ) {
            if ( $parentchildren ) {
                $log->print("Inserting parent: $parent with children: $parentchildren:$md5", 3);
                add_entry($parentmd5, $method, $parent, 0, "$parentchildren:$md5");
            }
            else {
                $log->print("Inserting parent: $parent with child: $md5", 3);
                add_entry($parentmd5, $method, $parent, 0, $md5);
            }
        }
        return;
    }
    Now i've seen you test/manipulate $parentchildren a few times using m// or split ... you might think of turning $parentchildren into some kind of hash or something, cause memory is easier to spare than extra function calls, right? (i think so)
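To make the hash idea concrete, here is a rough sketch (the list contents are invented): parse the colon-separated child list once, and every membership test after that is an O(1) hash lookup instead of an index() or m// scan over the whole string.

```perl
use strict;
use warnings;

my $parentchildren = 'AAAA:BBBB:CCCC';   # colon-separated list, as in the post
my $md5            = 'BBBB';

# Split once, then membership tests are hash lookups.
my %kids = map { $_ => 1 } split /:/, $parentchildren;

unless ( $kids{$md5} ) {
    # Not listed yet -- append and fall through to the parent update.
    $parentchildren = length($parentchildren) ? "$parentchildren:$md5" : $md5;
}
```

The split still costs something, so this only wins when a parent's list is tested more than once per fetch.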

    Of all the things I've lost, I miss my mind the most.
    perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"

      Thanks crazyinsomniac,

      Your suggested changes made a measurable difference (5%+, I think). I'm making similar alterations to the rest of my routines now. Though their impact is much less than add_entry, it certainly won't hurt to drop out a few regexes and splits/joins.

      And I've learned a nifty new function, index(). I always knew there was a way to do that, I just never figured out what it was.
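For anyone else meeting index() for the first time: it returns the position of a literal substring, or -1 when the substring is absent, with none of the regex machinery.

```perl
use strict;
use warnings;

my $children = 'AAAA:BBBB';

print index($children, ':'),    "\n";   # prints 4 (position of the colon)
print index($children, 'zzzz'), "\n";   # prints -1 (not found)
```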

Re (tilly) 1: Performance quandary
by tilly (Archbishop) on Feb 23, 2002 at 23:23 UTC
    Are you sure that the database is as optimized as it can be? Based on what is pointed out here I would suggest trying to make it a BTREE instead of a hash.
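With DB_File the switch is just a matter of passing $DB_BTREE as the tie's final argument; the filename and mode below are placeholders, not SwellJoe's actual configuration.

```perl
use strict;
use warnings;
use DB_File;
use Fcntl;

my %urldb;
# $DB_BTREE selects the BTree access method; $DB_HASH is the default.
tie %urldb, 'DB_File', 'urls.db', O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot tie urls.db: $!";

$urldb{SOMEMD5} = 'http://example/ 1 0';   # ordinary hash ops, BTree underneath
print $urldb{SOMEMD5}, "\n";
```

The BTree's locality of reference is the whole point: nearby keys land on nearby pages, so the cache hits more often.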

    I also note that your performance figures strike me as very odd. As you say, your figures shouldn't scale that badly, and dbms haven't when I have tried it. Based on that fact I would look for things you might be doing that would result in very large values that you keep on fetching back and writing (so that the dbm may be finding them very efficiently, but you would be writing them very slowly). And what I see is that you are keeping lots of data in the values, and that data set is constantly growing as you add kids.

    Right there I see evidence of a bad algorithm. Rather than storing key/value pairs with large amounts of information in the values which you keep on fetching and manipulating, you would like to have a more complex data structure which you can add to with less work. Do this and you should have no scalability problems at all. Conversely if you try to write in C and continue to use this data structure, you will hit the same performance problem.

    As for how you want to implement your data structure, that is up to you. I would consider this to be a good "proof of concept" project for DBI with DBD::SQLite.

    Alternately you can sit down with a good data structures and algorithms book and try to roll your own data structure that scales better to large numbers of kids. For instance you might want to store in the parent an entry with some structured information, one of the pieces of which is how many kids you have. To add a kid, pull this out, parse it, increment the number of kids, add an entry with a key like: "$parent_mdb|$kid_no" in which you store the kid, and store the parent again. Sure, you have to edit two entries, but both are small so you get the performance you had when you had few entries and don't ever degrade.
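A rough sketch of that scheme against a plain hash (which could just as well be tied to the dbm); add_kid, the record layout, and the '-' placeholder URL are invented for illustration. The point is that both writes stay small and fixed-size no matter how many kids accumulate.

```perl
use strict;
use warnings;

# Parent value: "URL EXISTS NKIDS".  Each kid gets its own tiny entry
# keyed "$parentmd5|$kid_no", so no value ever grows as kids are added.
sub add_kid {
    my ($db, $parentmd5, $kidmd5) = @_;

    my ($url, $exists, $nkids) = split ' ', ($db->{$parentmd5} || ''), 3;
    $url    = '-' unless defined $url && length $url;
    $exists = 0   unless defined $exists;
    $nkids  ||= 0;

    $db->{"$parentmd5|$nkids"} = $kidmd5;                     # small write #1
    $db->{$parentmd5} = join ' ', $url, $exists, $nkids + 1;  # small write #2
}

my %db = ( PARENT => 'http://example/ 1 0' );
add_kid(\%db, 'PARENT', 'KID_A');
add_kid(\%db, 'PARENT', 'KID_B');
# %db now holds PARENT|0 => KID_A, PARENT|1 => KID_B, and a kid count of 2.
```

Walking a parent's kids is then a loop from 0 to NKIDS-1 over the "$parentmd5|$n" keys.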

      Thanks for your thoughts, Tilly.

      BTree gave me about 5% (already tried it a couple of times with both DB_File and BerkeleyDB). The current system is using BTree (I should update the previous database post to show the most recent numbers and specific configuration choices).

      I think you might have tapped into something with the notion of a very simple write (except a lot more of them) rather than a pull->parse->add->write on the parent each time.

      My reason for choosing the data structure I have, is that from a single parent I must be able to quickly poll through all of its children and subchildren. The key requirement for the parent->child relationship is that from any parent, all of its children can be found. The child doesn't need to store its parent, because that can be generated from what we know of the child (the URL--find_parent already does this in a ~two line function).

      That said, I think you're probably right about removing the requirement for pulling and pushing large objects. Though the objects don't grow as much as the real world behavior indicates they do. Anyway, I won't know until I try it, so I'm going to try to figure out a database structure that will permit this kind of relationship without requiring the parent to store everything about its immediate kids.

      It seems I'm going to need two entries per object to account for the 'any child can be a parent to other objects' paradigm I'm dealing with. So $parent_mdb|$kid_no will store the object info, while the $kid_mdb will store its child info, plus the parent key so the first object can be removed when this one is. I think this is necessary since we need to be able to seek to any object...I suppose I could, in the seek code use find_parent to seek up the tree until the parent is located and then poll back down to find the object. More efficient to have two entries, I presume?

      I guess I'll just go try it both ways and see which one makes me wait the longest. I'll give DBI and DBD::SQLite a perusal as well. Will be interesting to see what works best. Results to follow...

Re: Performance quandary
by dws (Chancellor) on Feb 23, 2002 at 20:26 UTC
    Are you sure about this line?   if ( $parent eq "http:/" or $parent eq $key ) {
    Is $parent really going to be "http:/" and not "http:" or "http://" (or a complete URL)? Or perhaps do you mean to do something like   if ( $parent =~ /^http:/ or $parent eq $key ) {
    The consequence of a false positive on your test is more entries, and thus more runtime.

      Yep, this part is correct, I'm pretty sure. What happens during find_parent() is that we strip off one path element and a '/'. So, a URL of becomes This is the Right Thing for the database key to match the Squid key for the same object (a requirement for this program).

      So when we get down to the 'root' element, of, this is the absolute last thing we can have an entry for. One more find_parent will strip the /, leaving http:/, which is what I'm checking for.

      Right? I can't skip out on every entry that contains 'http:' because nothing will ever make it to the db. They /all/ have http: as the opener (again, this is so our MD5 hash of the object matches the Squid MD5 of the object--a requirement).

Re: Performance quandary
by mattr (Curate) on Feb 24, 2002 at 11:35 UTC
    Some obvious answers you probably covered first:
    - get more user time from cpu
    - get a real db and maybe offload to another machine
    - concentrate on raw writes within a guaranteed time rather than doing processing while writing, etc.
    But it seemed like you must have something pathological happening if a 200/sec routine drops below 10/sec.

    I'm curious, what happens when you don't use a DB at all and just use the current algorithm with an ordinary in memory hash?

    The first thing to check is that you are not using 5 year old software. You should build DB_File against Berkeley version 4 or, better yet use the BerkeleyDB module. You still seem not to get all the functionality of Berkeley DB 4 as provided by Sleepycat. That said, see below some other things you could do including using a database like Mysql or PostgreSql which also provide some neat tricks.

    There are new tables/functions in Mysql that might help. I wouldn't have said Mysql except that this was brought to my attention when I had the great luck of having drinks with their David Axmark in Tokyo (Open Source Developers Conference at VA) two weeks ago. Anyway, it sounds like your extra processing and maybe some bugs in old BDB may push N log N so much closer to an exponential curve that it becomes unfeasible for million object tables at your current CPU level.

    - you don't need to do locking because BDB will do page-level, InnoDB row-level locking for you.
    - IN MEMORY table spec (maybe not such a big deal)
    - use placeholders, prepare once exec many times
    - batch adds (batch selects/inserts) via API, or as text file import from a file or pipe
    - create a number of writer processes with a preforked server. This way you can add more than one at a time, and row locking could block as necessary.
    - or, save up the adds that need to recurse and do them together. don't really know what you want.
    - pretune table to data structure/size. BDB makes extra rows for you but needs to know size first. So make it give you a lot more extras.
    - multiple indices. In particular, InnoDB maintains an insert buffer that is said to speed up similar databases 15 times. This reduces seeks created by the random distribution of secondary keys. Your current algorithm may actually be doing a seek every single time whereas a real DB caches primary index significantly.
    - consider a table type like MyISAM that allows you to do concurrent selects and inserts, if you think this is slowing you down with Mysql.
    - Use profiling/optimizing built into Mysql or other DBs
    - You might be interested in trying an extremely cool database called Firebird (this is Interbase, bug fixed and open sourced). Actually if you use this, then Firebird 2 (now C++) will be out very soon and give you even more boost. It is very lightweight, dependable (used in aerospace for a long time), and self-tuning. Firebird/Interbase includes "Declarative Referential Integrity Constraints" which means it will synchronize parent-child relationships between tables. Download manual from Borland.
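The placeholder point above is worth a concrete sketch. With DBI the statement is parsed and planned once by prepare(), and each execute() just binds new values; table and column names here are invented, and DBD::SQLite (which tilly also suggested) stands in for whichever driver you'd pick:

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 0 });

$dbh->do('CREATE TABLE entry (md5 TEXT PRIMARY KEY, url TEXT)');

# Prepared once; each execute() only binds new values to the ?s.
my $sth = $dbh->prepare('INSERT INTO entry (md5, url) VALUES (?, ?)');
$sth->execute(@$_) for ( ['ABCD', 'http://a/'], ['EFGH', 'http://e/'] );
$dbh->commit;   # AutoCommit off lets many inserts share one transaction

my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM entry');
print "$count\n";   # prints 2
```

Batching many inserts inside one transaction, as the commit placement shows, is usually a bigger win than the placeholders themselves.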

    Other things that might help
    - DB will keep a log for you anyway. If you really want log, make it a DB table maybe. Shouldn't be taking so much time. Are you autoflushing to disk?
    - move processing into triggers, in Perl, Java, C, or PL/SQL
    - more complex structures, multiple tables, more space for time tradeoffs
    - Throw away all those joins and splits. No more strings, throw away all that "http" and "Inserting parent" etc. Too much Perl going on.

    Some reading materials, might help you.

    Insert speed mods - confirmed what I thought, that "The size of the table slows down the insertion of indexes by N log N (B-trees)." See tuning link.

    Hope this helps. It would be easier to understand what is happening if we could see exactly what the data looks like. Can you define what these trees are anyway? Is it just two levels, with a parent that is a domain name, plus a number (max is what?) of children? Or are you translating paths or XML info into parent-child relationships? Could this data be described as a 2-dimensional (or n-dim) table?

      That MySQL comment bothers me.

      With a properly designed BTree, the access time should scale logarithmically with the size of the table. The n*log(n) should be the overall time to build up the whole table (n operations, each of which is log(n)).

      Typical data sets range between, say, 1000 and 10**9 rows. Over that range log(n) only changes by a factor of 3, so for all intents and purposes it is just some constant. In real databases the access time is dominated by the time to seek to disk, and so (as my link in my previous post pointed out) the locality of reference of a BTREE can be a good win over hashing because it is more likely to hit pages in cache. Sure, asymptotically that log(n) is infinitely worse than any possible constant, but real life doesn't tend to reach that asymptotic nightmare. (Besides which BTREEs have a good worst-case guaranteed performance. Hashes don't.)

      On to other issues.

      Using a RAM disk does a lot to remove filesystem overhead. That should, among other things, catch most of the possible speed wins from extracting parallelism in commits (through insert buffers and other tricks). Come to think of it, most of the rest of the filesystem overhead can be removed just by naming your file undef, which tells Berkeley DB that it is dealing with an in memory database, removing the OS's filesystem abstraction layers from the picture.
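The undef-filename trick looks like this with DB_File (the BerkeleyDB module accepts the same convention); everything else about the tie is unchanged:

```perl
use strict;
use warnings;
use DB_File;
use Fcntl;

my %urldb;
# An undefined filename tells Berkeley DB to build the database in RAM:
# no filesystem in the path, and the data is gone when the process exits.
tie %urldb, 'DB_File', undef, O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot create in-memory database: $!";

$urldb{SOMEMD5} = 'http://example/ 1 0';
print $urldb{SOMEMD5}, "\n";
```

The obvious cost is durability: an in-memory database has to be rebuilt from the Squid logs after a restart.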

      You do have a good point about doing too much Perl in the process. Personally I strongly suspect that just changing to a better data structure will win the day. Alternately, if you can, using native hashes is better than pulling in Berkeley DB. But given Perl's memory overhead, I would be concerned about hitting the 32-bit addressing barrier, even if you have enough RAM in the machine to do it. (A wall we will be hearing about more and more often as time goes by...)

        it seems to me you are wanting an in-memory data structure, but you are tying the data to a database structure. if you are going to go this route, i would suggest looking into the index optimizations of BDB as tilly suggested with an 'undef' file, MySQL HEAP tables, and pure-perl structures like DBD::AnyData. A good index system will make or break your seek time, and you are seeking a lot, aren't you? i'm not sure how much overhead a connection to BDB or MySQL would add, but if it is significant a DBD::AnyData in-memory table may be less by virtue of running as a perl module and not connecting to anything at all.

        if you are doing concurrent reads/writes the locking is of importance to you as well. you probably want to avoid anything that locks more than just the relevant row. if you can guarantee integrity another way, or 100% accuracy is unnecessary, maybe you can avoid locking the table at all.

        if you aren't doing concurrent processing, since you are farming the data-storage out to BDB you might be able to get away with (A) building the trees you have to make in a child process so several can be done concurrently, and (B) splitting your tables so you can do concurrent reads/writes. this probably won't get you much, but it would allow you to avoid some locking problems if your tables do lock, and it would potentially allow you to run a separate DB and/or tree building on a second processor/machine to share the load.

        you also want to avoid anything resembling transaction logging as that inevitably is designed to trade speed for data integrity. i don't know BDB well, but in MySQL you would want to avoid InnoDB tables just for this reason.
      Thanks for your comments mattr,

      To answer your first few comments:

      - get more user time from cpu
      Not an option. We have an installed base of servers--this is a fixed resource. And this task is sharing with a CPU hungry daemon that always gets priority (Squid).

      - get a real db and maybe offload to another machine
      Another machine is not an option. Again with the fixed resources conundrum. A real db is an option, but I believe if I fix my "pathological" use of BerkeleyDB I won't need to, and would actually lose performance in the bargain. A real relational database can never be as fast as an equally tuned key:value database when used for storing key:value information--the relational layer, the SQL layer, and a bunch of other stuff ensures that.

      - concentrate on raw writes within a guaranteed time rather than doing processing while writing, etc.
      This program will already run at an extreme niceness level (the Squid gets what she wants, everybody else waits his turn). The File::Tail module is handling the 'when you've got the time, I'd like to do some work' catching up stuff. This program is doing an important job, but one that has to take a back seat--thus the reason efficiency is important, it can't fall too far behind and still provide reasonable functionality.

      But it seemed like you must have something pathological happening if a 200/sec routine drops below 10/sec.
      I agree. There is something pathological--I believe in the parent insertion routines. It's just a matter of how to fix the pathology without killing the patient.

      Next interesting thought:

      - Throw away all those joins and splits. No more strings, throw away all that "http" and "Inserting parent" etc. Too much Perl going on.
      And replace them with what? The functionality is complete at this point--but is too slow. We can't lose functionality to fix the speed, as a useless program that runs really fast is still a useless program. ;-) I am beginning to suspect that using a more complex database that can handle some of the 'think' for me, rather than doing it all myself in perl might be worth the tradeoff. If I were using a relational DB, the parent->number_of_children could be incremented by accessing that particular entry rather than pulling in and parsing the whole parent structure--behind the scenes it still has complexity, but maybe the MySQL folks know how to handle that complexity much better than me and perl do.

      As I've said in another post this morning, the problem I believe, is boiling down to the parent->child relationship. How do I keep it current, accurate, and make it go fast? The most expensive thing in most add_entry calls is probably the second add_entry call that is made to handle the parent relationship. So if I can make the parent relationship cheap, I can fix the slowdown over time. I think.

      Anyway, thanks for the links and the pointers. I'm going to stick with BerkeleyDB for the time being, and attempt to make my code a little less pathological.

      The first step for making things less pathological is to:
      Add an in-memory hash to speed exists? checks for the parent.
      Reduce the size of the parent entries, probably converting to a multiple entry system, to prevent changing the size of the parent (a number_of_children field with the kids referenced by $parentmd5:number). Hopefully this is cheaper than the current split/join fun that happens when inserting a child into a parents child list.

Re: Performance quandary
by Zaxo (Archbishop) on Feb 24, 2002 at 08:38 UTC

    I'm a little surprised that nobody has mentioned you're dug in up to the axle with recursive calls of add_entry. What's more, the way it happens suggests that your database is expected to have an expandable number of columns, since you appear to be splitting out each directory as a $key.

    As I understand it, you want to index a collection of URI's by an md5 digest. It looks like both the location and content are digested together. I don't grok exactly how much you need to digest of the paths. If you need to track moved files, perhaps you should digest files and locations separately.

    I think your scaling problems will disappear if you replace the pseudo:

    sub yours (@)
        peel_off_an_extra_arg_with_sides or return
    with a refactoring:
    use URI;
    sub ThisAddUri($uri)
            map md5Nstuff uri(shift)
        updateDB thatmap
    or maybe better, make the db part just an error-resistant wrapper, and keep the data array prep a separate call.

    You will come out way ahead if you get rid of the recursive calls, however you do it.

    After Compline,

      Actually, the recursion is not an issue at all, as far as I can see. The depth to which any recursive call will go is extremely small, except the first time a parent tree is built. After that recursion only happens once per entry (which would simply have to be another function call to a routine doing almost the exact same thing as add_entry). I was curious about the recursion as well, suspecting a bug leading to full-depth recursion on every request, but profiler output confirms that depth of calls is not an issue here. I experimented some with taking out the escape clauses, and recursion depth shows up very clearly in the dprofpp -T output--with the routine as it exists here, the average depth does not grow measurably deeper as the program progresses. So we're definitely not 'in up to the axles' in recursion--that would show up as an ever increasing depth when profiling.

      In reply to, "As I understand it, you want to index a collection of URI's by an md5 digest"...We're indexing objects that already have an md5 digest (we get it from Squid, which indexes its object store by md5). This is a requirement of the database--it can't work without being indexed by md5s that match the Squid md5 index exactly, because otherwise we wouldn't be able to remove the item from our database when Squid RELEASES it. The digest I'm generating is for a parent object, which may or may not exist in the Squid object store--but we have to have it, because we need to be able to pull a complete object tree by requesting, for example "". Our DB has to be able to find all of the child objects of

      Now I don't actually understand your pseudo-code example (what is make_updateVectors?), but I suspect you're missing the critical purpose of the recursive call in this function: We need a complete tree in the database, all the way up to the root node of this URL. We must be able to make a request of this database for an object and all of its children. However, when building the database and when adding new objects when in daemon mode, we have no guarantee that all of the parent objects will exist in the database.

      So, perhaps I'm wrong, but it looks like you've created a function to simply add a single entry (plus possibly a single parent?) to the db. Now it may be entirely true that indexing both the object and its relationship to other objects in the tree (or its child objects anyway) is not the most efficient use of the database, but I don't see that using recursion is fundamentally flawed here or that an iterative solution would make a major performance difference. The usual recursive depth is one level (for the parent entry update)...that isn't significant overhead, is it? I don't see how it could be, unless Perl has a tremendous recursion overhead, which I have never seen mentioned anywhere.

      Am I missing something from this picture that is misleading me into thinking an iterative solution would alter the runtime very little, if at all? We still have to construct the parent tree, whether we do it recursively or iteratively--the number of function calls and db writes doesn't change, one way or another. Recursion seems the right way to do it to me, since we've already got a routine that does everything we need for every object up the tree--the iterative solution would require a repeat of most of the same code in add_entry within the block that iterates (I actually started writing it that way when I started, but soon realized I needed all of the same actions for the parent entries that I need for the current entry).

Re: Performance quandary
by gav^ (Curate) on Feb 23, 2002 at 23:16 UTC
    I notice from your previous thread that logging is taking up 30% of the app's time. Is it possible to reduce the amount of logging?...


      Logging has been almost wiped from the face of the program. ;-) It is now only about 3-5% of time taken. (I was still debugging the logic of the program when the last profiles were posted). Also by the end of the run, when we're beyond half a million entries, the logging overhead gets lower and lower relative to everything else (because logging is always a fixed cost--the database management is where the problem comes in).
Re: Performance quandary
by Anonymous Monk on Feb 24, 2002 at 10:44 UTC
    I won't comment on the Berkeley DB optimisations, others have commented enough on this that I suspect you'll reach your 15/sec mark without issue.

    However it is important to note that the greatest performance advances are always acquired by looking at the whole problem, rather than any particular small part of it.

    In this case, optimising the Berkeley DB is a lot of effort expended on what is really not the problem. A Berkeley DB is designed to store data against a string-based key of unknown randomness, and unknown size.

    In your favor, you have an excellent knowledge of the operating conditions, how many entries you need etc. You also have an excellent key algorithm for free, md5 has excellent spread and can be used as a hashing entity itself, rather than forcing the Berkley db to re-hash the hash as it were.

    Thus my suggestion, if you really require performance, is not to re-write in C/C++, which will not resolve the problem, although it might get you the 15 hits/sec you're looking for. I would instead re-write my database backend, in perl, to use something like a fixed-size disk file indexed by the first n characters of the md5 (whatever suits), with start-time configurable parameters for the number of buckets per entry and overflow so that you can tweak those. Then just seek into the thing using the hash and get the data you need.
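A bare-bones illustration of that idea (record size, prefix width, and file name are all invented, and real code would need overflow buckets for prefix collisions): the slot number falls straight out of the md5's leading hex digits, so a lookup is a single seek.

```perl
use strict;
use warnings;

my $RECSIZE = 64;    # fixed record length in bytes
my $PREFIX  = 4;     # index on the first 4 hex digits -> 65536 slots

# Two md5s sharing a prefix would collide; a real version needs overflow.
sub slot { hex substr $_[0], 0, $PREFIX }

sub store_rec {
    my ($fh, $md5, $data) = @_;
    seek $fh, slot($md5) * $RECSIZE, 0 or die "seek: $!";
    print {$fh} pack "A$RECSIZE", $data;     # space-padded fixed record
}

sub fetch_rec {
    my ($fh, $md5) = @_;
    seek $fh, slot($md5) * $RECSIZE, 0 or die "seek: $!";
    read $fh, my $buf, $RECSIZE;
    $buf =~ s/ +$// if defined $buf;         # strip the pack() padding
    return $buf;
}

open my $fh, '+>', 'md5.idx' or die "open: $!";
store_rec($fh, 'C0FFEE00', 'http://example/ 1 0');
print fetch_rec($fh, 'C0FFEE00'), "\n";      # prints the stored record
```

The file is sparse between occupied slots, so "tons of memory, little time" suits it well.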

    Yes, it's more work, specialisation always is, but it will give you the fastest performance you're likely to get without buying better hardware (on that note, just getting some faster hardware is always a good option :). If you've got the time, do some reading on high-performance data structures and see what you can find. Remember to boost kernel and on-disk caches for added performance, remember to dump out lots of stats on disk/cache hits etc so you can work out whether you need to implement an intermediate in-memory application-specific cache, or use a different portion of the md5 string because it's better distributed (unlikely :)

    All these things will buy you serious performance boosts, not little fraction-of-a-second increments, at the cost of really having to understand what you are trying to achieve.

    Moving to different languages etc, except in rare circumstances where a given language is actually a big part of the optimal solution, will never replace designing to perform.

      This is an interesting thought, Anonymous.

      I'm cautious of reinventing the wheel here, because I have very little wheel building experience. The Berkeley DB ought to be able to do what I want--assuming I use it correctly, and after tilly's pointers and tips, I suspect I'm using it a bit incorrectly (or just expecting magic where there is none). At this point my plan is to rethink the way I interact with the database to reduce the size of the objects stored and avoid "exists?read->parse->modify->join->write" for every parent entry. I believe most of my problem is with the parent handling, in that each parent can potentially grow rather large and gets modified a lot.

      Using tilly's ideas (and a couple that were pointed out to me in a chatterbox session, also with tilly) I'm going to use a database structure something like this:

      $parentmd5 == $url $exists $number_of_children

      $parentmd5:childnumber == $md5-pointer

      $childmd5 == $url $exists $number_of_children

      With this structure I will still have to update the parent entry to increment the number_of_children variable, but the entry will keep a fixed size (I suspect constant resizing is a hindrance), and an increment is cheaper than checking whether the child is already in the list and appending it. I still have to address the check to see if the child is already in the list--which is where all of my troubles come from, as I don't know how to check whether the child is already among the 'number_of_children' without polling through every child!
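
      The fixed-size scheme above can be sketched with a plain hash standing in for the database handle (add_child and all the key names here are my invention, not the actual code):

```perl
use strict;
use warnings;

my %db;   # stand-in for the tied BerkeleyDB handle

# Insert one child: write it under a numbered key, then bump the
# parent's counter. The parent row never changes size, so there is
# no grow-and-rewrite churn on popular parents.
sub add_child {
    my ($parentmd5, $parenturl, $childmd5) = @_;
    my $row = exists $db{$parentmd5} ? $db{$parentmd5}
                                     : "$parenturl 0 0";
    my ($url, $exists, $n) = split ' ', $row;
    $db{"$parentmd5:$n"} = $childmd5;                    # O(1) append
    $db{$parentmd5} = join ' ', $url, $exists, $n + 1;   # fixed size
}

add_child('PARENTMD5', 'http://example.com/a', 'CHILDMD5_1');
add_child('PARENTMD5', 'http://example.com/a', 'CHILDMD5_2');
print $db{'PARENTMD5'}, "\n";   # prints: http://example.com/a 0 2
```

      Note this sketch does nothing about the duplicate-child check; it only shows how the counter keeps each record a constant size.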

      So, it all comes down to this: My biggest quandary is how to handle the parent->child relationship efficiently....

(cLive ;-) Re: Performance quandary
by cLive ;-) (Prior) on Feb 24, 2002 at 00:03 UTC
    After crazyinsomniac's comment, the other quick thing you can do is remove logging! I assume the logging sub appends to a log file, yes?

    Once you know the system works, is it absolutely necessary?


    cLive ;-)


      I suspect many will agree with me that logging is ALWAYS absolutely necessary. You will never be able to prove that your implementation of a non-trivial system "works". Ever. You can get to 99.9%, but you will always have that uncertainty. Always. Logging is the simplest method by which you, as the developer, can have an idea of what went wrong when (not if!) it does break.

      Never assume that a system a human develops will not break. Even if the human's activity was to write the code that wrote the code ... there's still an error in there. Somewhere.

      We are the carpenters and bricklayers of the Information Age.

      Don't go borrowing trouble. For programmers, this means: worry only about what you need to implement.

        Thanks dragonchild!

        I've been getting a lot of 'get rid of logging' or 'let BerkeleyDB handle logging'. But neither of those addresses how I let a user help me debug any problem that might arise once the program is out of my control.

        It is true that logging calls were quite rampant in the first incarnation of the program (and most unnecessary calls are gone now that the program is known to work in all the common cases I can generate). But some logging is still absolutely necessary, and I'm actually quite proud of the way I've implemented logging for this trivial little program.

        I ripped off, er, reused a nice little logging class written by Jeff Rowan, found here:

        And then added a loglevel value, so that the user (who has a configuration file for runtime changes to the system, of course) can increase the loglevel if problems arise. Ordinarily, logging is quite brief, and the calls to $log->print simply return after checking against the current loglevel. Pretty nifty (the ", 3" at the end of all of the logging statements is the loglevel for this entry--so it is a debug level log rather than a user level log).

Re: Performance quandary
by Xanatax (Scribe) on Feb 25, 2002 at 01:26 UTC
    From your previous question I got:

    Each object entry requires at least one FETCH (to see if the object already exists), and one STORE...

    My question is: is this necessary? I don't know what you are trying to accomplish, but if memory is not an issue, and keeping up with Squid is, then you want as fast a process as possible writing to your DB as Squid runs. I haven't noticed whether you've said anything about the time constraints on actually using this data... who or what is querying the data structure after you build it? Do they care if it takes longer to query? Do they care if the order of complexity for a query is increased?

    If not, then we can fix you up in no time... just don't do the FETCH op. STORE everything, and when you go to read the data: (A) query all the lines you need, (B) delete them, (C) build your tree and give it to the user, and (D) re-insert the tree while the user is playing with it.

    Also, you could potentially change the data structure... if you can generate an MD5 in constant time, and memory is not a problem, you could allocate an array the length of the space of possible checksums, and then read and write to it in constant time. More realistically, if memory is not an issue, you could play with very large arrays indexed on the first N chars of the checksum...

    The point is, instead of building something that scales up well, you could build something that doesn't scale at all and is simply always bigger than your dataset. That will always win for speed.
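
    A sketch of the "store flat, build the tree at read time" idea: the writer only stores flat rows; the reader derives every parent->child link in one pass. The URL list is illustrative, and the regex only approximates the thread's find_parent():

```perl
use strict;
use warnings;

# Writer side: db_put() flat rows, no FETCH, no parent bookkeeping.
# Reader side (below): rebuild all the links in one linear pass.
my @all_urls = qw(
    http://example.com/a/b.html
    http://example.com/a/c.html
    http://example.com/a
);

my %children;   # parent URL -> list of child URLs
for my $url (@all_urls) {
    (my $parent = $url) =~ s{/[^/]*$}{};         # chop the last segment
    next if $parent eq 'http:/' or $parent eq $url;
    push @{ $children{$parent} }, $url;
}

print scalar @{ $children{'http://example.com/a'} }, "\n";   # prints: 2
```
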
      Interesting thoughts Xanatax,

      From an academic standpoint (since I'm writing this program to learn as much as to solve a problem, that definitely makes the academic valuable to me), I like the idea of playing with different data structures--even hugely different, such as an array in memory, indexed by part of the MD5. There are at least two problems I can see with this approach (I'm not wholly illiterate on data structures and indexing, and other fun things--I am a Squid nerd, after all), and I'd welcome thoughts or pointers to docs on solving them:

      We can't have 32 bytes' worth of array entries, so we'd index on part of the MD5. Even so, to guarantee uniqueness we need at least a few bytes--which leads to an extremely large but very sparsely populated array. Not ideal, even if memory is not an issue; I think memory would become an issue if we're using an array /that/ big. So how to handle sparse data structures of this sort?

      Persistence and concurrency. We have to be able to read the data from another process (the indexing itself is not the end, it is merely the means). So how to dump this out, in near realtime, in a form that is quickly accessible to another process? We don't need to modify the data from the other process, just read it.

      As for the no-FETCH, STORE everything proposal...I don't get it. I think it is missing the vital parent->child relationship component, if I understand you. I have considered deleting the parent each time and rewriting with the new child, but I suspect that is an optimization that the BDB handles already (and probably better than I can in Perl--since it is probably smart enough to know when it can insert into the existing 4kb space and when it will have to expand the space).

        Well, if you are working on a 32-bit machine and you read 24 bits' worth of the MD5 hash as your index to the array, I get (32 * 2^24) bits of array, unpopulated, or 64MB. Assuming you are reading the MD5 as an 8-bit ASCII string, you get three chars' worth of it, which is a bit sad; but most of those 8 bits are redundant for an MD5, so if you are lucky you should be able to compress each char to 6 bits via substitution, which gives you a fourth char in the same space. If you can't find four chars that vary enough between MD5s (the last four chars?), can you double-hash the keys for more variety? Squid gives you the MD5; does your end program search by MD5? If so, you can still store the data by another key--you just have to store the original MD5 with the rest of the data to match on.

        Anyhow, if you run your data structure as an array of lists, with 16 million places in the array and 1-3 million actual items to index, you are below any load threshold, so seeks will require on average only one or two compare operations. So long as you can keep the data spread evenly across the array, it should be well worth the memory used. The fact that it would be sparsely populated is of course the goal of what I was suggesting: you get virtually constant seek time from zero to 3 million data points.

        The total memory cost for this is 64MB for the array itself (again, assuming 32-bit pointers), plus the 100-200MB data set you are already dealing with.
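
        A sketch of the array-of-lists structure: slots indexed by a prefix of the MD5, each slot a short chain of [md5, data] pairs so colliding keys chain instead of clobbering each other. I've scaled it down to 16 bits of the MD5 (65,536 slots) to keep the example's memory modest; the proposal above uses 24 bits:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Chained hash table: an array of small lists, indexed by the first
# 4 hex chars (16 bits) of the MD5. All names are illustrative.
my @slots;

sub slot_store {
    my ($md5, $data) = @_;
    push @{ $slots[ hex(substr($md5, 0, 4)) ] }, [ $md5, $data ];
}

sub slot_fetch {
    my ($md5) = @_;
    for my $pair (@{ $slots[ hex(substr($md5, 0, 4)) ] || [] }) {
        return $pair->[1] if $pair->[0] eq $md5;   # one or two compares
    }
    return undef;
}

my $key = uc md5_hex('GET', 'http://example.com/');
slot_store($key, 'http://example.com/ 1 0');
print slot_fetch($key), "\n";   # prints: http://example.com/ 1 0
```
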

        On the FETCH/STORE issue, what I meant was: don't build the tree in this program, just feed the data into tables, and build the trees in the program that reads the tables. Then you aren't doing a search for parents/children, a store, and then another search for the trees; you are just storing the data unprocessed, doing one search for everything, and building the tree then. If your end-user program can be made to build the trees itself, you can cut out half the searching...
Re: Performance quandary
by mattr (Curate) on Feb 26, 2002 at 07:20 UTC
    Yes (tilly), I see. I read at Sleepycat that B-tree wins over hash, and then, if/when the prescaled B-tree size is exceeded, hash wins again--but that sounds like just doing it wrong. I bow to your experience.

    My gut instinct was to leave all the seeking and logic to a trigger in the DB, but say we just use Perl--how do we get more detailed profiling info to find the culprit? We don't have info about these objects, but how about the general case...

    I thought about the cache digest bitmap in the Sourceforge Squid docs, which uses 1 bit per object. This led me to imagine a sequential memory map for 2 million objects, each of which contains the address of its parent and some other data. But I realized the locality of a B-tree must beat this...

    Any ideas as to performance if you

    1) maintained a memory map via an Inline::C function, holding the address of the parent, an access counter, and a couple of extra bytes--say 16MB for the map plus an 8MB pool of free addresses (or is this moot due to the locality win in the B-tree)?
    2) skipped the memory map, but also
    a) maintained a medium-size B-tree of popular object MD5s, found because a counter in each object is incremented on seek. Objects are added to the populars tree when they exceed a pretuned threshold; hopefully this small tree gets searched very frequently instead of the big one.
    b) maintained a small hash to cache the last X objects and cut churn.
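
    A sketch of (b): a bounded cache of the last MAX objects seen. This is a simple FIFO rather than a true LRU (a real LRU would re-order on every hit); the names and sizes are illustrative:

```perl
use strict;
use warnings;

use constant MAX => 3;     # tune to available memory
my (%cache, @order);       # the cache plus insertion order

sub cache_put {
    my ($key, $val) = @_;
    push @order, $key unless exists $cache{$key};
    $cache{$key} = $val;
    delete $cache{ shift @order } while @order > MAX;   # evict oldest
}

cache_put("md5_$_", "object $_") for 1 .. 4;
print join(' ', sort keys %cache), "\n";   # prints: md5_2 md5_3 md5_4
```
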

Re: Performance quandary
by dws (Chancellor) on Feb 26, 2002 at 08:31 UTC
    Here's a thought, based on incomplete evidence (and guessing at code you aren't showing (and based on having read some DBM source a long time ago))...

    When dealing with a hash tied to a DBM,

        exists $hash{$key}

    does disk I/O. The larger the underlying database grows, the greater the number of page reads needed to check that the key exists; getting the corresponding value requires additional page reads.

    Since a single script is creating the file, and you don't need to worry about concurrent access while the file is being created, it might make sense to short-circuit the existence testing by adding an in-memory hash of valid keys.
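
    A sketch of that short-circuit: every key ever written is mirrored in a plain in-memory hash, so the hot "does this entry exist?" test never touches the disk (here %ondisk stands in for the tied DBM file, and the names are illustrative):

```perl
use strict;
use warnings;

my (%ondisk, %seen);   # %ondisk stands in for the tied DBM file

sub put_entry {
    my ($key, $val) = @_;
    $ondisk{$key} = $val;   # the only "disk" operation
    $seen{$key}   = 1;      # cheap in-memory shadow
}

sub entry_exists { return exists $seen{ $_[0] } }   # zero page reads

put_entry('ABC123', 'http://example.com/ 1 0');
print entry_exists('ABC123') ? "hit\n" : "miss\n";   # prints: hit
```
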

    You might also consider using md5_base64(), which returns a 22-byte string instead of the 32 bytes returned by md5_hex(). Given the number of records you're dealing with, that'll save you space and time.
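
    The size difference is easy to check: the digest is the same 128 bits either way; hex spends 4 bits per character while base64 spends 6 and drops the trailing '==' padding:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex md5_base64);

# Two encodings of the same 128-bit digest.
my $url = 'http://example.com/';
printf "hex: %d chars, base64: %d chars\n",
    length md5_hex($url), length md5_base64($url);
# prints: hex: 32 chars, base64: 22 chars
```
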

      Thanks for your comments dws,

      I suspect you're probably right about an in-memory hash to speed lookups of existing keys. I will try this, in addition to the database structure changes suggested by tilly. I'm still spending some think time on how exactly to structure the db to minimize lookups while still keeping each object tiny and very simple to parse--everything I come up with seems to lead to more seeks/fetches than I already have, but the objects do get smaller. An in-memory hash of existing keys would probably remove a lot of extraneous seeks.

      As for the MD5, you haven't read all of my posts! The MD5 we generate is a key requirement of the program, not an option or just another way to generate a unique key: we have to match the Squid key for a given object, and there is no way around that one. Besides, generating the MD5 is a minuscule part of the CPU time being used. Since I haven't posted any profiles lately, I'll do so here (as good a place as any):

      Total Elapsed Time = 36923.77 Seconds
        User+System Time = 35132.22 Seconds

      Exclusive Times
      %Time ExclSec CumulS #Calls sec/call Csec/c Name
       40.2   14127 47435. 543237   0.0260 0.0873 main::add_entry
       23.1   8134. 8133.2 543237   0.0150 0.0150 BerkeleyDB::Common::db_put
       21.2   7461. 7460.7 277763   0.0269 0.0269 BerkeleyDB::Common::db_get
       14.5   5121. 5123.0 109679   0.0047 0.0047 File::QuickLog::print
       0.56   198.3 35089. 265678   0.0007 0.1321 main::process_file
       0.18   62.28 105480  10282   0.0061 10.258 main::recurse_dir
       0.11   37.37 36.272 279524   0.0001 0.0001 main::find_parent
       0.08   27.52 25.366 543427   0.0001 0.0000 Digest::MD5::md5_hex
       0.01   2.889  2.849  10302   0.0003 0.0003 File::QuickLog::_datetime
       0.01   2.169  2.129  10300   0.0002 0.0002 IO::File::open
       0.00   0.180  0.817      5   0.0360 0.1634 main::BEGIN
       0.00   0.150  0.150      1   0.1500 0.1500 main::get_cache_dirs
       0.00   0.150  0.478      1   0.1498 0.4783 IO::import
       0.00   0.100  0.100     64   0.0016 0.0016 Exporter::import
       0.00   0.060  0.100      6   0.0100 0.0166 IO::Socket::BEGIN
      This is a full build of a quarter-million-object cache (the real world will have 1-3 million objects, but on a much faster machine with more memory). MD5 generation is 25 seconds out of 36923.77 seconds...I'm not worried about MD5 time. ;-)

Re: Performance quandary
by SwellJoe (Scribe) on Mar 04, 2002 at 18:28 UTC
    Following the advice of the ever-helpful (and ever-correct) tilly, I've converted my indexing program to use DBD::SQLite (via DBI) with excellent results!

    Because SQLite is a fully relational database, I was able to strip out all of my relationship-handling Perl, which had proven quite pathological with large databases of related objects. Now the client application will have to be rewritten, and much of the relationship think time moves to the read side of the equation, but I'm hopeful it won't be a significant issue (client reads can take several seconds to poll through a full hierarchy without becoming a major problem).

    So, I believe my performance problems have been solved by exporting all relationship handling to a relational database (imagine that!). Thanks to all who offered suggestions, especially those who pointed out the potentially pathological consequences of handling relationship data the way the original code did. The upside to all of my problems is that the rest of the code is extremely streamlined now, and runs considerably more efficiently than it would have had I gotten the database problem solved right the first time.
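
    A sketch of what the write path might look like after such a move (the schema, column names, and keys are my illustration, not the actual code, and it assumes DBD::SQLite is installed): adding a child becomes a single INSERT, with no read-modify-write of a growing parent record:

```perl
use strict;
use warnings;
use DBI;   # assumes DBD::SQLite is installed

# Each object is one row; the parent link is just a column.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1 });
$dbh->do(q{
    CREATE TABLE objects (
        md5         TEXT PRIMARY KEY,
        url         TEXT,
        exists_flag INTEGER,
        parent_md5  TEXT
    )
});
my $ins = $dbh->prepare(
    'INSERT OR REPLACE INTO objects VALUES (?, ?, ?, ?)');
$ins->execute('CHILDMD5_1', 'http://example.com/a/b.html', 1, 'PARENTMD5');
$ins->execute('CHILDMD5_2', 'http://example.com/a/c.html', 1, 'PARENTMD5');

# The read side rebuilds a subtree with one query instead of a walk.
my $kids = $dbh->selectcol_arrayref(
    'SELECT md5 FROM objects WHERE parent_md5 = ? ORDER BY md5',
    undef, 'PARENTMD5');
print "@$kids\n";   # prints: CHILDMD5_1 CHILDMD5_2
```
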
