|Keep It Simple, Stupid|
Bitten by the worst case (or why it pays to know whats inside the black box)by demerphq (Chancellor)
|on Jun 26, 2004 at 12:49 UTC||Need Help??|
Yesterday I had an interesting problem arise. I was working on some code that was supposed to cache some costly to produce data structures that are generated from a DB query. Since these data structures are both large in terms of memory, and slow in terms of creation, caching them represented a reasonable win in terms of both. Unfortunately however due to an oversight on my behalf the caching was ineffective and I was never reusing the old data and was always regenerating the same data structures.
After a bit of review I realized that the problem was due to the way I was implementing the cache lookup. The process was essentially this: produce the union of four queries each of which returned a variable number of two field records, (the first number being a telephone prefix, and the second being an ID that represented its pricing model) then join that data together to form a string key that could be used in a hash lookup to determine if the data was already in the cache. Because of technical reasons I couldnt just do the union in a single query so the union was performed by just pushing the result sets into a array.
The problem was that for various reasons two actually equivelent data sets might be returned in slightly different orders. So because virtually every time the order was different, the cache lookup always failed and I always regenerated the data. This meant that the broken caching actually was far worse than not caching at all, as the overall memory consumption (and thus swapping, etc) was vastly higher because I was storing multiple copies of the resulting large data structures. Naturally my first thought was "dummy, jusy sort the union array before you build the lookup key". So the addition of a sort call much (exactly actually) like
was added to the code. I then reran expecting the caching to now kick in and to see a nice little speedup of the process it was involved in.
It didn't work! In fact the result was that despite the fact that the caching was now clearly happening and previously created data strcutures were being reused, the overall process was actually much slower. As in at least 10 times slower!
This perplexed me. Why on earth would working caching be slower than no caching at all? It made no sense. I spent some time reviewing the code, and adding debug print statements to see where the extra time was coming from. I soon realized that the slowdown was in the sort statement, and after boggling for a bit over it something occured to me: Each of the four queries was returned sorted by the prefix. These were then merged end to end with a simple push. The result of this was that the @union array was nearly sorted. This was especially true as one of the queries was responsible for about 80% of the records, a second was responsible for about 15% of the records, with the other 5% coming from the remaining two queries. In fact it was the makeup of the three smaller queries that resulting the cache misses as the same record could be in any one of the three depending on circumstance.
So I was sorting an almost sorted array. Which in perl 5.6 (its not clear to me if this is still true in 5.8) is a very bad plan, as that version of perl uses a quicksort implementation. (Actually I had to consult on the CB to verify this, and thankfully Zaxo and tye confirmed that this was in fact the case.) Quicksort as those with a compsci background will recall is average case O(N log2 N), but it has a nasty property that its worst case is O(N*N). And its worst case is on nearly or totally sorted data. Sorting ~8000 records under these circumstances was taking about 17-20 seconds! (And on a farily high powered machine!)
So the worst case of quicksort had just bitten me right on the bum!
After zaxo and tye supported my suspicion that the slowdown was probably due to falling into the worst case of the sort algorithm I removed the "order by" clauses from the queries, and for good measure threw in a call to shuffle as well which left me with something like the following code:
Presto the ~20 second delay disappeared and the run time of the shuffle went down lower than the resolution I was able to easily measure. (I didnt bother with hires timing). Essentially the combined shuffle/sort was neglible time compared to the sort alone. Case closed and a nice little speedup for the program. :-)
I think the moral of this story is that the black box mentality that one often encounters in certain areas of development (and IMO particularly often in the perl programming community) is a bad thing. If you dont have at least a reasonable idea of how the black box performs its tasks, and thus its strengths and weakenesses then solving this kind of problem would be very difficult. Why should two opertions (and traversals of the array) be faster than one? A lot of programmers I suspect would have just shrugged and said "well sort knows best" and left it at that. Because I knew enough about the insides of the black box that is the perl sort function I was able to fairly easily identify the cause and then decide upon what I consider to be a counter-intuitive (at face value anyway) solution to resolve the problem.
My advice to folks out there is that when you work with black boxes (databases, indexing, modules, perl functions etc) do a bit of homework. Find out how they are implemented in at least abstract high level terms. Don't just assume that the black box works and thats all you need to know. Because one day the black box won't work as it should and if you don't understand at least the rudiments of its internals you'll probably never even know that there is something wrong. I'm not saying that you shouldn't use a DB or a module or perl builtin without reading the source code (although with CPAN modules there are other good reasons to do exactly that) but you should at least understand how the item works sufficient to be able to debug its use when it does go pear shaped.
Anyway, I hope this posting motivates some to inform themselves of the principles by which their regular tools do their jobs. Simply because something is a black box doesnt mean you cant get to grips with how it works at a high level. And that knowledge is very likely to pay off in spades one day.
First they ignore you, then they laugh at you, then they fight you, then you win.