Perl Monk, Perl Meditation | |
PerlMonks |
Choosing the right sort (Or: WUDsamatter?)by friedo (Prior) |
on Apr 14, 2005 at 01:34 UTC ( [id://447633]=perlmeditation: print w/replies, xml ) | Need Help?? |
A couple weeks ago I got somewhat annoyed when several people reccomended a Schwartzian Transform for an extremely trivial sorting operation involving at most a few dozen items. The Schwartzian is really cool, and it is a technique that every Monk should know and understand. But there are times when it's simply a waste. The solution I posted was simple, straightforward, and inefficient. Also, unlike some of the untested Schwartzian examples posted, it worked. It was:
This code is not fast, but it's easy to write, understand, and debug. I call this property WUD. If your code has a lot of WUD you probably finished the job sooner than expected and made life easier for the next person who has to work on it. Unfortunately, WUD is usually inversely proportional to efficiency. Optimization, therefore, is the art of determining whether speed or WUD is more important in a given situation. Before we continue, let's remind ourselves of one fact: We are talking about Perl. Perl is slow. It eats up gobs of memory has painful overhead for sometimes annoyingly trivial operations. But Perl has tons of WUD. It's loaded with WUD. Fifteen minutes hacking out a one-off Perl program can accomplish tasks that would take a C programmer months and a shell scripter days. But even with all that overhead, that naive sorting algorithm, which must stupdily re-do the regex repeatedly on the same data, takes less than a quarter of a second to sort a list of one hundred strings on my computer. Is the WUD worth it? I think so. Here I will look at four of the most common sorting idioms used in Perl on lists of various sizes. We have a list of strings in the form /foo\d{1,3}\.tla/ and we want to sort them numerically by those digits in the middle. Perl's default lexicographic sorter won't work, because the digits are not zero-padded on the left, and the list will come out looking like:
For the benchmarks below I will use a list of 250 randomly generated strings of this type. Let's take a look at our four idioms: Here is the naive sort, a variation of which we saw above:
Here is the Orcish Maneuver, which is the same as the naive sort, except we cache the results of the regexes so we only have to do them once per item. (Orcish is a pun on "or-cache.")
Here is the ubiquitous Schwartzian Transform, which constructs a list of array references containing the original datum and the chunk returned by the regex, sorts on the latter chunk, and then maps out the original datum from the resulting list.
Finally, here is the Guttman-Rosler Transform, which is similar to the Schwartzian, except instead of array references it constructs strings containing the numbers padded on the left. The strings are then sorted using Perl's built-in default lexicographic sorter, which is very fast C and does not rely on a Perl callback.
Here is a benchmark showing the speed of these algorithms on a list of 250 randomly generated strings. NOTE: The following benchmark results are wrong. See the Update at the bottom.
On our relatively tiny data set, the two simplest sorts perform magnificently compared to the ST and GRT. In fact, at even five thousand iterations, naive and orcish emitted warnings that the benchmark would not be accurate. Further testing reveals that the overhead of the caching operation provides no speed benefit over the naive sort for this small data set. Keep in mind that this list is ten times as big as the one in the original thread that inspired this screed. So how big must our data sets be in order for the extra overhead of the ST and GRT to worth the reduction in WUD? A benchmark with one thousand strings shows that naive and orcish are nearly unchanged, while the execution time for both the ST and GRT increased by more than four times. (However, the GRT is nearly twice as fast as the ST.) Further increases in data size would be an exercise in futility. What we've discovered is that performing this simple regex is actually less costly than the significant overhead in setting up the ST and GRT. The lesson here is that these fancy sorting algorithms are only useful when the operation to construct the sort-key (in this case, extracting the numbers with a regex) is more expensive than the overhead of the fanciness. So how do we decide? Run benchmarks on everything? Not even necessary. Just ask yourself the following questions:
In short, a bit of forethought and common-sense can save you a lot of trouble down the road.
Happy sorting. Update: Thanks to Pustular and ikegami for pointing out the Benchmark mistake. (I had forgotten that sort does a no-op if in void context.) After re-testing I've found that the Schwartzian becomes more efficient than the naive sort for this operation for lists bigger than 200 items, and the GRT at 120 on my machine. So I was clearly wrong about the extent of the ST and GRT overhead. My point about eliminating complexity for trivial cases still stands, though. :)
Back to
Meditations
|
|