Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery

Comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
A couple weeks ago I got somewhat annoyed when several people reccomended a Schwartzian Transform for an extremely trivial sorting operation involving at most a few dozen items. The Schwartzian is really cool, and it is a technique that every Monk should know and understand. But there are times when it's simply a waste. The solution I posted was simple, straightforward, and inefficient. Also, unlike some of the untested Schwartzian examples posted, it worked. It was:

@files = sort { my ($ad) = ( $a =~ /fwlog\.(\d+)\w+/ ); my ($bd) = ( $b =~ /fwlog\.(\d+)\w+/ ); $ad <=> $bd } @files;

This code is not fast, but it's easy to write, understand, and debug. I call this property WUD. If your code has a lot of WUD you probably finished the job sooner than expected and made life easier for the next person who has to work on it. Unfortunately, WUD is usually inversely proportional to efficiency. Optimization, therefore, is the art of determining whether speed or WUD is more important in a given situation.

Before we continue, let's remind ourselves of one fact: We are talking about Perl. Perl is slow. It eats up gobs of memory has painful overhead for sometimes annoyingly trivial operations. But Perl has tons of WUD. It's loaded with WUD. Fifteen minutes hacking out a one-off Perl program can accomplish tasks that would take a C programmer months and a shell scripter days. But even with all that overhead, that naive sorting algorithm, which must stupdily re-do the regex repeatedly on the same data, takes less than a quarter of a second to sort a list of one hundred strings on my computer. Is the WUD worth it? I think so.

Here I will look at four of the most common sorting idioms used in Perl on lists of various sizes. We have a list of strings in the form /foo\d{1,3}\.tla/ and we want to sort them numerically by those digits in the middle. Perl's default lexicographic sorter won't work, because the digits are not zero-padded on the left, and the list will come out looking like:

foo1.tla foo10.tla foo100.tla foo101.tla foo102.tla etc.

For the benchmarks below I will use a list of 250 randomly generated strings of this type. Let's take a look at our four idioms:

Here is the naive sort, a variation of which we saw above:

sub naive { sort { my ( $ma ) = $a =~ /foo(\d+)\.tla/; my ( $mb ) = $b =~ /foo(\d+)\.tla/; $ma <=> $mb } @_; }

Here is the Orcish Maneuver, which is the same as the naive sort, except we cache the results of the regexes so we only have to do them once per item. (Orcish is a pun on "or-cache.")

sub orcish { my %c; sort { $c{$a} or ( $c{$a} ) = $a =~ /foo(\d+)\.tla/; $c{$b} or ( $c{$b} ) = $b =~ /foo(\d+)\.tla/; $c{$a} <=> $c{$b}; } @_; }

Here is the ubiquitous Schwartzian Transform, which constructs a list of array references containing the original datum and the chunk returned by the regex, sorts on the latter chunk, and then maps out the original datum from the resulting list.

sub schwartzian { map { $_->[0] } sort { $a->[1] <=> $b->[1] } map { [ $_, /foo(\d+)\.tla/ ] } @_; }

Finally, here is the Guttman-Rosler Transform, which is similar to the Schwartzian, except instead of array references it constructs strings containing the numbers padded on the left. The strings are then sorted using Perl's built-in default lexicographic sorter, which is very fast C and does not rely on a Perl callback.

sub guttros { map { substr( $_, 3 ) } sort map { sprintf( "%03d", /foo(\d+)\.tla/ ) . $_ } @_; }

Here is a benchmark showing the speed of these algorithms on a list of 250 randomly generated strings.

NOTE: The following benchmark results are wrong. See the Update at the bottom.

Benchmark: timing 5000 iterations of guttros, naive, orcish, schwartzian... guttros: 14 wallclock secs (14.10 usr + 0.00 sys = 14.10 CPU) @ 354.61/s (n=5000) naive: 0 wallclock secs ( 0.04 usr + 0.00 sys = 0.04 CPU) @ 125000.00/s (n=5000) orcish: 1 wallclock secs ( 0.04 usr + 0.00 sys = 0.04 CPU) @ 125000.00/s (n=5000) schwartzian: 26 wallclock secs (25.70 usr + 0.03 sys = 25.73 CPU) @ 194.33/s (n=5000)

On our relatively tiny data set, the two simplest sorts perform magnificently compared to the ST and GRT. In fact, at even five thousand iterations, naive and orcish emitted warnings that the benchmark would not be accurate. Further testing reveals that the overhead of the caching operation provides no speed benefit over the naive sort for this small data set. Keep in mind that this list is ten times as big as the one in the original thread that inspired this screed.

So how big must our data sets be in order for the extra overhead of the ST and GRT to worth the reduction in WUD? A benchmark with one thousand strings shows that naive and orcish are nearly unchanged, while the execution time for both the ST and GRT increased by more than four times. (However, the GRT is nearly twice as fast as the ST.)

Further increases in data size would be an exercise in futility. What we've discovered is that performing this simple regex is actually less costly than the significant overhead in setting up the ST and GRT.

The lesson here is that these fancy sorting algorithms are only useful when the operation to construct the sort-key (in this case, extracting the numbers with a regex) is more expensive than the overhead of the fanciness.

So how do we decide? Run benchmarks on everything? Not even necessary. Just ask yourself the following questions:

  1. How big is my dataset likely to be? Unless it's huge, a naive sort is probably the best sort.
  2. How often must this sort run? If it's a once-a-week cronjob, it probably doesn't matter if it takes ten seconds instead of one second. The extra nine seconds are a small price to pay for lots of WUD. On the other hand, if it has to run once every minute, then those nine seconds are important.
  3. Is my dataset likely to grow? If it is, maybe it's worth doing a fancier algorithm now. On the other hand, maybe it's worth keeping the simple one until the data gets big enough to require a fancy one.

In short, a bit of forethought and common-sense can save you a lot of trouble down the road.

Happy sorting.
- friedo

Update: Thanks to Pustular and ikegami for pointing out the Benchmark mistake. (I had forgotten that sort does a no-op if in void context.) After re-testing I've found that the Schwartzian becomes more efficient than the naive sort for this operation for lists bigger than 200 items, and the GRT at 120 on my machine. So I was clearly wrong about the extent of the ST and GRT overhead. My point about eliminating complexity for trivial cases still stands, though. :)

In reply to Choosing the right sort (Or: WUDsamatter?) by friedo

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and all is quiet...

    How do I use this? | Other CB clients
    Other Users?
    Others imbibing at the Monastery: (4)
    As of 2018-05-20 14:42 GMT
    Find Nodes?
      Voting Booth?