RE: Schwartzian Transform

in reply to Schwartzian Transform

I thought I would add to this snippet since I had to stare at the code for far to long before I figured out how I might use it.

Okay so we know it is for sorting, but why is it useful? It is a heck of a lot faster for sorting large data sets where the data has to be manipulated before sorting. The example I came up with is this:

We have a huge list of names, and we want to sort them alphabetically by last name, first name, then middle name. That doesnt sound too bad. But the problem is that our database does not store last names, first names or middle names seperately. Instead there is just a single name field like "Arthur C. Clarke". So when we get all the data from the database first we have to put the string in alphapbetical sort order (ie make "Arthur C. Clarke" into "Clarke Arthur C.") then sort it.

Someone without Randal Schwartz's insight (namely me) would have done something like this:

my @sorted = sort mysort @unsorted;
sub mysort
{
    $a =~ m/(.*?)\s*(\S+)$/;
    $aa = "$2 $1";
    $b =~ m/(.*?)\s*(\S+)$/;
    $bb = "$2 $1";
    return $aa cmp $bb;
}
[download]

Now this would be bad for large data sets because each element in @unsorted will get sent to mysort several time, so we are then performing the match several times where we dont really need to.

So in order to only perform the match once we go through the array and replaces each element with an array reference containing the original element (dont loose the original!) and the new sortable string. So the first step:

@unsorted = ("Larry Wall", "Arthur C. Clarke");
@a = map { m/(.*?)\s*(\S+)$/; [$_, "$2 $1" ] } @unsorted;
#
# now @a is (["Larry Wall", "Wall Larry"], 
#            ["Arthur C. Clarke", "Clarke Arthur C."])
#
[download]

Now we do a simple sort on the second element in @a: @b = sort {$a->[1] cmp $b->[1]} @a; And finally we turn the 2D array back into a sorted 1D array restoring the original values: @sorted = map {$_->[0]} @b; If you cascade the function calls, then you remove the intermediate steps of my @a, and @b and you get the Schwartzian Transform:

my @sorted = 
           map{ $_->[0] }
           sort {$a->[1] cmp $b->[1]}
           map { m/(.*?)\s*(\S+)$/; [$_, "$2 $1" ] }
           @unsorted;
[download]

Now that is pretty cool stuff, but how much of a time savings is this extra work? Well, I did a small benchmarking test to mimic what a time savings would under a large data set:

#!/usr/bin/perl
use Benchmark;

@unsorted = ("Larry Wall",
             "Jane Sally Doe",
             "John Doe",
             "Morphius",
             "Jane Alice Doe",
             "Arthur C. Clarke");
      
timethese(100000, {
    "schwartzian" => sub { my @sorted = 
                                map{ $_->[0] }
                                sort {$a->[1] cmp $b->[1]}
                                map { m/(.*?)\s*(\S+)$/; [$_, "$2 $1" 
+] }
                                @unsorted },
    "sort routine" => sub { my @sorted = sort mysort @unsorted }
});

sub mysort
{
    $a =~ m/(.*?)\s*(\S+)$/;
    $aa = "$2 $1";
    $b =~ m/(.*?)\s*(\S+)$/;
    $bb = "$2 $1";
    return $aa cmp $bb;
}
[download]

And the results:

Benchmark: timing 100000 iterations of schwartzian, sort routine...
schwartzian: 57 wallclock secs (53.66 usr +  0.30 sys = 53.96 CPU)
sort routine: 124 wallclock secs (122.48 usr +  0.41 sys = 122.89 CPU)
[download]

I hope this helps.

In Section Cool Uses for Perl