Re: Sort on Number Embedded in String

With merlyn in the offing it's presumptuous of me to write about the Schwartzian Transform, but here it goes.

As others have pointed out, you don't need the ST to handle this task, but it is a neat trick that one often finds in Perl code, so it's good to become familiar with it in any case.

The ST is just an optimization of the sort operation. It often happens that one wants to sort the elements of an array by some property of the elements other than their alphabetic or numeric order. (In your case, this property is a particular substring, or rather, the number represented by a particular substring.) Suppose that the subroutine property computes this property, and let's say, for the sake of this example, that this property happens to be a number. Then, a naive (ascending) sort would be:

my @sorted = sort { property($a) <=> property($b) } @unsorted;
[download]

This works fine, but it is inefficient, because for every element $x in @unsorted, property($x) is computed in every sort comparison involving $x. That's a lot of redundant computation.

The insight behind the ST is to perform the computation of the sorting property only once for each element of the array. Instead of sorting the elements of the original array, we sort "transformed" elements, each consisting of an original element (the "payload") and its property (the "key" or "index") packaged in an anonymous array. I.e. we go from sorting this:

($foo, $bar, $baz, ...)
[download]

with the sorting function

{ property($a) <=> property($b) }
[download]

to sorting this

([$foo, property($foo)], [$bar, property($bar)], [$baz, property($baz)
+], ... )
[download]

with the sorting function

{ $a->[1] <=> $b->[1] }
[download]

Now, for each element in the array, instead of computing its property many times, we only have to dereference the transformed element (an array ref), which is almost always a much cheaper operation.

Once the transformed array is sorted, we recover the desired sorted array by pulling out the original elements from the corresponding transformed elements.

Here is what this strategy looks like when applied to your problem:

my @st        = map  { [ $_, property($_) ] } @files;
my @sorted_st = sort { $a->[1] <=> $b->[1] }  @st;
my @sorted    = map  { $_->[0] }              @sorted_st;

sub property { ( $_[0] =~ /fwlog\.(\d+)\w+/ )[ 0 ] };
[download]

The first line generates the transformed array, whose elements each consists of a "payload" (the original element) and its sorting "property". The second line sorts the transformed array. The third line recovers the "payloads" from the sorted transformed array, to produce the desired sorted array. Actually, notice that only the definition of property is truly specific to your problem; the other lines are pretty generic, modulo sort order and whether the sort is numeric or alphabetic.

That's all there is to it, although there a couple more comments worth making about this. One is that often programmers condense this procedure significantly by "pipelining" the three lines above, and building the property subroutine right into the first call to map. I.e., the above would be condensed to

my @sorted =
  map  { $_->[0] }
  sort { $a->[1] <=> $b->[1] }
  map  { [ $_, ( $_[0] =~ /fwlog\.(\d+)\w+/ )[ 0 ] ] }
  @files;
[download]

The second point is that it often happens that the sorting needs to be done on several properties: to order two objects whose first property is the same, we order by the second property; if the second property is also the same for both, we order by the third property, etc. In such a situation, it is useful to use a property that returns an array of all the sorting properties, and then the sort line of the above would look something like this:

  sort { $a->[1] <=> $b->[1] or
         $a->[2] cmp $b->[2] or
         $b->[3] <=> $a->[3] }
[download]

In this particular example, the sorting is ascending/numerical by the first property, ascending/alphabetical by the second property, and descending/numerical by the third property.

A third point, which is just a repetition of other comments, is that it is not a foregone conclusion that the ST is going to yield an improvement in speed, or an improvement in speed noticeable enough to make the added complexity of the code worthwhile.

The Guttman-Rosler paper cited in another comment describes the ST on its way to presenting a significantly less straightforward, but more powerful, optimization, the Guttman-Rosler transform. It is nowhere near as handy or common as the ST, but if a sort is your program's bottleneck it's worth trying it out.

the lowliest monk

Update: Added "third point".

Comment on Re: Sort on Number Embedded in String Select or Download Code


Syntactic Confectionery Delight
	PerlMonks