|Problems? Is your data what you think it is?|
Resorting to Sorting
A guide to using Perl's sort() function to sort data in numerous ways. Topics covered include the Orcish maneuver, the Schwartzian Transform, the Guttman-Rosler Transform, radix sort, and sort-by-index.
Sorting data is a common procedure in programming -- there are efficient and inefficient ways to do this. Luckily, in Perl, the sort() function does the dirty work; Perl's sorting is handled internally by a combination of merge-sort and quick-sort. However, sorting is done, by default, on strings. In order to change the way this is done, you can supply a piece of code to the sort() function that describes the machinations to take place.
We'll examine all differents sorts of sorts; some have been named after programmers you may have heard of, and some have more descriptive names.
Ordinarily, it's not a difficult task to sort things. You merely pass the list to sort(), and out comes a sorted list. Perl defaults to using a string comparison, offered by the cmp operator. This operator compares two scalars in ASCIIbetical order -- that means "1" comes before "A", which comes before "^", which comes before "a". For a detailed list of the order, see your nearest ascii(1) man page.
To sort numerically, you need to supply sort() that uses the numerical comparison operator (dubbed the "spaceship" operator), <=>:
There are two special variables used in sorting -- $a and $b. These represent the two elements being compared at the moment. The sorting routine can take a block (or a function name) to use in deciding which order the list is to be sorted in. The block or function should return -1 if $a is to come before $b, 0 if they are the same (or, more correctly, if their position in the sorted list could be the same), and 1 if $a is to come after $b.
Sorting, by default, is like:
That is, ascending ASCIIbetical sorting. You can leave out the block in that case:
Now, if we had a list of strings, and we wanted to sort them, in a case-insensitive manner. That means, we want to treat the strings as if they were all lower-case or upper-case. We could do something like:
Now, this method of sorting is fine for small lists, but the lc() (or uc()) function is called twice for each comparison. This might not seem bad, but think about the consequences of performing massive calculations on your data:
This gets to be tedious. There's obviously too much work being done. We should only have to split the strings once.
This method of speeding up sorting comparisons was named by Joseph Hall. It uses a hash to cache values of the complex calculations you need to make:
This procedure here uses a hash of array references to store the name and age for each person. The first time a string is used in the sorting subroutine, it doesn't have an entry in the %cache hash, so the || part is used.
That is where this gets its name -- it is the OR-cache manuever, which can be lovingly pronounced "orcish".
The main structure of Orcish sorting is:
where mangle() is some function that does the necessary calculations on the data.
If you have a set of strings of constant width (or that can easily be made in constant width), you can employ radix sort. This method gets around calling Perl's sort() function altogether.
The concept of radix sort is as follows. Assume you have N strings of k characters in length, and each character can have one of x values (for ASCII, x is 256). We then create x "buckets", and each bucket can hold at most N strings.
Here is a sample list of data for N = 7, k = 4, and x = 256: john, bear, roar, boat, vain, vane, zany.
We then proceed to place each string into the bucket corresponding to the ASCII value of its rightmost character. If we were then to print the contents of the buckets after this first placement, our sample list would look like: vane, john vain bear roar boat zany.
Then, we use the character immediately to the left of the one just used, and put the strings in the buckets accordingly. This is done in the order in which they are found in the buckets. The new list is: bear, roar, boat, john, vain, vane, zany.
On the next round, the list becomes: vain, vane, zany, bear, roar, boat, john.
On the final round, the list is: bear, boat, john, roar, vain, vane, zany.
This amount of time this sorting takes is constant, and easily calculated. If we assume that all the data is the same length, then we take N strings, and multiply that by k characters. The algorithm also uses some extra space for storing the strings -- it needs an extra Nk bytes. If the data needs to be padded, there is some extra time involved (if a character is undefined, it is set as a NUL ("\0")).
Here is a radix implementation. It returns the list it is given in ASCIIbetical order, like sort @list would.
You'll notice the first argument to this function is an array reference. By doing this, we save copying a potentially large list, thus taking up less space, and running faster. If, for beauty reasons, you'd prefer not to backslash your array, you could use prototypes:
You could combine the declaration and the definition of the function, but the prototype must be seen before the function call.
Given the choice between sorting three lists and sorting one list, you'd choose sorting one list, right? Good. This, then, is the strategy employed when you sort by index. If you have three arrays that hold different information, yet for a given index, the elements are all related -- we say these arrays hold data in parallel -- then it seems far too much work to sort all three arrays.
Here, all the data at index 3 ("Tim", 14, "m") is related, as it is for any other index. Now, if we wanted to sort these arrays so that this relationship stood, but the lists were sorted in terms of age and then by name, then we would like our data to look like:
But to actually sort these lists requires 3 times the effort. Instead, we will sort the indices of the arrays (from 0 to 5). This is the function we will use:
And here it is in action:
As you can see, the array isn't touched, but the indices are given in such an order that fetching the elements of the array in that order yields sorted data.
A common (and rather popular) idiom in Perl programming is the Schwartzian Transform, an approach which is like "you set 'em up, I'll knock 'em down!" It uses the map() function to transform the incoming data into a list of simple data structures. This way, the machinations done to the data set are only done once (as in the Orcish Manuever).
The general appearance of the transform is like so:
They are to be read in reverse order, since the first thing done is the map() that transforms the data, then the sorting, and then the map() to get the original data back.
Let's say you had lines of a password file that were formatted as:
and you wanted to sort first by shell, then by name, and then by username. A Schwartzian Transform could be used like this:
We'll break this down into the individual parts.
And that's all there is to it. It may look like a daunting structure, but it is really just three Perl statements strung together.
Perl's regular sorting is very fast. It's optimized. So it'd be nice to be able to use it whenever possible. That is the foundation of the Guttman-Rosler Transform, called the GRT, for short.
The frame of a GRT is:
An interesting application of the GRT is to sort strings in a case-insensitive manner. First, we have to find the longest run of NULs in all the strings (for a reason you'll soon see).
Now, we have a string of nulls, whose length is one greater than the largest run of nulls in the strings. This will allow us to safely separate the lower-case version of the strings from the original strings:
Now, we can just send this to sort.
And then to get back the data we had before, we split on the nulls:
Putting it all together, we have:
The reason we use the NUL character is because it has an ASCII value of 0, so it's always less than or equal to any other character. Another way to approach this is to pad the string with nulls so they all become the same length:
Common functions used in a GRT are pack(), unpack(), and substr(). The goal of a GRT is to make your data presentable as a string that will work in a regular comparison.
You can make a function to be used by sort() to avoid writing potentially messy sorting code inline. For example, our Schwartzian Transform:
However, if you want to declare that function in one package, and use it in another, you run into problems.
This code doesn't change the order of the strings. The reason is because $a and $b in the sorting subroutine belong to Sorting::, but the $a and $b that sort() is making belong to main::.
To get around this, you can give the function a prototype, and then it will be passed the two arguments.
Jeff japhy Pinyan, firstname.lastname@example.org