Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Resorting to Sorting

by japhy (Canon)
on Nov 30, 2001 at 23:48 UTC ( [id://128722]=perltutorial: print w/replies, xml ) Need Help??


NAME

Resorting to Sorting


SYNOPSIS

A guide to using Perl's sort() function to sort data in numerous ways. Topics covered include the Orcish maneuver, the Schwartzian Transform, the Guttman-Rosler Transform, radix sort, and sort-by-index.


DESCRIPTION

Sorting data is a common procedure in programming -- there are efficient and inefficient ways to do this. Luckily, in Perl, the sort() function does the dirty work; Perl's sorting is handled internally by a combination of merge-sort and quick-sort. However, sorting is done, by default, on strings. In order to change the way this is done, you can supply a piece of code to the sort() function that describes the machinations to take place.

We'll examine all differents sorts of sorts; some have been named after programmers you may have heard of, and some have more descriptive names.


CONTENT

Table of Contents

  1. Naïve Sorting
    Poor practices that cause Perl to do a lot more work than necessary.

  2. The Orcish Maneuver
    Joseph Hall's implementation of "memoization" in sorting.

  3. Radix Sort
    A multiple-pass method of sorting; the time it takes to run is linearly proportional to the size of the largest element.

  4. Sorting by Index
    When multiple arrays must be sorted in parallel, save yourself trouble and sort the indices.

  5. Schwartzian Transforms
    Wrapping a sort() in between two map()s -- one to set up a data structure, and the other to extract the original information -- is a nifty way of sorting data quickly, when expensive function calls need to be kept to a minimum.

  6. Guttman-Rosler Transforms
    It's far simpler to let sort() sort as it will, and to format your data as something meaningful to the string comparisons sort() makes.

  7. Portability
    By giving sorting functions a prototype, you can make sure they work from anywhere!

Naïve Sorting

Ordinarily, it's not a difficult task to sort things. You merely pass the list to sort(), and out comes a sorted list. Perl defaults to using a string comparison, offered by the cmp operator. This operator compares two scalars in ASCIIbetical order -- that means "1" comes before "A", which comes before "^", which comes before "a". For a detailed list of the order, see your nearest ascii(1) man page.

To sort numerically, you need to supply sort() that uses the numerical comparison operator (dubbed the "spaceship" operator), <=>:

@sorted = sort { $a <=> $b } @numbers; # ascending order @sorted = sort { $b <=> $a } @numbers; # descending order

There are two special variables used in sorting -- $a and $b. These represent the two elements being compared at the moment. The sorting routine can take a block (or a function name) to use in deciding which order the list is to be sorted in. The block or function should return -1 if $a is to come before $b, 0 if they are the same (or, more correctly, if their position in the sorted list could be the same), and 1 if $a is to come after $b.

Sorting, by default, is like:

@sorted = sort { $a cmp $b } @unsorted;

That is, ascending ASCIIbetical sorting. You can leave out the block in that case:

@sorted = sort @unsorted;

Now, if we had a list of strings, and we wanted to sort them, in a case-insensitive manner. That means, we want to treat the strings as if they were all lower-case or upper-case. We could do something like:

@sorted = sort { lc($a) cmp lc($b) } @unsorted; # or @sorted = sort { uc($a) cmp uc($b) } @unsorted;
Note: There is a difference between these two sortings. There are some punctuation characters that come after upper-case letters and before lower-case characters. Thus, strings that start with such characters would be placed differently in the sorted list, depending on whether we use lc() or uc().

Now, this method of sorting is fine for small lists, but the lc() (or uc()) function is called twice for each comparison. This might not seem bad, but think about the consequences of performing massive calculations on your data:

sub age_or_name { my ($name_a, $age_a) = split /_/ => $a; my ($name_b, $age_b) = split /_/ => $b; return ($age_a <=> $age_b or $name_a cmp $name_b); } @people = qw( Jeff_19 Jon_14 Ray_18 Tim_14 Joan_20 Greg_19 ); @sorted = sort age_or_name @people; # @sorted is now # qw( Jon_14 Tim_14 Ray_18 Greg_19 Jeff_19 Joan_20 )

This gets to be tedious. There's obviously too much work being done. We should only have to split the strings once.

Exercises

  1. Create a sorting subroutine to sort by the length of a string, or, if needed, by its first five characters.
    @sorted = sort { ... } @strings;

  2. Sort the following data structure by the value of the key specified by the "cmp" key:
    @nodes = ( { id => 17, size => 300, keys => 2, cmp => 'keys' }, { id => 14, size => 104, keys => 9, cmp => 'size' }, { id => 31, size => 2045, keys => 43, cmp => 'keys' }, { id => 28, size => 6, keys => 0, cmp => 'id' }, );

The Orcish Maneuver

This method of speeding up sorting comparisons was named by Joseph Hall. It uses a hash to cache values of the complex calculations you need to make:

{ my %cache; # cache hash is only seen by this function sub age_or_name { my $data_a = ($cache{$a} ||= [ split /_/ => $a ]); my $data_b = ($cache{$b} ||= [ split /_/ => $b ]); return ( $data_a->[1] <=> $data_b->[1] or $data_a->[0] <=> $data_b->[0] ); } } @people = qw( Jeff_19 Jon_14 Ray_18 Tim_14 Joan_20 Greg_19 ); @sorted = sort age_or_name @people;

This procedure here uses a hash of array references to store the name and age for each person. The first time a string is used in the sorting subroutine, it doesn't have an entry in the %cache hash, so the || part is used.

That is where this gets its name -- it is the OR-cache manuever, which can be lovingly pronounced "orcish".

The main structure of Orcish sorting is:

{ my %cache; sub function { my $data_a = ($cache{$a} ||= mangle($a)); my $data_b = ($cache{$b} ||= mangle($b)); # compare as needed } }

where mangle() is some function that does the necessary calculations on the data.

Exercises

  1. Why should you make the caching hash viewable only by the sorting function? And how is this accomplished?

  2. Use the Orcish Manuever to sort a list of strings in the same way as described in the first exercise from "Naïve Sorting".

Radix Sort

If you have a set of strings of constant width (or that can easily be made in constant width), you can employ radix sort. This method gets around calling Perl's sort() function altogether.

The concept of radix sort is as follows. Assume you have N strings of k characters in length, and each character can have one of x values (for ASCII, x is 256). We then create x "buckets", and each bucket can hold at most N strings.

Here is a sample list of data for N = 7, k = 4, and x = 256: john, bear, roar, boat, vain, vane, zany.

We then proceed to place each string into the bucket corresponding to the ASCII value of its rightmost character. If we were then to print the contents of the buckets after this first placement, our sample list would look like: vane, john vain bear roar boat zany.

Then, we use the character immediately to the left of the one just used, and put the strings in the buckets accordingly. This is done in the order in which they are found in the buckets. The new list is: bear, roar, boat, john, vain, vane, zany.

On the next round, the list becomes: vain, vane, zany, bear, roar, boat, john.

On the final round, the list is: bear, boat, john, roar, vain, vane, zany.

This amount of time this sorting takes is constant, and easily calculated. If we assume that all the data is the same length, then we take N strings, and multiply that by k characters. The algorithm also uses some extra space for storing the strings -- it needs an extra Nk bytes. If the data needs to be padded, there is some extra time involved (if a character is undefined, it is set as a NUL ("\0")).

Here is a radix implementation. It returns the list it is given in ASCIIbetical order, like sort @list would.

# sorts in-place (meaning @list gets changed) # set $unknown to true to indicate variable length radix_sort(\@list, $unknown);
sub radix_sort { my ($data, $k) = @_; $k = !!$k; # turn any true value into 1 if ($k) { $k < length and $k = length for @$data } else { $k = length $data->[0] } while ($k--) { my @buckets; for (@$data) { my $c = substr $_, $k, 1; # get char $c = "\0" if not defined $c; push @{ $buckets[ord($c)] }, $_; } @$data = map @$_, @buckets; # expand array refs } }

You'll notice the first argument to this function is an array reference. By doing this, we save copying a potentially large list, thus taking up less space, and running faster. If, for beauty reasons, you'd prefer not to backslash your array, you could use prototypes:

sub radix_sort (\@;$); radix_sort @list, $unknown; sub radix_sort (\@;$) { # ... }

You could combine the declaration and the definition of the function, but the prototype must be seen before the function call.

Exercises

  1. Why does radix sort start with the right-most character in a string?

  2. Does the order of the elements in the input list effect the run-time of this sorting algorithm? What happens if the elements are already sorted? Or in the reverse sorted order?

Sorting by Index

Given the choice between sorting three lists and sorting one list, you'd choose sorting one list, right? Good. This, then, is the strategy employed when you sort by index. If you have three arrays that hold different information, yet for a given index, the elements are all related -- we say these arrays hold data in parallel -- then it seems far too much work to sort all three arrays.

@names = qw( Jeff Jon Ray Tim Joan Greg ); @ages = qw( 19 14 18 14 20 19 ); @gender = qw( m m m m f m );

Here, all the data at index 3 ("Tim", 14, "m") is related, as it is for any other index. Now, if we wanted to sort these arrays so that this relationship stood, but the lists were sorted in terms of age and then by name, then we would like our data to look like:

@names = qw( Jon Tim Ray Greg Jeff Joan ); @ages = qw( 14 14 18 19 19 20 ); @gender = qw( m m m m m f );

But to actually sort these lists requires 3 times the effort. Instead, we will sort the indices of the arrays (from 0 to 5). This is the function we will use:

sub age_or_name { return ( $ages[$a] <=> $ages[$b] or $names[$a] cmp $names[$b] ) }

And here it is in action:

@idx = sort age_or_name 0 .. $#ages; print "@ages\n"; # 19 14 18 14 20 19 print "@idx\n"; # 1 3 2 5 0 4 print "@ages[@idx]\n"; # 14 14 18 19 19 20

As you can see, the array isn't touched, but the indices are given in such an order that fetching the elements of the array in that order yields sorted data.

Note: the $#ages variable is related to the @ages array -- it holds the highest index used in the array, so for an array of 6 elements, $#array is 5.

Schwartzian Transforms

A common (and rather popular) idiom in Perl programming is the Schwartzian Transform, an approach which is like "you set 'em up, I'll knock 'em down!" It uses the map() function to transform the incoming data into a list of simple data structures. This way, the machinations done to the data set are only done once (as in the Orcish Manuever).

The general appearance of the transform is like so:

@sorted = map { get_original_data($_) } sort { ... } map { transform_data($_) } @original;

They are to be read in reverse order, since the first thing done is the map() that transforms the data, then the sorting, and then the map() to get the original data back.

Let's say you had lines of a password file that were formatted as:

username:password:shell:name:dir

and you wanted to sort first by shell, then by name, and then by username. A Schwartzian Transform could be used like this:

@sorted = map { $_->[0] } sort { $a->[3] cmp $b->[3] or $a->[4] cmp $b->[4] or $a->[1] cmp $b->[1] } map { [ $_, split /:/ ] } @entries;

We'll break this down into the individual parts.

Step 1. Transform your data.
We create a list of array references; each reference holds the original record, and then each of the fields (as separated by colons).
@transformed = map { [ $_, split /:/ ] } @entries;

That could be written in a for-loop, but understanding map() is a powerful tool in Perl.

for (@entries) { push @transformed, [ $_, split /:/ ]; }

Step 2. Sort your data.
Now, we sort on the needed fields. Since the first element of our references is the original string, the username is element 1, the name is element 4, and the shell is element 3.
@transformed = sort { $a->[3] cmp $b->[3] or $a->[4] cmp $b->[4] or $a->[1] cmp $b->[1] } @transformed;

Step 3. Restore your original data.
Finally, get the original data back from the structure:
@sorted = map { $_->[0] } @transformed;

And that's all there is to it. It may look like a daunting structure, but it is really just three Perl statements strung together.

Guttman-Rosler Transforms

Perl's regular sorting is very fast. It's optimized. So it'd be nice to be able to use it whenever possible. That is the foundation of the Guttman-Rosler Transform, called the GRT, for short.

The frame of a GRT is:

@sorted = map { restore($_) } sort map { normalize($_) } @original;

An interesting application of the GRT is to sort strings in a case-insensitive manner. First, we have to find the longest run of NULs in all the strings (for a reason you'll soon see).

my $nulls = 0; # find length of longest run of NULs for (@original) { for (/(\0+)/g) { $nulls = length($1) if length($1) > $nulls; } }
$NUL = "\0" x ++$nulls;

Now, we have a string of nulls, whose length is one greater than the largest run of nulls in the strings. This will allow us to safely separate the lower-case version of the strings from the original strings:

# "\L...\E" is like lc(...) @normalized = map { "\L$_\E$NUL$_" } @original;

Now, we can just send this to sort.

@sorted = sort @normalized;

And then to get back the data we had before, we split on the nulls:

@sorted = map { (split /$NUL/)[1] } @original;

Putting it all together, we have:

# implement our for loop from above # as a function $NUL = get_nulls(\@original); @sorted = map { (split /$NUL/)[1] } sort map { "\L$_\E$NUL$_" } @original;

The reason we use the NUL character is because it has an ASCII value of 0, so it's always less than or equal to any other character. Another way to approach this is to pad the string with nulls so they all become the same length:

# see Exercise 1 for this function $maxlen = maxlen(\@original);
@sorted = map { substr($_, $maxlen) } sort map { lc($_) . ("\0" x ($maxlen - length)) . $_ } @original;

Common functions used in a GRT are pack(), unpack(), and substr(). The goal of a GRT is to make your data presentable as a string that will work in a regular comparison.

Exercises

  1. Write the maxlen() function for the previous chunk of code.

Portability

You can make a function to be used by sort() to avoid writing potentially messy sorting code inline. For example, our Schwartzian Transform:

@sorted = { $a->[3] cmp $b->[3] or $a->[4] cmp $b->[4] or $a->[1] cmp $b->[1] }

However, if you want to declare that function in one package, and use it in another, you run into problems.

#!/usr/bin/perl -w
package Sorting;
sub passwd_cmp { $a->[3] cmp $b->[3] or $a->[4] cmp $b->[4] or $a->[1] cmp $b->[1] }
sub case_insensitive_cmp { lc($a) cmp lc($b) }
package main;
@strings = sort Sorting::case_insensitive_cmp qw( this Mine yours Those THESE nevER );
print "<@strings>\n";
__END__ <this Mine yours Those THESE nevER>

This code doesn't change the order of the strings. The reason is because $a and $b in the sorting subroutine belong to Sorting::, but the $a and $b that sort() is making belong to main::.

To get around this, you can give the function a prototype, and then it will be passed the two arguments.

#!/usr/bin/perl -w
package Sorting;
sub passwd_cmp ($$) { local ($a, $b) = @_; $a->[3] cmp $b->[3] or $a->[4] cmp $b->[4] or $a->[1] cmp $b->[1] }
sub case_insensitive_cmp ($$) { local ($a, $b) = @_; lc($a) cmp lc($b) }
package main;
@strings = sort Sorting::case_insensitive_cmp qw( this Mine yours Those THESE nevER );
print "<@strings>\n";
__END__ <Mine nevER THESE this Those yours>


AUTHOR

Jeff japhy Pinyan, japhy@perlmonk.org

http://japhy.perlmonk.org/

Replies are listed 'Best First'.
Re (tilly) 1: Resorting to Sorting
by tilly (Archbishop) on Dec 01, 2001 at 04:00 UTC
    Very nice. I would only add a few more.

    If you are working with a database, it is often worthwhile to move the sort out of Perl and into an SQL statement. Likewise when handling files in a Unix environment it is sometimes best to just resort to the Unix sort utility. Another external "sort this file" module to play with is File::Sort. And I have a personal fondness for using the BTREE option of DB_File to maintain a sorted data structure.

    These are all worth knowing about, but moving sorts into SQL is particularly effective for saving headaches. While sorting the output yourself may be a fun demonstration of complex Perl manipulations, the database is likely to be faster, and (particularly if you are sorting on dates!) is likely to save headaches.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perltutorial [id://128722]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others learning in the Monastery: (5)
As of 2024-10-03 14:55 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    The PerlMonks site front end has:





    Results (42 votes). Check out past polls.

    Notices?
    erzuuli‥ 🛈The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.