<?xml version="1.0" encoding="windows-1252"?>
<node id="128722" title="Resorting to Sorting" created="2001-11-30 18:48:13" updated="2005-08-15 11:15:05">
<type id="956">
perltutorial</type>
<author id="1936">
japhy</author>
<data>
<keywords>
<keyword rating="">
sorting</keyword>
</keywords>
<field name="doctext">
&lt;!-- INDEX BEGIN --&gt;

&lt;UL&gt;

&lt;LI&gt;&lt;A HREF="#name"&gt;NAME&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A HREF="#synopsis"&gt;SYNOPSIS&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A HREF="#description"&gt;DESCRIPTION&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A HREF="#content"&gt;CONTENT&lt;/A&gt;&lt;/LI&gt;
&lt;UL&gt;

&lt;LI&gt;&lt;A HREF="#table of contents"&gt;Table of Contents&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A HREF="#nave sorting"&gt;Na&amp;iuml;ve Sorting&lt;/A&gt;&lt;/LI&gt;
&lt;UL&gt;

&lt;LI&gt;&lt;A HREF="#exercises"&gt;Exercises&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;

&lt;LI&gt;&lt;A HREF="#the orcish maneuver"&gt;The Orcish Maneuver&lt;/A&gt;&lt;/LI&gt;
&lt;UL&gt;

&lt;LI&gt;&lt;A HREF="#exercises"&gt;Exercises&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;

&lt;LI&gt;&lt;A HREF="#radix sort"&gt;Radix Sort&lt;/A&gt;&lt;/LI&gt;
&lt;UL&gt;

&lt;LI&gt;&lt;A HREF="#exercises"&gt;Exercises&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;

&lt;LI&gt;&lt;A HREF="#sorting by index"&gt;Sorting by Index&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A HREF="#schwartzian transforms"&gt;Schwartzian Transforms&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A HREF="#guttmanrosler transforms"&gt;Guttman-Rosler Transforms&lt;/A&gt;&lt;/LI&gt;
&lt;UL&gt;

&lt;LI&gt;&lt;A HREF="#exercises"&gt;Exercises&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;

&lt;LI&gt;&lt;A HREF="#portability"&gt;Portability&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;

&lt;LI&gt;&lt;A HREF="#author"&gt;AUTHOR&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;!-- INDEX END --&gt;

&lt;HR&gt;
&lt;P&gt;
&lt;H1&gt;&lt;A NAME="name"&gt;NAME&lt;/A&gt;&lt;/H1&gt;
&lt;P&gt;Resorting to Sorting&lt;/P&gt;
&lt;P&gt;
&lt;HR&gt;
&lt;H1&gt;&lt;A NAME="synopsis"&gt;SYNOPSIS&lt;/A&gt;&lt;/H1&gt;
&lt;P&gt;A guide to using Perl's &lt;TT&gt;sort()&lt;/TT&gt; function to sort data in numerous ways.
Topics covered include the Orcish maneuver, the Schwartzian Transform,
the Guttman-Rosler Transform, radix sort, and sort-by-index.&lt;/P&gt;
&lt;P&gt;
&lt;HR&gt;
&lt;H1&gt;&lt;A NAME="description"&gt;DESCRIPTION&lt;/A&gt;&lt;/H1&gt;
&lt;P&gt;Sorting data is a common procedure in programming -- there are efficient
and inefficient ways to do this.  Luckily, in Perl, the &lt;TT&gt;sort()&lt;/TT&gt; function
does the dirty work; Perl's sorting is handled internally by a combination
of merge-sort and quick-sort.  However, sorting is done, by default, on
strings.  In order to change the way this is done, you can supply a piece
of code to the &lt;TT&gt;sort()&lt;/TT&gt; function that describes the machinations to take
place.&lt;/P&gt;
&lt;P&gt;We'll examine all differents sorts of sorts; some have been named after
programmers you may have heard of, and some have more descriptive names.&lt;/P&gt;
&lt;P&gt;
&lt;HR&gt;
&lt;H1&gt;&lt;A NAME="content"&gt;CONTENT&lt;/A&gt;&lt;/H1&gt;
&lt;P&gt;
&lt;H2&gt;&lt;A NAME="table of contents"&gt;Table of Contents&lt;/A&gt;&lt;/H2&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;A NAME="item_Nave_Sorting"&gt;Na&amp;iuml;ve Sorting&lt;/A&gt;&lt;/STRONG&gt;&lt;BR&gt;

Poor practices that cause Perl to do a lot more work than necessary.
&lt;P&gt;&lt;/P&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;A NAME="item_The_Orcish_Maneuver"&gt;The Orcish Maneuver&lt;/A&gt;&lt;/STRONG&gt;&lt;BR&gt;

Joseph Hall's implementation of "memoization" in sorting.
&lt;P&gt;&lt;/P&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;A NAME="item_Radix_Sort"&gt;Radix Sort&lt;/A&gt;&lt;/STRONG&gt;&lt;BR&gt;

A multiple-pass method of sorting; the time it takes to run is linearly
proportional to the size of the largest element.
&lt;P&gt;&lt;/P&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;A NAME="item_Sorting_by_Index"&gt;Sorting by Index&lt;/A&gt;&lt;/STRONG&gt;&lt;BR&gt;

When multiple arrays must be sorted in parallel, save yourself trouble
and sort the indices.
&lt;P&gt;&lt;/P&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;A NAME="item_Schwartzian_Transforms"&gt;Schwartzian Transforms&lt;/A&gt;&lt;/STRONG&gt;&lt;BR&gt;

Wrapping a &lt;TT&gt;sort()&lt;/TT&gt; in between two &lt;TT&gt;map()&lt;/TT&gt;s -- one to set up a data
structure, and the other to extract the original information -- is a
nifty way of sorting data quickly, when expensive function calls need
to be kept to a minimum.
&lt;P&gt;&lt;/P&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;A NAME="item_Guttman%2DRosler_Transforms"&gt;Guttman-Rosler Transforms&lt;/A&gt;&lt;/STRONG&gt;&lt;BR&gt;

It's far simpler to let &lt;TT&gt;sort()&lt;/TT&gt; sort as it will, and to format your
data as something meaningful to the string comparisons &lt;TT&gt;sort()&lt;/TT&gt; makes.
&lt;P&gt;&lt;/P&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;A NAME="item_Portability"&gt;Portability&lt;/A&gt;&lt;/STRONG&gt;&lt;BR&gt;

By giving sorting functions a prototype, you can make sure they work from
anywhere!
&lt;P&gt;&lt;/P&gt;&lt;/OL&gt;
&lt;P&gt;
&lt;H2&gt;&lt;A NAME="nave sorting"&gt;Na&amp;iuml;ve Sorting&lt;/A&gt;&lt;/H2&gt;
&lt;P&gt;Ordinarily, it's not a difficult task to sort things.  You merely pass the list
to &lt;TT&gt;sort()&lt;/TT&gt;, and out comes a sorted list.  Perl defaults to using a string
comparison, offered by the &lt;TT&gt;cmp&lt;/TT&gt; operator.  This operator compares two scalars
in ASCIIbetical order -- that means "1" comes before "A", which comes before
"^", which comes before "a".  For a detailed list of the order, see your nearest
&lt;EM&gt;ascii&lt;/EM&gt;(1) man page.&lt;/P&gt;
&lt;P&gt;To sort numerically, you need to supply &lt;TT&gt;sort()&lt;/TT&gt; that uses the numerical
comparison operator (dubbed the "spaceship" operator), &lt;TT&gt;&lt;=&gt;&lt;/TT&gt;:&lt;/P&gt;
&lt;C&gt;
  @sorted = sort { $a &lt;=&gt; $b } @numbers;  # ascending order
  @sorted = sort { $b &lt;=&gt; $a } @numbers;  # descending order&lt;/C&gt;
&lt;P&gt;There are two special variables used in sorting -- &lt;TT&gt;$a&lt;/TT&gt; and &lt;TT&gt;$b&lt;/TT&gt;.  These
represent the two elements being compared at the moment.  The sorting routine
can take a block (or a function name) to use in deciding which order the list
is to be sorted in.  The block or function should return -1 if &lt;TT&gt;$a&lt;/TT&gt; is to come
before &lt;TT&gt;$b&lt;/TT&gt;, 0 if they are the same (or, more correctly, if their position in
the sorted list could be the same), and 1 if &lt;TT&gt;$a&lt;/TT&gt; is to come after &lt;TT&gt;$b&lt;/TT&gt;.&lt;/P&gt;
&lt;P&gt;Sorting, by default, is like:&lt;/P&gt;
&lt;C&gt;
  @sorted = sort { $a cmp $b } @unsorted;&lt;/C&gt;
&lt;P&gt;That is, ascending ASCIIbetical sorting.  You can leave out the block in that
case:&lt;/P&gt;
&lt;C&gt;
  @sorted = sort @unsorted;&lt;/C&gt;
&lt;P&gt;Now, if we had a list of strings, and we wanted to sort them, &lt;EM&gt;in a
case-insensitive manner&lt;/EM&gt;.  That means, we want to treat the strings as if they
were all lower-case or upper-case.  We could do something like:&lt;/P&gt;
&lt;C&gt;
  @sorted = sort { lc($a) cmp lc($b) } @unsorted;
  # or
  @sorted = sort { uc($a) cmp uc($b) } @unsorted;&lt;/C&gt;
&lt;DL&gt;
&lt;DT&gt;&lt;DD&gt;
&lt;STRONG&gt;Note:&lt;/STRONG&gt; There is a difference between these two sortings.  There are some
punctuation characters that come after upper-case letters and before lower-case
characters. Thus, strings that &lt;EM&gt;start&lt;/EM&gt; with such characters would be placed
differently in the sorted list, depending on whether we use &lt;TT&gt;lc()&lt;/TT&gt; or &lt;TT&gt;uc()&lt;/TT&gt;.
&lt;P&gt;&lt;/P&gt;&lt;/DL&gt;
&lt;P&gt;Now, this method of sorting is fine for small lists, but the &lt;TT&gt;lc()&lt;/TT&gt; (or &lt;TT&gt;uc()&lt;/TT&gt;)
function is called twice for &lt;EM&gt;each&lt;/EM&gt; comparison.  This might not seem bad, but
think about the consequences of performing massive calculations on your data:&lt;/P&gt;
&lt;C&gt;
  sub age_or_name {
    my ($name_a, $age_a) = split /_/ =&gt; $a;
    my ($name_b, $age_b) = split /_/ =&gt; $b;
    return ($age_a &lt;=&gt; $age_b or $name_a cmp $name_b);
  }
  
  @people = qw( Jeff_19 Jon_14 Ray_18 Tim_14 Joan_20 Greg_19 );
  @sorted = sort age_or_name @people;
  # @sorted is now
  #   qw( Jon_14 Tim_14 Ray_18 Greg_19 Jeff_19 Joan_20 )&lt;/C&gt;
&lt;P&gt;This gets to be tedious.  There's obviously too much work being done.  We
should only have to split the strings once.&lt;/P&gt;
&lt;P&gt;
&lt;H3&gt;&lt;A NAME="exercises"&gt;Exercises&lt;/A&gt;&lt;/H3&gt;
&lt;OL&gt;
&lt;LI&gt;
Create a sorting subroutine to sort by the length of a string, or, if needed,
by its first five characters.
&lt;C&gt;
  @sorted = sort { ... } @strings;&lt;/C&gt;
&lt;P&gt;&lt;/P&gt;
&lt;LI&gt;
Sort the following data structure by the value of the key specified by the "cmp"
key:
&lt;C&gt;
  @nodes = (
    { id =&gt; 17, size =&gt; 300, keys =&gt; 2, cmp =&gt; 'keys' },
    { id =&gt; 14, size =&gt; 104, keys =&gt; 9, cmp =&gt; 'size' },
    { id =&gt; 31, size =&gt; 2045, keys =&gt; 43, cmp =&gt; 'keys' },
    { id =&gt; 28, size =&gt; 6, keys =&gt; 0, cmp =&gt; 'id' },
  );&lt;/C&gt;
&lt;P&gt;&lt;/P&gt;&lt;/OL&gt;
&lt;P&gt;
&lt;H2&gt;&lt;A NAME="the orcish maneuver"&gt;The Orcish Maneuver&lt;/A&gt;&lt;/H2&gt;
&lt;P&gt;This method of speeding up sorting comparisons was named by Joseph Hall.  It
uses a hash to cache values of the complex calculations you need to make:&lt;/P&gt;
&lt;C&gt;
  {
    my %cache;  # cache hash is only seen by this function
    
    sub age_or_name {
      my $data_a =
        ($cache{$a} ||= [ split /_/ =&gt; $a ]);
      my $data_b =
        ($cache{$b} ||= [ split /_/ =&gt; $b ]);
      return (
        $data_a-&gt;[1] &lt;=&gt; $data_b-&gt;[1]
        or
        $data_a-&gt;[0] &lt;=&gt; $data_b-&gt;[0]
      );
    }
  }
  
  @people = qw( Jeff_19 Jon_14 Ray_18 Tim_14 Joan_20 Greg_19 );
  @sorted = sort age_or_name @people;&lt;/C&gt;
&lt;P&gt;This procedure here uses a hash of array references to store the name and age
for each person.  The first time a string is used in the sorting subroutine, it
doesn't have an entry in the &lt;TT&gt;%cache&lt;/TT&gt; hash, so the &lt;TT&gt;||&lt;/TT&gt; part is used.&lt;/P&gt;
&lt;P&gt;That is where this gets its name -- it is the OR-cache manuever, which can be
lovingly pronounced "orcish".&lt;/P&gt;
&lt;P&gt;The main structure of Orcish sorting is:&lt;/P&gt;
&lt;C&gt;
  {
    my %cache;
    
    sub function {
      my $data_a = ($cache{$a} ||= mangle($a));
      my $data_b = ($cache{$b} ||= mangle($b));
      # compare as needed
    }
  }&lt;/C&gt;
&lt;P&gt;where &lt;TT&gt;mangle()&lt;/TT&gt; is some function that does the necessary calculations on the
data.&lt;/P&gt;
&lt;P&gt;
&lt;H3&gt;&lt;A NAME="exercises"&gt;Exercises&lt;/A&gt;&lt;/H3&gt;
&lt;OL&gt;
&lt;LI&gt;
Why should you make the caching hash viewable only by the sorting function?  And
how is this accomplished?
&lt;P&gt;&lt;/P&gt;
&lt;LI&gt;
Use the Orcish Manuever to sort a list of strings in the same way as described in
the first exercise from "Na&amp;iuml;ve Sorting".
&lt;P&gt;&lt;/P&gt;&lt;/OL&gt;
&lt;P&gt;
&lt;H2&gt;&lt;A NAME="radix sort"&gt;Radix Sort&lt;/A&gt;&lt;/H2&gt;
&lt;P&gt;If you have a set of strings of constant width (or that can easily be made in
constant width), you can employ &lt;EM&gt;radix sort&lt;/EM&gt;.  This method gets around calling
Perl's &lt;TT&gt;sort()&lt;/TT&gt; function altogether.&lt;/P&gt;
&lt;P&gt;The concept of radix sort is as follows.  Assume you have &lt;EM&gt;N&lt;/EM&gt; strings of &lt;EM&gt;k&lt;/EM&gt;
characters in length, and each character can have one of &lt;EM&gt;x&lt;/EM&gt; values (for ASCII,
&lt;EM&gt;x&lt;/EM&gt; is 256).  We then create &lt;EM&gt;x&lt;/EM&gt; "buckets", and each bucket can hold at most
&lt;EM&gt;N&lt;/EM&gt; strings.&lt;/P&gt;
&lt;P&gt;Here is a sample list of data for &lt;EM&gt;N&lt;/EM&gt; = 7, &lt;EM&gt;k&lt;/EM&gt; = 4, and &lt;EM&gt;x&lt;/EM&gt; = 256: john, bear,
roar, boat, vain, vane, zany.&lt;/P&gt;
&lt;P&gt;We then proceed to place each string into the bucket corresponding to the ASCII
value of its rightmost character.  If we were then to print the contents of the
buckets after this first placement, our sample list would look like:  van&lt;STRONG&gt;e&lt;/STRONG&gt;,
joh&lt;STRONG&gt;n&lt;/STRONG&gt; vai&lt;STRONG&gt;n&lt;/STRONG&gt; bea&lt;STRONG&gt;r&lt;/STRONG&gt; roa&lt;STRONG&gt;r&lt;/STRONG&gt; boa&lt;STRONG&gt;t&lt;/STRONG&gt; zan&lt;STRONG&gt;y&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;Then, we use the character immediately to the left of the one just used, and put
the strings in the buckets accordingly.  This is done in the order in which they
are found in the buckets.  The new list is: be&lt;STRONG&gt;a&lt;/STRONG&gt;r, ro&lt;STRONG&gt;a&lt;/STRONG&gt;r, bo&lt;STRONG&gt;a&lt;/STRONG&gt;t, jo&lt;STRONG&gt;h&lt;/STRONG&gt;n,
va&lt;STRONG&gt;i&lt;/STRONG&gt;n, va&lt;STRONG&gt;n&lt;/STRONG&gt;e, za&lt;STRONG&gt;n&lt;/STRONG&gt;y.&lt;/P&gt;
&lt;P&gt;On the next round, the list becomes: v&lt;STRONG&gt;a&lt;/STRONG&gt;in, v&lt;STRONG&gt;a&lt;/STRONG&gt;ne, z&lt;STRONG&gt;a&lt;/STRONG&gt;ny, b&lt;STRONG&gt;e&lt;/STRONG&gt;ar, r&lt;STRONG&gt;o&lt;/STRONG&gt;ar,
b&lt;STRONG&gt;o&lt;/STRONG&gt;at, j&lt;STRONG&gt;o&lt;/STRONG&gt;hn.&lt;/P&gt;
&lt;P&gt;On the final round, the list is: &lt;STRONG&gt;b&lt;/STRONG&gt;ear, &lt;STRONG&gt;b&lt;/STRONG&gt;oat, &lt;STRONG&gt;j&lt;/STRONG&gt;ohn, &lt;STRONG&gt;r&lt;/STRONG&gt;oar, &lt;STRONG&gt;v&lt;/STRONG&gt;ain,
&lt;STRONG&gt;v&lt;/STRONG&gt;ane, &lt;STRONG&gt;z&lt;/STRONG&gt;any.&lt;/P&gt;
&lt;P&gt;This amount of time this sorting takes is constant, and easily calculated.  If
we assume that all the data is the same length, then we take &lt;EM&gt;N&lt;/EM&gt; strings, and
multiply that by &lt;EM&gt;k&lt;/EM&gt; characters.  The algorithm also uses some extra space for
storing the strings -- it needs an extra &lt;EM&gt;Nk&lt;/EM&gt; bytes.  If the data needs to be
padded, there is some extra time involved (if a character is undefined, it is
set as a NUL ("\0")).&lt;/P&gt;
&lt;P&gt;Here is a radix implementation.  It returns the list it is given in ASCIIbetical
order, like &lt;TT&gt;sort @list&lt;/TT&gt; would.&lt;/P&gt;
&lt;C&gt;
  # sorts in-place (meaning @list gets changed)
  # set $unknown to true to indicate variable length
  radix_sort(\@list, $unknown);&lt;/C&gt;
&lt;C&gt;
  sub radix_sort {
    my ($data, $k) = @_;
    $k = !!$k;  # turn any true value into 1
    
    if ($k) { $k &lt; length and $k = length for @$data }
    else { $k = length $data-&gt;[0] }
    
    while ($k--) {
      my @buckets;
      for (@$data) {
        my $c = substr $_, $k, 1;  # get char
        $c = "\0" if not defined $c;
        push @{ $buckets[ord($c)] }, $_;
      }
    
      @$data = map @$_, @buckets;  # expand array refs
    }
  }&lt;/C&gt;
&lt;P&gt;You'll notice the first argument to this function is an array &lt;EM&gt;reference&lt;/EM&gt;.  By
doing this, we save copying a potentially large list, thus taking up less space,
and running faster.  If, for beauty reasons, you'd prefer not to backslash your
array, you could use prototypes:&lt;/P&gt;
&lt;C&gt;
  sub radix_sort (\@;$);
  
  radix_sort @list, $unknown;
  
  sub radix_sort (\@;$) {
    # ...
  }&lt;/C&gt;
&lt;P&gt;You could combine the declaration and the definition of the function, but the
prototype must be seen before the function call.&lt;/P&gt;
&lt;P&gt;
&lt;H3&gt;&lt;A NAME="exercises"&gt;Exercises&lt;/A&gt;&lt;/H3&gt;
&lt;OL&gt;
&lt;LI&gt;
Why does radix sort start with the right-most character in a string?
&lt;P&gt;&lt;/P&gt;
&lt;LI&gt;
Does the order of the elements in the input list effect the run-time of this
sorting algorithm?  What happens if the elements are already sorted?  Or in the
reverse sorted order?
&lt;P&gt;&lt;/P&gt;&lt;/OL&gt;
&lt;P&gt;
&lt;H2&gt;&lt;A NAME="sorting by index"&gt;Sorting by Index&lt;/A&gt;&lt;/H2&gt;
&lt;P&gt;Given the choice between sorting three lists and sorting one list, you'd choose
sorting one list, right?  Good.  This, then, is the strategy employed when you
sort by &lt;EM&gt;index&lt;/EM&gt;.  If you have three arrays that hold different information, yet
for a given index, the elements are all related -- we say these arrays hold data
in &lt;EM&gt;parallel&lt;/EM&gt; -- then it seems far too much work to sort all three arrays.&lt;/P&gt;
&lt;C&gt;
  @names  = qw( Jeff Jon Ray Tim Joan Greg );
  @ages   = qw( 19   14  18  14  20   19   );
  @gender = qw( m    m   m   m   f    m    );&lt;/C&gt;
&lt;P&gt;Here, all the data at index 3 ("Tim", 14, "m") is related, as it is for any other
index.  Now, if we wanted to sort these arrays so that this relationship stood,
but the lists were sorted in terms of age and then by name, then we would like
our data to look like:&lt;/P&gt;
&lt;C&gt;
  @names  = qw( Jon Tim Ray Greg Jeff Joan );
  @ages   = qw( 14  14  18  19   19   20   );
  @gender = qw( m   m   m   m    m    f    );&lt;/C&gt;
&lt;P&gt;But to actually sort &lt;EM&gt;these&lt;/EM&gt; lists requires 3 times the effort.  Instead, we
will sort the &lt;EM&gt;indices&lt;/EM&gt; of the arrays (from 0 to 5).  This is the function we
will use:&lt;/P&gt;
&lt;C&gt;
  sub age_or_name {
    return (
      $ages[$a] &lt;=&gt; $ages[$b]
      or
      $names[$a] cmp $names[$b]
    )
  }&lt;/C&gt;
&lt;P&gt;And here it is in action:&lt;/P&gt;
&lt;C&gt;
  @idx = sort age_or_name 0 .. $#ages;
  print "@ages\n";        # 19 14 18 14 20 19
  print "@idx\n";         #  1  3  2  5  0  4
  print "@ages[@idx]\n";  # 14 14 18 19 19 20&lt;/C&gt;
&lt;P&gt;As you can see, the array isn't touched, but the indices are given in such an
order that fetching the elements of the array in that order yields sorted data.&lt;/P&gt;
&lt;DL&gt;
&lt;DT&gt;&lt;DD&gt;
&lt;STRONG&gt;Note:&lt;/STRONG&gt; the &lt;TT&gt;$#ages&lt;/TT&gt; variable is related to the &lt;TT&gt;@ages&lt;/TT&gt; array -- it holds the
highest index used in the array, so for an array of 6 elements, &lt;TT&gt;$#array&lt;/TT&gt; is 5.
&lt;P&gt;&lt;/P&gt;&lt;/DL&gt;
&lt;P&gt;
&lt;H2&gt;&lt;A NAME="schwartzian transforms"&gt;Schwartzian Transforms&lt;/A&gt;&lt;/H2&gt;
&lt;P&gt;A common (and rather popular) idiom in Perl programming is the Schwartzian
Transform, an approach which is like "you set 'em up, I'll knock 'em down!"  It
uses the &lt;TT&gt;map()&lt;/TT&gt; function to transform the incoming data into a list of simple
data structures.  This way, the machinations done to the data set are only done
once (as in the Orcish Manuever).&lt;/P&gt;
&lt;P&gt;The general appearance of the transform is like so:&lt;/P&gt;
&lt;C&gt;
  @sorted =
    map { get_original_data($_) }
    sort { ... }
    map { transform_data($_) }
    @original;&lt;/C&gt;
&lt;P&gt;They are to be read in reverse order, since the first thing done is the &lt;TT&gt;map()&lt;/TT&gt;
that transforms the data, then the sorting, and then the &lt;TT&gt;map()&lt;/TT&gt; to get the
original data back.&lt;/P&gt;
&lt;P&gt;Let's say you had lines of a password file that were formatted as:&lt;/P&gt;
&lt;C&gt;
  username:password:shell:name:dir&lt;/C&gt;
&lt;P&gt;and you wanted to sort first by shell, then by name, and then by username.  A
Schwartzian Transform could be used like this:&lt;/P&gt;
&lt;C&gt;
  @sorted =
    map { $_-&gt;[0] }
    sort {
      $a-&gt;[3] cmp $b-&gt;[3]
      or
      $a-&gt;[4] cmp $b-&gt;[4]
      or
      $a-&gt;[1] cmp $b-&gt;[1]
    }
    map { [ $_, split /:/ ] }
    @entries;&lt;/C&gt;
&lt;P&gt;We'll break this down into the individual parts.&lt;/P&gt;
&lt;DL&gt;
&lt;DT&gt;&lt;STRONG&gt;&lt;A NAME="item_Step_1%2E_Transform_your_data%2E"&gt;Step 1. Transform your data.&lt;/A&gt;&lt;/STRONG&gt;&lt;BR&gt;
&lt;DD&gt;
We create a list of array references; each reference holds the original record,
and then each of the fields (as separated by colons).
&lt;C&gt;
  @transformed = map { [ $_, split /:/ ] } @entries;&lt;/C&gt;
&lt;P&gt;That could be written in a &lt;TT&gt;for&lt;/TT&gt;-loop, but understanding &lt;TT&gt;map()&lt;/TT&gt; is a powerful
tool in Perl.&lt;/P&gt;
&lt;C&gt;
  for (@entries) {
    push @transformed, [ $_, split /:/ ];
  }&lt;/C&gt;
&lt;P&gt;&lt;/P&gt;
&lt;DT&gt;&lt;STRONG&gt;&lt;A NAME="item_Step_2%2E_Sort_your_data%2E"&gt;Step 2. Sort your data.&lt;/A&gt;&lt;/STRONG&gt;&lt;BR&gt;
&lt;DD&gt;
Now, we sort on the needed fields.  Since the first element of our references is
the original string, the username is element 1, the name is element 4, and the
shell is element 3.
&lt;C&gt;
  @transformed = sort {
    $a-&gt;[3] cmp $b-&gt;[3]
    or
    $a-&gt;[4] cmp $b-&gt;[4]
    or
    $a-&gt;[1] cmp $b-&gt;[1]
  } @transformed;&lt;/C&gt;
&lt;P&gt;&lt;/P&gt;
&lt;DT&gt;&lt;STRONG&gt;&lt;A NAME="item_Step_3%2E_Restore_your_original_data%2E"&gt;Step 3. Restore your original data.&lt;/A&gt;&lt;/STRONG&gt;&lt;BR&gt;
&lt;DD&gt;
Finally, get the original data back from the structure:
&lt;C&gt;
  @sorted = map { $_-&gt;[0] } @transformed;&lt;/C&gt;
&lt;P&gt;&lt;/P&gt;&lt;/DL&gt;
&lt;P&gt;And that's all there is to it.  It may look like a daunting structure, but it is
really just three Perl statements strung together.&lt;/P&gt;
&lt;P&gt;
&lt;H2&gt;&lt;A NAME="guttmanrosler transforms"&gt;Guttman-Rosler Transforms&lt;/A&gt;&lt;/H2&gt;
&lt;P&gt;Perl's regular sorting is very fast.  It's optimized.  So it'd be nice to be able
to use it whenever possible.  That is the foundation of the Guttman-Rosler
Transform, called the GRT, for short.&lt;/P&gt;
&lt;P&gt;The frame of a GRT is:&lt;/P&gt;
&lt;C&gt;
  @sorted =
    map { restore($_) }
    sort
    map { normalize($_) }
    @original;&lt;/C&gt;
&lt;P&gt;An interesting application of the GRT is to sort strings in a case-insensitive
manner.  First, we have to find the longest run of &lt;TT&gt;NUL&lt;/TT&gt;s in all the strings
(for a reason you'll soon see).&lt;/P&gt;
&lt;C&gt;
  my $nulls = 0;
  
  # find length of longest run of NULs
  for (@original) {
    for (/(\0+)/g) {
      $nulls = length($1) if length($1) &gt; $nulls;
    }
  }&lt;/C&gt;
&lt;C&gt;
  $NUL = "\0" x ++$nulls;&lt;/C&gt;
&lt;P&gt;Now, we have a string of nulls, whose length is one greater than the largest run
of nulls in the strings.  This will allow us to safely separate the lower-case
version of the strings from the original strings:&lt;/P&gt;
&lt;C&gt;
  # "\L...\E" is like lc(...)
  @normalized = map { "\L$_\E$NUL$_" } @original;&lt;/C&gt;
&lt;P&gt;Now, we can just send this to sort.&lt;/P&gt;
&lt;C&gt;
  @sorted = sort @normalized;&lt;/C&gt;
&lt;P&gt;And then to get back the data we had before, we split on the nulls:&lt;/P&gt;
&lt;C&gt;
  @sorted = map { (split /$NUL/)[1] } @original;&lt;/C&gt;
&lt;P&gt;Putting it all together, we have:&lt;/P&gt;
&lt;C&gt;
  # implement our for loop from above
  # as a function
  $NUL = get_nulls(\@original);
  
  @sorted =
    map { (split /$NUL/)[1] }
    sort
    map { "\L$_\E$NUL$_" }
    @original;&lt;/C&gt;
&lt;P&gt;The reason we use the &lt;TT&gt;NUL&lt;/TT&gt; character is because it has an ASCII value of 0, so
it's always less than or equal to any &lt;EM&gt;other&lt;/EM&gt; character.  Another way to
approach this is to pad the string with nulls so they all become the same length:&lt;/P&gt;
&lt;C&gt;
  # see Exercise 1 for this function
  $maxlen = maxlen(\@original);&lt;/C&gt;
&lt;C&gt;
  @sorted =
    map { substr($_, $maxlen) }
    sort
    map { lc($_) . ("\0" x ($maxlen - length)) . $_ }
    @original;&lt;/C&gt;
&lt;P&gt;Common functions used in a GRT are &lt;TT&gt;pack()&lt;/TT&gt;, &lt;TT&gt;unpack()&lt;/TT&gt;, and &lt;TT&gt;substr()&lt;/TT&gt;.  The
goal of a GRT is to make your data presentable as a string that will work in a
regular comparison.&lt;/P&gt;
&lt;P&gt;
&lt;H3&gt;&lt;A NAME="exercises"&gt;Exercises&lt;/A&gt;&lt;/H3&gt;
&lt;OL&gt;
&lt;LI&gt;
Write the &lt;TT&gt;maxlen()&lt;/TT&gt; function for the previous chunk of code.
&lt;P&gt;&lt;/P&gt;&lt;/OL&gt;
&lt;P&gt;
&lt;H2&gt;&lt;A NAME="portability"&gt;Portability&lt;/A&gt;&lt;/H2&gt;
&lt;P&gt;You can make a function to be used by &lt;TT&gt;sort()&lt;/TT&gt; to avoid writing potentially
messy sorting code inline.  For example, our Schwartzian Transform:&lt;/P&gt;
&lt;C&gt;
  @sorted =
    {
    $a-&gt;[3] cmp $b-&gt;[3]
    or
    $a-&gt;[4] cmp $b-&gt;[4]
    or
    $a-&gt;[1] cmp $b-&gt;[1]
  }&lt;/C&gt;
&lt;P&gt;However, if you want to declare that function in one package, and use it in
another, you run into problems.&lt;/P&gt;
&lt;C&gt;
  #!/usr/bin/perl -w&lt;/C&gt;
&lt;C&gt;
  package Sorting;&lt;/C&gt;
&lt;C&gt;
  sub passwd_cmp {
    $a-&gt;[3] cmp $b-&gt;[3]
    or
    $a-&gt;[4] cmp $b-&gt;[4]
    or
    $a-&gt;[1] cmp $b-&gt;[1]
  }&lt;/C&gt;
&lt;C&gt;
  sub case_insensitive_cmp {
    lc($a) cmp lc($b)
  }&lt;/C&gt;
&lt;C&gt;
  package main;&lt;/C&gt;
&lt;C&gt;
  @strings = sort Sorting::case_insensitive_cmp
    qw( this Mine yours Those THESE nevER );&lt;/C&gt;
&lt;C&gt;
  print "&lt;@strings&gt;\n";&lt;/C&gt;
&lt;C&gt;
  __END__
  &lt;this Mine yours Those THESE nevER&gt;&lt;/C&gt;
&lt;P&gt;This code doesn't change the order of the strings.  The reason is because
&lt;TT&gt;$a&lt;/TT&gt; and &lt;TT&gt;$b&lt;/TT&gt; in the sorting subroutine belong to &lt;TT&gt;Sorting::&lt;/TT&gt;, but the
&lt;TT&gt;$a&lt;/TT&gt; and &lt;TT&gt;$b&lt;/TT&gt; that &lt;TT&gt;sort()&lt;/TT&gt; is making belong to &lt;TT&gt;main::&lt;/TT&gt;.&lt;/P&gt;
&lt;P&gt;To get around this, you can give the function a prototype, and then it will
be passed the two arguments.&lt;/P&gt;
&lt;C&gt;
  #!/usr/bin/perl -w&lt;/C&gt;
&lt;C&gt;
  package Sorting;&lt;/C&gt;
&lt;C&gt;
  sub passwd_cmp ($$) {
    local ($a, $b) = @_;
    $a-&gt;[3] cmp $b-&gt;[3]
    or
    $a-&gt;[4] cmp $b-&gt;[4]
    or
    $a-&gt;[1] cmp $b-&gt;[1]
  }&lt;/C&gt;
&lt;C&gt;
  sub case_insensitive_cmp ($$) {
    local ($a, $b) = @_;
    lc($a) cmp lc($b)
  }&lt;/C&gt;
&lt;C&gt;
  package main;&lt;/C&gt;
&lt;C&gt;
  @strings = sort Sorting::case_insensitive_cmp
    qw( this Mine yours Those THESE nevER );&lt;/C&gt;
&lt;C&gt;
  print "&lt;@strings&gt;\n";&lt;/C&gt;
&lt;C&gt;
  __END__
  &lt;Mine nevER THESE this Those yours&gt;&lt;/C&gt;
&lt;P&gt;
&lt;HR&gt;
&lt;H1&gt;&lt;A NAME="author"&gt;AUTHOR&lt;/A&gt;&lt;/H1&gt;
&lt;P&gt;Jeff &lt;TT&gt;japhy&lt;/TT&gt; Pinyan, &lt;EM&gt;&lt;A HREF="mailto:japhy@perlmonk.org"&gt;japhy@perlmonk.org&lt;/A&gt;&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;&lt;A HREF="http://japhy.perlmonk.org/"&gt;http://japhy.perlmonk.org/&lt;/A&gt;&lt;/EM&gt;&lt;/P&gt;</field>
</data>
</node>
