Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw

Schwartzian Transform

by merlyn (Sage)
on Apr 26, 2000 at 01:12 UTC ( #9108=snippet: print w/replies, xml ) Need Help??
Description: OK - gotta add it, because I can. {grin}

For a complete description of the steps, see my column on sorting.

my @output =
  map { $_->[0] }
  sort { $a->[1] cmp $b->[1] }
  map { [$_, expensive_func($_)] }
Replies are listed 'Best First'.
RE: Schwartzian Transform
by perlmonkey (Hermit) on Apr 27, 2000 at 12:36 UTC
    I thought I would add to this snippet since I had to stare at the code for far to long before I figured out how I might use it.

    Okay so we know it is for sorting, but why is it useful? It is a heck of a lot faster for sorting large data sets where the data has to be manipulated before sorting. The example I came up with is this:

    We have a huge list of names, and we want to sort them alphabetically by last name, first name, then middle name. That doesnt sound too bad. But the problem is that our database does not store last names, first names or middle names seperately. Instead there is just a single name field like "Arthur C. Clarke". So when we get all the data from the database first we have to put the string in alphapbetical sort order (ie make "Arthur C. Clarke" into "Clarke Arthur C.") then sort it.

    Someone without Randal Schwartz's insight (namely me) would have done something like this:
    my @sorted = sort mysort @unsorted; sub mysort { $a =~ m/(.*?)\s*(\S+)$/; $aa = "$2 $1"; $b =~ m/(.*?)\s*(\S+)$/; $bb = "$2 $1"; return $aa cmp $bb; }
    Now this would be bad for large data sets because each element in @unsorted will get sent to mysort several time, so we are then performing the match several times where we dont really need to.

    So in order to only perform the match once we go through the array and replaces each element with an array reference containing the original element (dont loose the original!) and the new sortable string. So the first step:
    @unsorted = ("Larry Wall", "Arthur C. Clarke"); @a = map { m/(.*?)\s*(\S+)$/; [$_, "$2 $1" ] } @unsorted; # # now @a is (["Larry Wall", "Wall Larry"], # ["Arthur C. Clarke", "Clarke Arthur C."]) #
    Now we do a simple sort on the second element in @a: @b = sort {$a->[1] cmp $b->[1]} @a; And finally we turn the 2D array back into a sorted 1D array restoring the original values: @sorted = map {$_->[0]} @b; If you cascade the function calls, then you remove the intermediate steps of my @a, and @b and you get the Schwartzian Transform:
    my @sorted = map{ $_->[0] } sort {$a->[1] cmp $b->[1]} map { m/(.*?)\s*(\S+)$/; [$_, "$2 $1" ] } @unsorted;
    Now that is pretty cool stuff, but how much of a time savings is this extra work? Well, I did a small benchmarking test to mimic what a time savings would under a large data set:
    #!/usr/bin/perl use Benchmark; @unsorted = ("Larry Wall", "Jane Sally Doe", "John Doe", "Morphius", "Jane Alice Doe", "Arthur C. Clarke"); timethese(100000, { "schwartzian" => sub { my @sorted = map{ $_->[0] } sort {$a->[1] cmp $b->[1]} map { m/(.*?)\s*(\S+)$/; [$_, "$2 $1" +] } @unsorted }, "sort routine" => sub { my @sorted = sort mysort @unsorted } }); sub mysort { $a =~ m/(.*?)\s*(\S+)$/; $aa = "$2 $1"; $b =~ m/(.*?)\s*(\S+)$/; $bb = "$2 $1"; return $aa cmp $bb; }
    And the results:
    Benchmark: timing 100000 iterations of schwartzian, sort routine... schwartzian: 57 wallclock secs (53.66 usr + 0.30 sys = 53.96 CPU) sort routine: 124 wallclock secs (122.48 usr + 0.41 sys = 122.89 CPU)

    I hope this helps.
      Not a bad dissection, but it probably should be pointed out that Tom Christiansen covered it in one of his FMTEYEWTK (Far More Than Everything You Ever Wanted To Know) papers, which you can find under CPAN's documentation section, and it's covered in the Perl Cookbook (Christiansen and Nathan Torkington) in section 4.15.
        In fact, the Usenet message that became the "FMTYEWTK about Sorting" was the message in which the Schwartzian Transform got its name!

        In other words, it was named after me, but not by me. I'd never do anything as vain as that. Not that I'd admit to, anyway. :)

      i hate to get picky, but your Schwartzian transform doesn't look like merlyn's to me.

      my @sorted = map{ $_->[0] } sort {$a->[1] cmp $b->[1]} map { m/(.*?)\s*(\S+)$/; [$_, "$2 $1" ] } @unsorted;
      my @output = map { $_->[1] } sort { $a->[0] cmp $b->[1] } map { [$_, expensive_func($_)] } @input;
      is this because the ST is more general than i thought, or is there a typo in merlyn's ST writeup? i'm not sure i'm comfortable with any sort that compares $a->[0] to $b->[1]... is there ever a good reason to do that?

      update: looks like merlyn's changed his to compare index 1 of both arrays. which is almost a shame; i was hoping there was some really cool reason to compare different indices. but anyway, i withdraw the question.

      This helps.

      This node may or may not have said the same thing ten other people have said. But significantly, this time I understood!
RE (tilly) 1: Schwartzian Transform
by tilly (Archbishop) on Aug 17, 2000 at 19:12 UTC
    We discussed this one once. When you did it, it was not a big deal at the time as far as you were concerned. I absolutely understood.

    This falls out as a natural consequence of starting to think about Perl as a list-oriented language (which it is) instead of thinking like you would in C. It has often been mentioned that Perl is the Cliff Notes of Unix. Well that includes the value of thinking in terms of hooking together a bunch of trivial little filters into a pipeline.

    That is exactly you were doing.

    Unfortunately in Perl the syntax makes a pipeline read the wrong way. But let us use a fake notation where |> means "pipe to", what you are doing is the same as:

    @input |> map {[$_, &expensive_func($_)} |> sort { $a->[1] cmp $b->[1] } |> map {$_->[0]} |> my @output;
    And now, written in a pseudo imitation of a Unix pipeline where actions naturally read in the same order they execute in, my analogy to a Unix pipeline should be clear.

    For the record I often mentally write filters in a pseudo pipeline like the above, you apparently write them bottom to top. Either way anyone who gets past the notational issue and thinks "Unix pipeline" not only will find the Schwartzian sort trivial, they will also get a lot of insight into how to effectively use the underlying key idea for a wide variety of problems.

      But let us use a fake notation where |> means "pipe to",
      perl v6.0?
        I have no idea if it is under consideration. Probably not. I suggested it on p5p once to decidedly mixed reviews.

        I note that in many other languages similar algorithms are naturally coded as chained method calls, and that winds up reading left to right...

        perl v6.0?
        yep! Guess they liked '==>'.

        "A Jedi uses the Force for knowledge and defense, never for attack."
RE: Schwartzian Transform
by vroom (Pope) on Apr 26, 2000 at 01:51 UTC
RE: Schwartzian Transform
by perlcgi (Hermit) on Apr 26, 2000 at 15:42 UTC
    Fellow Seekers of Perl Wisdom, maybe I'm wrong, but I suspect the Schwartzian Transform be only good for datas that will fit in memory all at once. Given that Perl 5.6 can munge with files over 2Gb, (on systems that support files this big), can anyone suggest an elegant mod of the transform to accommodate big mother f.. files!
      Use a tied array (@input) that operates on a disk file?

        First of all mixing references and arrays tied to disk without thinking carefully about it is asking for serious problems.

        Secondly passing the array to Perl's sort function is asking for very serious trouble. That will pass it all into memory!

        What I would do for large data structures would be to use DB_File to tie a hash to a BTree, and use properly formatted strings as keys. Blech. But it will work up to the maximum file size for your OS. (OK, up to about half that - BTrees waste something like 40% of the space in the tree.) Or use a proper database.

        Thanks chromatic, Despite the tie of @input, should not the list of references to lists overflow with very large files. Also should @a and @b be tied and what about the anonymous array. Is it possible to tie anonymous arrays, and if so would it, and the all the others require serialization? TIA
RE: Schwartzian Transform
by BigJoe (Curate) on Aug 17, 2000 at 17:14 UTC
    I liked the write up about Schwartzian Transform that is in the book Linux Programming Unleashed.


    Learn patience, you must.
    Young PerlMonk, craves Not these things.
    Use the source Luke.
Schwartzian Transform vs. Building Hash of Function Results, Then Sorting on Function Results
by davebaker (Monk) on Oct 16, 2006 at 18:54 UTC
    I'm trying to better understand the Schwartzian Tranform (sorting a list by computable field). I've read the explanation in the Perl Cookbook, 2d ed. recipe 4.16. I also enjoyed Randal's 1996 article in the Unix Review, which describes a sortable-hash technique before it goes on to show the map-sort-map transform technique. The sortable-hash technique intrigues me. Other than coolness and nicer-looking code, what is the advantage of this:
    my @output = map { $_->[0] } sort { $a->[1] cmp $b->[1] } map { [$_, expensive_func($_)] } @input;
    over this:
    foreach $_ (@input) { $result_for{$_} = expensive_func($_); } my @output = sort { $result_for{$a} cmp $result_for{$b} } @input;
      Consider sorting a list of objects that have a stringification overload and two of those objects stringify to the same string. With the ST, everything just works. With the hash method, the two objects collapse to the same hash entry.

      -- Randal L. Schwartz, Perl hacker
      Be sure to read my standard disclaimer if this is a reply.

      In both cases, you are calculating the expensive_func() only once per item and sorting the list based off the result of the expensive_func(). The biggest difference is that in the standard ST, you do it in a single step and have no left over variable.

      Cheers - L~R

        Thanks! I hadn't considered the left over variable. I'd want to add this line before my sorted-hash code snippet:
        my %result_for;
        (That doesn't eliminate any drawbacks to having a left over variable as compared to the ST, but it would be safer coding I think.)
Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: snippet [id://9108]
[Lady_Aleena]: shmem, that is understandable! The two examples in File::Find don't make sense to me on a quick glance.
[marioroy]: LA if the find worked from Unix command line, or does it not. Likely a quoting issue inside Perl qx.
[Lady_Aleena]: marioroy, the find worked fine at the command line.
[marioroy]: LA, yeah. than there's no reason why it cannot work inside qx. But chatting is hard in PM. I cannot see the code now.
[shmem]: Lady_Aleena: sometimes a quick glance isn't enough.

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (8)
As of 2017-04-23 20:58 GMT
Find Nodes?
    Voting Booth?
    I'm a fool:

    Results (432 votes). Check out past polls.