RE: Schwartzian Transform
by perlmonkey (Hermit) on Apr 27, 2000 at 12:36 UTC
|
I thought I would add to this snippet since I had to stare at the code
for far to long before I figured out how I might use it.
Okay so we know it is for sorting, but why is it useful? It is
a heck of a lot faster for sorting large data sets where the data has to be
manipulated before sorting. The example I came up with is this:
We have a huge list of names, and we want to sort them alphabetically
by last name, first name, then middle name. That doesnt sound too
bad. But the problem is that our database does not store last names, first names
or middle names seperately. Instead there is just a single name field like
"Arthur C. Clarke". So when we get all the data from the database first we
have to put the string in alphapbetical sort order (ie make "Arthur C. Clarke"
into "Clarke Arthur C.") then sort it.
Someone without Randal Schwartz's insight (namely me) would have done something like
this:
my @sorted = sort mysort @unsorted;
sub mysort
{
$a =~ m/(.*?)\s*(\S+)$/;
$aa = "$2 $1";
$b =~ m/(.*?)\s*(\S+)$/;
$bb = "$2 $1";
return $aa cmp $bb;
}
Now this would be bad for large data sets because each element in @unsorted
will get sent to mysort several time, so we are then performing the match
several times where we dont really need to.
So in order to only perform the match once we go through the array
and replaces each element with an array reference containing the original
element (dont loose the original!) and the new sortable string.
So the first step:@unsorted = ("Larry Wall", "Arthur C. Clarke");
@a = map { m/(.*?)\s*(\S+)$/; [$_, "$2 $1" ] } @unsorted;
#
# now @a is (["Larry Wall", "Wall Larry"],
# ["Arthur C. Clarke", "Clarke Arthur C."])
#
Now we do a simple sort on the second element in @a:
@b = sort {$a->[1] cmp $b->[1]} @a;
And finally we turn the 2D array back into a sorted 1D array restoring
the original values:
@sorted = map {$_->[0]} @b;
If you cascade the function calls, then you remove the intermediate
steps of my @a, and @b and you get the Schwartzian Transform:
my @sorted =
map{ $_->[0] }
sort {$a->[1] cmp $b->[1]}
map { m/(.*?)\s*(\S+)$/; [$_, "$2 $1" ] }
@unsorted;
Now that is pretty cool stuff, but how much of a time savings is this extra
work? Well, I did a small benchmarking test to mimic what a time savings
would under a large data set:
#!/usr/bin/perl
use Benchmark;
@unsorted = ("Larry Wall",
"Jane Sally Doe",
"John Doe",
"Morphius",
"Jane Alice Doe",
"Arthur C. Clarke");
timethese(100000, {
"schwartzian" => sub { my @sorted =
map{ $_->[0] }
sort {$a->[1] cmp $b->[1]}
map { m/(.*?)\s*(\S+)$/; [$_, "$2 $1"
+] }
@unsorted },
"sort routine" => sub { my @sorted = sort mysort @unsorted }
});
sub mysort
{
$a =~ m/(.*?)\s*(\S+)$/;
$aa = "$2 $1";
$b =~ m/(.*?)\s*(\S+)$/;
$bb = "$2 $1";
return $aa cmp $bb;
}
And the results:
Benchmark: timing 100000 iterations of schwartzian, sort routine...
schwartzian: 57 wallclock secs (53.66 usr + 0.30 sys = 53.96 CPU)
sort routine: 124 wallclock secs (122.48 usr + 0.41 sys = 122.89 CPU)
I hope this helps. | [reply] [d/l] [select] |
|
Not a bad dissection, but it probably should be pointed out
that Tom Christiansen covered it in one
of his FMTEYEWTK (Far More Than Everything You Ever Wanted
To Know) papers, which you can find under CPAN's
documentation section, and it's covered
in the Perl Cookbook (Christiansen
and Nathan Torkington) in section 4.15.
| [reply] |
|
In fact, the Usenet message that became the "FMTYEWTK about Sorting" was the message in which the Schwartzian Transform got its name!
In other words, it was named after me, but not by me. I'd never do anything as vain as that. Not that I'd admit to, anyway. :)
| [reply] |
|
|
i hate to get picky, but your Schwartzian transform doesn't look like merlyn's to me.
perlmonkey's:
my @sorted =
map{ $_->[0] }
sort {$a->[1] cmp $b->[1]}
map { m/(.*?)\s*(\S+)$/; [$_, "$2 $1" ] }
@unsorted;
merlyn's:
my @output =
map { $_->[1] }
sort { $a->[0] cmp $b->[1] }
map { [$_, expensive_func($_)] }
@input;
is this because the ST is more general than i thought, or is there a typo in merlyn's ST writeup? i'm not sure i'm comfortable with any sort that compares $a->[0] to $b->[1]... is there ever a good reason to do that?
update: looks like merlyn's changed his to compare index 1 of both arrays. which is almost a shame; i was hoping there was some really cool reason to compare different indices. but anyway, i withdraw the question.
| [reply] [d/l] [select] |
|
This helps.
This node may or may not have said the same thing ten other people have said. But significantly, this time I understood!
Thanks
| [reply] |
RE (tilly) 1: Schwartzian Transform
by tilly (Archbishop) on Aug 17, 2000 at 19:12 UTC
|
We discussed this one once. When you did it, it was not a
big deal at the time as far as you were concerned. I
absolutely understood.
This falls out as a natural consequence of starting to
think about Perl as a list-oriented language (which it is)
instead of thinking like you would in C. It has often been
mentioned that Perl is the Cliff Notes of Unix. Well that
includes the value of thinking in terms of hooking together a bunch of trivial little filters into a
pipeline.
That is exactly you were doing.
Unfortunately in Perl the syntax makes a pipeline read the
wrong way. But let us use a fake notation where |> means "pipe to", what you are doing is the same as:
@input |> map {[$_, &expensive_func($_)}
|> sort { $a->[1] cmp $b->[1] }
|> map {$_->[0]}
|> my @output;
And now, written in a pseudo imitation of a Unix pipeline
where actions naturally read in the same order they execute
in, my analogy to a Unix pipeline should be clear.
For the record I often mentally write filters in a pseudo
pipeline like the above, you apparently write them bottom
to top. Either way anyone who gets past the notational
issue and thinks "Unix pipeline" not only will find the
Schwartzian sort trivial, they will also get a lot of
insight into how to effectively use the underlying key
idea for a wide variety of problems. | [reply] [d/l] |
|
But let us use a fake notation where |> means "pipe to",
perl v6.0?
| [reply] |
|
| [reply] |
|
|
perl v6.0?
yep! Guess they liked '==>'.
---
"A Jedi uses the Force for knowledge and defense, never for attack."
| [reply] |
RE: Schwartzian Transform
by vroom (His Eminence) on Apr 26, 2000 at 01:51 UTC
|
| [reply] |
RE: Schwartzian Transform
by perlcgi (Hermit) on Apr 26, 2000 at 15:42 UTC
|
Fellow Seekers of Perl Wisdom, maybe I'm wrong, but I suspect the Schwartzian Transform be only good for datas that will fit in memory all at once. Given that Perl 5.6 can munge with files over 2Gb, (on systems that support files this big), can anyone suggest an elegant mod of the transform to accommodate big mother f.. files!
| [reply] |
|
Use a tied array (@input) that operates on a disk file?
| [reply] |
|
No.
First of all mixing references and arrays tied to disk without thinking carefully about it is asking for serious problems.
Secondly passing the array to Perl's sort function is asking for very serious trouble. That will pass it all
into memory!
What I would do for large data structures would be to use DB_File to tie a hash to a BTree, and use properly formatted strings as keys. Blech. But it will work up to the maximum file size for your OS. (OK, up to about half that - BTrees waste something like 40% of the space in the tree.) Or use a proper database.
| [reply] |
|
Thanks chromatic,
Despite the tie of @input, should not the list of references to lists overflow with very large files. Also should @a and @b be tied and what about the anonymous array. Is it possible to tie anonymous arrays, and if so would it, and the all the others require serialization?
TIA
| [reply] |
RE: Schwartzian Transform
by BigJoe (Curate) on Aug 17, 2000 at 17:14 UTC
|
I liked the write up about Schwartzian Transform that is in the book Linux Programming Unleashed.
--BigJoe
Learn patience, you must. Young PerlMonk, craves Not these things. Use the source Luke. | [reply] |
Schwartzian Transform vs. Building Hash of Function Results, Then Sorting on Function Results
by davebaker (Pilgrim) on Oct 16, 2006 at 18:54 UTC
|
I'm trying to better understand the Schwartzian Tranform (sorting a list by computable field). I've read the explanation in the Perl Cookbook, 2d ed. recipe 4.16. I also enjoyed Randal's 1996 article in the Unix Review, which describes a sortable-hash technique before it goes on to show the map-sort-map transform technique. The sortable-hash technique intrigues me.
Other than coolness and nicer-looking code, what is the advantage of this:
my @output =
map { $_->[0] }
sort { $a->[1] cmp $b->[1] }
map { [$_, expensive_func($_)] }
@input;
over this:
foreach $_ (@input) {
$result_for{$_} = expensive_func($_);
}
my @output
= sort { $result_for{$a} cmp $result_for{$b}
} @input;
| [reply] [d/l] [select] |
|
Consider sorting a list of objects that have a stringification overload and two of those objects stringify to the same string. With the ST, everything just works. With the hash method, the two objects collapse to the same hash entry.
| [reply] |
|
davebaker,
In both cases, you are calculating the expensive_func() only once per item and sorting the list based off the result of the expensive_func(). The biggest difference is that in the standard ST, you do it in a single step and have no left over variable.
| [reply] |
|
Thanks! I hadn't considered the left over variable. I'd want to add this line before my sorted-hash code snippet:
my %result_for;
(That doesn't eliminate any drawbacks to having a left over variable as compared to the ST, but it would be safer coding I think.) | [reply] [d/l] |