Your skill will accomplishwhat the force of many cannot PerlMonks

### Unusual sorting requirements; comparing three implementations.

by tobyink (Abbot)
 on Oct 24, 2012 at 09:41 UTC Need Help??

An esteemed monk last night mentioned a problem of sorting a list of people alphabetically, but so that people with a job title of "Manager" are floated to the top of the list. I'm sure we can all think of similar problems we've had. For instance, sorting task/todo items by date, but sorting high priority tasks first.

Anyhow, it was late and I mumbled something about part from List::MoreUtils and wondered off to the monastery sleeping quarters.

Here I'd like to compare three implementations. Firstly, the naive approach:

```my @sorted = sort {
if (\$a->title =~ /Manager/ and not \$b->title =~ /Manager/) {
-1;
}
elsif (\$b->title =~ /Manager/ and not \$a->title =~ /Manager/) {
1;
}
else {
\$a->name cmp \$b->name;
}
} @employees;

If A is a manager but not B, then A goes first. If B is a manager, but not A, then B goes first. Otherwise (both are managers, or neither are managers), then sort by name. Simple; easy to understand.

But what are the drawbacks? The regexp is used four times, resulting in a lot of duplication. OK, so we could say my \$test = qr/Manager/ and stop repeating ourselves, but even so, the regexp is still being matched against quite a lot. In the case where A and B are both managers, the regexp is matched against four times within the block. And that happens for each pair of people compared.

OK, so actually we can use the return value of =~ (or !~) numerically and turn that to our advantage. A slightly subtler approach...

```my @sorted = sort {
(\$a->title !~ /Manager/) - (\$b->title !~ /Manager/)
or \$a->name cmp \$b->name;
} @employees;

Here the regexp match happens only twice within the block, which should improve the efficiency of the sorting.

Now let's take a look at what I was jabbering on about last night; using part to split the list into a list of managers and a list of non-managers. That way we can sort the lists separately just based on their names (quick and easy) and then join them together again.

```use List::MoreUtils qw( part );

my @sorted =
map  { sort { \$a->name cmp \$b->name } @\$_ }
part { \$_->title !~ /Manager/ }
@employees;

It's perhaps a little less readable than the first, naive approach, but it has some nice features. The /Manager/ regexp isn't repeated, and is only matched once for each person.

For those who are interested, here's a Test::More script that proves all three approaches work...

```use v5.14;
use strict;
use warnings;
use Test::More;

package Person {
use Moo;
has name  => (is => 'ro');
has title => (is => 'ro');
}

sub Person {
my (\$name, \$title) = @_;
Person::->new(name => \$name, title => \$title);
}

my @employees = (
Person("Eve", "Tech Support"),
Person("Alice", "Sales Manager"),
Person("Dave", "Janitor"),
Person("Fred", "Receptionist"),
Person("Bob", "Customer Services"),
Person("Greg", "Finance Manager"),
Person("Carol", "Marketing Manager"),
);

my @correct = (
Person("Alice", "Sales Manager"),
Person("Carol", "Marketing Manager"),
Person("Greg", "Finance Manager"),
Person("Bob", "Customer Services"),
Person("Dave", "Janitor"),
Person("Eve", "Tech Support"),
Person("Fred", "Receptionist"),
);

OBVIOUS_APPROACH: {

my @sorted = sort {
if (\$a->title =~ /Manager/ and not \$b->title =~ /Manager/) {
-1;
}
elsif (\$b->title =~ /Manager/ and not \$a->title =~ /Manager/)
+{
1;
}
else {
\$a->name cmp \$b->name;
}
} @employees;

is_deeply(\@sorted, \@correct, 'Obvious approach');
}

SUBTLER_APPROACH: {

my @sorted = sort {
(\$a->title !~ /Manager/) - (\$b->title !~ /Manager/)
or \$a->name cmp \$b->name;
} @employees;

is_deeply(\@sorted, \@correct, 'Subtler approach');
}

FUNCTIONAL_APPROACH: {
use List::MoreUtils qw( part );

my @sorted =
map  { sort { \$a->name cmp \$b->name } @\$_ }
part { \$_->title !~ /Manager/ }
@employees;

is_deeply(\@sorted, \@correct, 'Functional approach');
}

done_testing();

Now as I've hinted; my gut feeling was that the final, functional approach should be faster. This is for two reasons:

• The regexp match happens fewer times. Our regexp is pretty simple, but imagine we also needed to classify Directors as Managers; and the Chief Executive too. The regexp could quickly become more complicated.
• The Perl sorting algorithm is O(N log N) which means that it's quicker to sort two lists of lengths a and b than it is to sort a single combined list of length a+b.

How much faster... well, I generated a random list of 200 employees with numeric names, and a random job title of either "Manager" or "Staff", and ran the results though Benchmark...

```use v5.14;
use strict;
use warnings;
use List::MoreUtils qw( part );
use Benchmark ':all';

package Person {
use Moo;
has name  => (is => 'ro');
has title => (is => 'ro');
}

my @employees = map {
Person::->new(
name  => int(rand(999999)),
title => int(rand(2))?'Staff':'Manager',
)
} 0..200;

sub obvious {
my @sorted = sort {
if (\$a->title =~ /Manager/ and not \$b->title =~ /Manager/) {
-1;
}
elsif (\$b->title =~ /Manager/ and not \$a->title =~ /Manager/)
+{
1;
}
else {
\$a->name cmp \$b->name;
}
} @employees;
}

sub subtle {
my @sorted = sort {
(\$a->title !~ /Manager/) - (\$b->title !~ /Manager/)
or \$a->name cmp \$b->name;
} @employees;
}

sub functional {
my @sorted =
map  { sort { \$a->name cmp \$b->name } @\$_ }
part { \$_->title !~ /Manager/ }
@employees;
}

cmpthese(1000, {
obvious    => \&obvious,
subtle     => \&subtle,
functional => \&functional,
});

The results were impresssive...

```            Rate    obvious     subtle functional
obvious    102/s         --       -26%       -70%
subtle     137/s        34%         --       -59%
functional 336/s       228%       144%         --
```

That's right; the functional approach shaves off 70% of the time taken to sort the list of employees!

So remember, for prioritised sorting, part is your friend!

UPDATE:, using Sort::Key's keysort in conjunction with the functional approach doubles its speed once again!

```my @sorted =
map  { keysort { \$_->name } @\$_ }
part { \$_->title !~ /Manager/ }
@employees;

keysort is an XS implementation of the Schwartzian transform.

perl -E'sub Monkey::do{say\$_,for@_,do{(\$monkey=[caller(0)]->[3])=~s{::}{ }and\$monkey}}"Monkey say"->Monkey::do'

Replies are listed 'Best First'.
Re: Unusual sorting requirements; comparing three implementations.
by salva (Abbot) on Oct 24, 2012 at 12:45 UTC
keysort is an XS implementation of the Schwartzian transform
Well, not really. It is more like this:
```sub sortkey (\&@) {
my \$calc_key = shift;
my @key = map \$calc_key->(), @_;
return @_[sort { \$key[\$a] cmp \$key[\$b] } 0..\$#_];
}
Re: Unusual sorting requirements; comparing three implementations.
by BrowserUk (Pope) on Oct 24, 2012 at 10:36 UTC

Ignore. Dumb mistake. Corrected:

``` ...

sub x {
my @sorted = map \$_->[1],
sort{ \$a->[0] cmp \$b->[0] } map[
\$_->title eq 'Manager' ? 'A'.\$_->name : 'B'.\$_->name, \$_
], @employees;
}

cmpthese( -1, {
obvious    => \&obvious,
subtle     => \&subtle,
functional => \&functional,
x          => \&x,
});

__END__
C:\test>junk87
Rate    obvious     subtle functional          x
obvious     185/s         --       -15%       -60%       -83%
subtle      218/s        18%         --       -54%       -80%
functional  469/s       153%       115%         --       -58%
x          1111/s       500%       411%       137%         --

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
```sub x {
my @sorted = map \$_->[1],
sort{ \$a->[0] cmp \$b->[0] } map[
\$_->title eq 'Manager' ? 'A'.\$_->name : 'B'.\$_->name, \$_
], @employees;
}

Faster, but not correct for the original problem. tobyink's Test::More script tested for ->title =~ /Manager/, not for ->title eq 'Manager', and so all those non-generic Manager employees, like the Sales Manager, the Finance Manager, and the Marketing Manager will be sorted as NON-Manager employees. tobyink's benchmark script wrongly reduces the possible titles to "Staff" and "Manager".

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Correcting that makes barely any difference to the performance:

```sub x {
my @sorted = map \$_->[1],
sort{ \$a->[0] cmp \$b->[0] } map[
\$_->title =~ /Manager/ ? 'A'.\$_->name : 'B'.\$_->name, \$_
], @employees;
}

cmpthese( -1, {
obvious    => \&obvious,
subtle     => \&subtle,
functional => \&functional,
x          => \&x,
});
C:\test>junk87
Rate    obvious     subtle functional          x
obvious     183/s         --       -18%       -60%       -83%
subtle      224/s        22%         --       -51%       -79%
functional  458/s       150%       105%         --       -57%
x          1070/s       483%       378%       134%         --

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Unusual sorting requirements; comparing three implementations.
by moritz (Cardinal) on Oct 24, 2012 at 17:11 UTC

You can exploit that fact that sort is stable, and do two passes over the data:

```sub two_pass {
my @sorted =
sort { \$b->title =~ /Manager/ <=> \$a->title =~ /Manager/ }
sort { \$a->name cmp \$b->name }
@employees;
}

Moo wouldn't install on my machine, so I used Mo instead for benchmarking:

```            Rate    obvious     subtle   two_pass functional
obvious    173/s         --       -22%       -30%       -61%
subtle     222/s        28%         --       -11%       -50%
two_pass   249/s        44%        12%         --       -44%
functional 444/s       156%       100%        79%         --

Of course it does more than necessary (splitting into managers and non-managers first is less work, because the lists to sort are shorter), but the speed still isn't too bad.

Update: and here's a Perl 6 solution:

```use v6;
class Person {
has \$.name;
has \$.title;
}

sub person(\$name, \$title) { Person.new(:\$name, :\$title) };
my @employees =
person("Eve", "Tech Support"),
person("Alice", "Sales Manager"),
person("Dave", "Janitor"),
person("Fred", "Receptionist"),
person("Bob", "Customer Services"),
person("Greg", "Finance Manager"),
person("Carol", "Marketing Manager"),
;
my @sorted =
@employees.classify({ .title ~~ /Manager/ ?? 'manager' !! 'staff'}
+)\
.sort.map(*.value.sort(*.name));

say .perl for @sorted;

Output:

```Person.new(name => "Alice", title => "Sales Manager")
Person.new(name => "Carol", title => "Marketing Manager")
Person.new(name => "Greg", title => "Finance Manager")
Person.new(name => "Bob", title => "Customer Services")
Person.new(name => "Dave", title => "Janitor")
Person.new(name => "Eve", title => "Tech Support")
Person.new(name => "Fred", title => "Receptionist")

.classify returns a Hash, and .sort on a Hash returns a list of Pair object, sorted by key and then by value. Since 'manager' lt 'staff', this puts all the managers first. Then each of the lists is sorted by name.

Re: Unusual sorting requirements; comparing three implementations.
by Anonymous Monk on Oct 24, 2012 at 23:00 UTC
Unless truly sorting bezillions of records, the naive function is something that looks self-evident and maintainable; none of the others do. That ought to count for something in the real world.

The keyword sort is a pretty big clue as to what the code does, and should be more than sufficient to inform the casual reader of the purpose of the code.

If the reader is less casual and needs to know how the sort operates -- perhaps because they wish to alter that ordering in some way -- then you'd have to hope they'd take a few minutes to understand that before making changes.

With factors up to 15x faster between the naive function and the more efficient ones, the idea that 5 minutes of a junior programmer's (or can't be bothered, senior programmer's) time is more valuable than that of the 10's (or 1000's or millions) of user's that will sit around waiting for the naive/lazy coder's efforts to finish, is a crock.

And it doesn't take "bezillions" of records for the delays to be significant. It only takes a few 10s of thousands of items to require the fastest algorithm tested here to take 1 second. The naive function would take over 15 seconds to do the same thing. And there are two detrimental aspects of that to consider:

• Firstly, if you are the user clicking on the order by price/popularity/availability on a shopping site, 1 second per click isn't going to have a detrimental effect on your psyche; but 15 seconds per click will.
• Secondly, it's not just time. It's also cpu.

Imagine Amazon or Google having to build 15 (or just 10 or 5) times as many of their mega data centres as they have now in order to deal with peak time loading.

Now think about the energy costs.

I don't know how many millions of times an hour Amazon or similar sites perform sorts; but I do know that if I was the boss I'd rather have those million sorts use 1/15th of the cpu/energy/time, even if that meant I had to pay for an extra 5 or 10 minutes of programmer time whenever a new programmer had to get to grips with how the efficient sort worked.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Unusual sorting requirements; comparing three implementations.
by sundialsvc4 (Abbot) on Oct 25, 2012 at 16:28 UTC

Perhaps this is mysteries that are beyond my ken, but reading through this thread I am quite puzzled by the number fifteen.   That is quite a remarkable difference, especially when the sort-algorithm does not change.   In the end, what is it about this algorithm that produces essentially a 2**4-fold improvement?   And, does this improvement hold in a linear fashion as the number of records grows (but stays within the confines of the available process working-set)?

It is, of course, quite clear how it works:   the list is sorted irrespective of name, then partitioned into manager vs. non-manager, then one sorted sublist is put in front of the other.   But fifteen is still a very counter-intuitive spread between the two ... I feel reluctant to shout “eureka!” based on just this.

Create A New User
Node Status?
node history
Node Type: perlmeditation [id://1000586]
Approved by Corion
Front-paged by Arunbear
help
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (4)
As of 2017-10-17 21:36 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
My fridge is mostly full of:

Results (238 votes). Check out past polls.

Notices?