Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options

Re^3: RFC: I rewrote a custom sort function, but not sure if it is the best way.

by roboticus (Chancellor)
on Mar 03, 2013 at 15:46 UTC ( #1021505=note: print w/replies, xml ) Need Help??

in reply to Re^2: RFC: I rewrote a custom sort function, but not sure if it is the best way.
in thread RFC: I rewrote a custom sort function, but not sure if it is the best way.


Regarding "my_sort_index": Oh, yeah, I totally forgot to write that.

Well, when writing comparison routines, you first want to determine what all the cases are that you want to handle, and what to do for each case. For ensuring that strings beginning with "index." sort to the top of the list, there are four cases:

  1. Neither string has the desired prefix: For this case, we simply return 0 and let the next bit of code choose the order.
  2. Only the first string has the desired prefix: Here, we return -1 to make the first string rise to the top of the list.
  3. Only the second string has the desired prefix: Here, we return +1 to make the second string rise to the top of the list.
  4. Both strings have the desired prefix: In this case, we don't really care which one is first. The first three cases ensure that the strings with the prefix float to the top. So we'll return 0 to let the next bit of code check the rest of the order.

So, here's my (untested!) version of the my_sort_index routine:

sub my_sort_index { my ($c, $d) = @_; # We want any item starting with "index." to sort to the top of # the list. The rest of the ordering is handled by the caller. # So there are four cases we must handle: # # Matches? # $c $d return # No No 0 # Yes No -1 # No Yes +1 # Yes Yes 0 (-1 + +1 == 0) # # In the fourth case, when both strings have the "index." prefix, # we want them to sort via the normal method so that "index.1" wil +l # always sort before "index.2". my $return_value = 0; --$return_value if $c =~ /^index\./; ++$return_value if $d =~ /^index\./; return $return_value; }

We could enforce the order in the fourth case, but it would make the subroutine a bit less useful. Why? It makes the subroutine harder to use with other subroutines for a sorting job, because it forces an order beyond just making sure that strings prefixed with "index." go first. In other words, if neither string began with index, it would use a different ordering than if both strings began with index. That can be problem.

Suppose you want all strings with "index." to sort to the top of the list, but you want everything sorted in reverse order beyond that, you'd have to have another version of your code. With the code we have now, though, we could do:

my @sorted = sort { my_index_sort($a,$b) || $b cmp $a } @list;

Regarding the Schwartzian transform: I wanted you to be aware of the technique, because it can let you do some pretty interesting things. When I use them, it's generally because I think to myself, "Oh, this would be *so* much easier to sort if everything looked like XYZ". Then I try to figure out an easy way to take what I have and turn it into XYZ.

In the case of sorting names, for example. As you know, to handle names well, you'll have a good bit of testing and comparison and such to arrange the name(s). and there are quite a few special cases to worry about. In this case, I'd think to myself. "Oh, this would be *so* much easier to sort if the names were split into first, middle and last name with the extra bits at the end. Then I could just sort them normally." Then I figure out how to do that.

sub convert_name_to_L_F_M_Extras { # Lots of ugly code to split the name into bits, determine which b +it goes where, and # return an array of [ LastName, FirstName, MiddleName, Extras ], +like so: # "Dr. John Smith" => [ Smith, John, undef, Dr. ] # "Eucalyptus A. Tree, Esquire" => [ Tree, Eucalyptus, A., Esquire + ] return [ $last, $first, $middle, $rest ]; } my @list = ( "Dr. John Smith", "Eucalyptus A. Tree, Esquire", "Robotic +us", "Lady Aleena, Baker of cookies" ); my @transformed_list = map { [ convert_name_to_L_F_M_Extras($_), $_ ], + } @list; # Now looks like: # @transformed_list = ( # [ [ "Smith", "John", undef, "Dr" ], "Dr. John Smith" +], # [ [ "Tree", "Eucalyptus", "A.", "Esquire" ], "Eucalyptus A. Tr +ee, Esquire" ], # ... # ); sub sort_by_name { # $a and $b contain an arrayref, the first entry has an arrayref w +ith last, first, etc. # Sort by last name: $a->[0][0] cmp $b->[0][0] # If they have the same last name, sort by first name: || $a->[0][1] cmp $b->[0][1] ... etc ... } my @sorted = sort {sort_by_name($a,$b)} @transformed_list; # Now throw away the transformed names and keep only the list of sorte +d names. @sorted = map { $_->[1] } @sorted;

Now, having said all that, I mainly mentioned it to help you shift your perspective on sorting. I don't know that you'll necessarily be able to use it, and the case I created above is my best guess of how it *might* be useful to you.


When your only tool is a hammer, all problems look like your thumb.

Replies are listed 'Best First'.
Re^4: RFC: I rewrote a custom sort function, but not sure if it is the best way.
by Lady_Aleena (Curate) on Mar 04, 2013 at 09:08 UTC

    roboticus, I decided to give the name transform a try, and so far it looks pretty good.

    #!/usr/bin/perl use strict; use warnings; my @roman_numerals = qw(I II III IV V VI VII VIII IX X); my $roman_numerals_string = join('|',@roman_numerals); sub name_transform { my $name = shift; # If a suffix is so long it needs a comman, let's get it here. my ($base_name,$pre_suffix) = split(', ',$name); # Split the rest of the name by the spaces. my @name = split(' ',$base_name); # Now check the first array item to see if it is a common prefix. # Some are there for fun. my $prefix = $name[0] =~ /(?:Lady|Lord|[MD][rs]|Mrs|Miss|Pres|Gov|Se +n|officer)(?:|\.)/ ? shift @name : ''; # Now check the last item of the array to see if it matches some com +mon suffixes. # More Roman numerals can be aded. my $suffix = $pre_suffix ? $pre_suffix : $name[-1] =~ /(?:Jr|Sr|Esq| +$roman_numerals_string)(?:|\.)/ ? pop @name : ''; # All which should be left is the bare name. Even if only the first +name is left, # it will be treated as the last name and maybe sorted accordingly. my $last_name = pop @name; my $first_name = shift @name // ''; # Every name left should be middle names. my $middle_name = @name ? join(' ',@name) : ''; return [$last_name,$first_name,$middle_name,$prefix,$suffix]; } local $\ = "\n"; my @names = ('President Barack Hussein Obama II','Mrs. Amanda King','D +r. Feelsgood', 'Miss America','Officer Andy','Henry VIII', "Dr. John Smi +th", "Eucalyptus A. Tree, Esquire","Roboticus", "Lady Aleena, +Baker of cookies", 'Aleena Zarahlinda ibn Robert al-Hajnal Chaoshi-Mnemosyni +od I'); for my $person (@names) { my $names = name_transform($person); print join('|',@{$names}); }


    Obama|Barack|Hussein|President|II King|Amanda||Mrs.| Feelsgood|||Dr.| America|||Miss| Andy|Officer||| Henry||||VIII Smith|John||Dr.| Tree|Eucalyptus|A.||Esquire Roboticus|||| Aleena|||Lady|Baker of cookies Chaoshi-Mnemosyniod|Aleena|Zarahlinda ibn Robert al-Hajnal||I

    What do you think?

    Have a cookie and a very nice day!
    Lady Aleena
      What about Chiang Kai-shek? See Chinese name for more fun.
      لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

        choroba, there are so many rules about names I do not think I can cover them all. I did not cover van/von either. Depending on the country one where the name is being filed, van Helsing could be filed under H (Netherlands and Suriname) or V (Belgium). Van is also a Vietnamese middle name. The Dutch have a lot of surname prefixes which could be added to the surname. I am sure there are other rules I have missed. There are a whole slew of nobiliary particles each with their own sort rules I imagine.

        Have a cookie and a very nice day!
        Lady Aleena


      It looks pretty good. I'm glad I was able to be helpful. Sometimes I suspect that I'm informative without being helpful. ;^)


      When your only tool is a hammer, all problems look like your thumb.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1021505]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (6)
As of 2018-06-21 18:37 GMT
Find Nodes?
    Voting Booth?
    Should cpanminus be part of the standard Perl release?

    Results (118 votes). Check out past polls.