Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw

Re^2: RFC: I rewrote a custom sort function, but not sure if it is the best way.

by Lady_Aleena (Curate)
on Mar 03, 2013 at 07:47 UTC ( #1021487=note: print w/replies, xml ) Need Help??

in reply to Re: RFC: I rewrote a custom sort function, but not sure if it is the best way.
in thread RFC: I rewrote a custom sort function, but not sure if it is the best way.

This may be disjointed, I am a little tired after the reading and all the other things I did today. With what I have so far, I have 47 instances of my_sort to alter.

Thank you roboticus for you really long explanation of where I went wrong. I made most of the changes you recommended with a few exceptions.

  • I did not use my_sort_index since I do not know what you had in mind for it. However, I want it in both sort functions since I have directories with files I want sorted by name with their parent index.(pl|shtml|html) file in the mix.
  • I can not figure out how to get the Schwartzian transform into the functions. Where do they go in my_article_sort and my_name_sort? Are they a whole other function?

davido, I looked at Unicode::Collate. I can not see how it would work with hash values. The writer only uses arrays in the examples. The way I use the sort functions now is as follows.

for my $movie (sort { my_article_sort($a->{title},$b->{title}) values +%movies} { ... } for my $alpha (sort { my_article_sort($a,$b) } keys %alpha_movies) { . +.. } for my $character (sort { my_name_sort($a->{name},$b->{name}) values % +player_characters} { ... } for my $color (sort { my_article_sort($a->{name},$b->{name}) values %c +olors} { ... } my @files = (map("$data_dir$_",grep(/txt$/,sort { my_article_sort($a,$ +b) } readdir($directory))));

I looked at both the transform links but got really lost. As stated above, I do not know how they can be added to the two sort functions. Also, I still have problems with map BLOCK (and grep BLOCK). After about six or eight tries to get a map to work, I usually end up doing a for loop to get the desired result whether it be a concatenation, some regex, or math. It has been a long time since I wrote a map, and at the time I had to have someone holding my hand while I was doing it. The map/grep above is about the best I can do at the moment.

7stud, I have begun work on putting the dispatch table into a function. I am not sure of my approach just yet. At least the functions are no longer longer than the screen height. Oh, I looked at Sort::Maker. I do not understand it one bit.

The following is where I am at the moment. (Please ignore short_sorts, it is still a mess. I hardly ever use it.)

package Base::Sorts; use strict; use warnings FATAL => qw( all ); use base 'Exporter'; our @EXPORT_OK = qw(my_article_sort my_name_sort short_sorts); use Carp qw(croak); use HTML::Entities qw(decode_entities); sub split_out_leading_number { my $s = shift; if ( $s =~/^((\d|,|\.)+)(.*)$/) { my ($leading_number, $rest) = ($1,$3); # Take any commas out of the number. $leading_number =~ s/,//g; return ($leading_number, $rest); } die "split_out_leading_number received bogus input '$s'!\n"; } sub my_article_sort { my ($c,$d) = @_; $c = lc($c); $d = lc($d); # When sorting lists of files, I want the index file to always come +first. if ($c =~ /^index\./) { return -1; } elsif ($d =~ /^index\./) { return 1; } else { # This is the default sorting method. # Written with the help of kent/n in #perl on freenode. for ($c, $d) { s/<.+?>//g; # Strip out any html tags. s/\s*\b(A|a|An|an|The|the)(_|\s)//xi; # Strip off leading articl +es (in English). decode_entities($_); } if ( $c =~/^((\d|,|\.)+)(.*)$/ && $d =~ /^((\d|,|\.)+)(.*)$/) { my ($num1, $text1) = split_out_leading_number($c); my ($num2, $text2) = split_out_leading_number($d); # First compare the numbers, then compare the remaining parts of + the string. $num1 <=> $num2 || $text1 cmp $text2 } else { $c cmp $d; } } } sub my_name_sort { my ($c,$d) = @_; $c = lc($c); $d = lc($d); # When sorting lists of files, I want the index file to always come +first. # There may be an index file in a folder of files I want sorted by n +ame. if ($c =~ /^index\./) { return -1; } elsif ($d =~ /^index\./) { return 1; } else { # When I sort by name I prefer lastname firstname. # I have not yet written this to account for Sr., Jr., or Roman nu +merals after the last name. for ($c,$d) { s/<.+?>//g; # Strip out any html tags. s/\|.+$//; $_ = join(' ', (reverse split(/(?:_|\s)(?=[^_\s]+$)/, $_,2))) if + $_ !~ /^_/; s/^_//; decode_entities($_); } return $c cmp $d; } } sub short_sorts { my ($a,$b,$type) = @_; # Legend: # s = case sensitive # i = case insensitive # a = ascending # d = descending # r = reverse (right to left) # n = numbers # l = length of value my %sorts; $sorts{$_} = sub { $_[0] cmp $_[1] } for qw(sa as); $sorts{$_} = sub { $_[1] cmp $_[0] } for qw(sd ds); $sorts{$_} = sub { uc $_[0] cmp uc $_[1] } for qw(ia ai); $sorts{$_} = sub { uc $_[1] cmp uc $_[0] } for qw(id di); $sorts{$_} = sub { $_[0] <=> $_[1] } for qw(na an); $sorts{$_} = sub { $_[1] <=> $_[0] } for qw(nd dn); $sorts{$_} = sub { reverse($_[0]) cmp reverse($_[1]) } for qw(sar sr +a asr ars rsa ras); $sorts{$_} = sub { reverse($_[1]) cmp reverse($_[0]) } for qw(sdr sr +d dsr drs rsd rds); $sorts{$_} = sub { uc reverse($_[0]) cmp uc reverse($_[1]) } for qw( +iar ira air ari ria rai); $sorts{$_} = sub { uc reverse($_[1]) cmp uc reverse($_[0]) } for qw( +idr ird dir dri rid rdi); $sorts{$_} = sub { reverse $_[0] <=> reverse $_[1] } for qw(nar nra +anr arn rna ran); $sorts{$_} = sub { reverse $_[1] <=> reverse $_[0] } for qw(ndr nrd +dnr drn rnd rdn); $sorts{$_} = sub { length($_[0]) <=> length($_[1]) } for qw(la al); $sorts{$_} = sub { length($_[1]) <=> length($_[0]) } for qw(ld dl); if ($type) { croak "$type is not supported" if !exists $sorts{$type}; return $sorts{$type}->($a,$b); } else { die "A sort type was not selected."; } }
Have a cookie and a very nice day!
Lady Aleena

Replies are listed 'Best First'.
Re^3: RFC: I rewrote a custom sort function, but not sure if it is the best way.
by roboticus (Chancellor) on Mar 03, 2013 at 15:46 UTC


    Regarding "my_sort_index": Oh, yeah, I totally forgot to write that.

    Well, when writing comparison routines, you first want to determine what all the cases are that you want to handle, and what to do for each case. For ensuring that strings beginning with "index." sort to the top of the list, there are four cases:

    1. Neither string has the desired prefix: For this case, we simply return 0 and let the next bit of code choose the order.
    2. Only the first string has the desired prefix: Here, we return -1 to make the first string rise to the top of the list.
    3. Only the second string has the desired prefix: Here, we return +1 to make the second string rise to the top of the list.
    4. Both strings have the desired prefix: In this case, we don't really care which one is first. The first three cases ensure that the strings with the prefix float to the top. So we'll return 0 to let the next bit of code check the rest of the order.

    So, here's my (untested!) version of the my_sort_index routine:

    sub my_sort_index { my ($c, $d) = @_; # We want any item starting with "index." to sort to the top of # the list. The rest of the ordering is handled by the caller. # So there are four cases we must handle: # # Matches? # $c $d return # No No 0 # Yes No -1 # No Yes +1 # Yes Yes 0 (-1 + +1 == 0) # # In the fourth case, when both strings have the "index." prefix, # we want them to sort via the normal method so that "index.1" wil +l # always sort before "index.2". my $return_value = 0; --$return_value if $c =~ /^index\./; ++$return_value if $d =~ /^index\./; return $return_value; }

    We could enforce the order in the fourth case, but it would make the subroutine a bit less useful. Why? It makes the subroutine harder to use with other subroutines for a sorting job, because it forces an order beyond just making sure that strings prefixed with "index." go first. In other words, if neither string began with index, it would use a different ordering than if both strings began with index. That can be problem.

    Suppose you want all strings with "index." to sort to the top of the list, but you want everything sorted in reverse order beyond that, you'd have to have another version of your code. With the code we have now, though, we could do:

    my @sorted = sort { my_index_sort($a,$b) || $b cmp $a } @list;

    Regarding the Schwartzian transform: I wanted you to be aware of the technique, because it can let you do some pretty interesting things. When I use them, it's generally because I think to myself, "Oh, this would be *so* much easier to sort if everything looked like XYZ". Then I try to figure out an easy way to take what I have and turn it into XYZ.

    In the case of sorting names, for example. As you know, to handle names well, you'll have a good bit of testing and comparison and such to arrange the name(s). and there are quite a few special cases to worry about. In this case, I'd think to myself. "Oh, this would be *so* much easier to sort if the names were split into first, middle and last name with the extra bits at the end. Then I could just sort them normally." Then I figure out how to do that.

    sub convert_name_to_L_F_M_Extras { # Lots of ugly code to split the name into bits, determine which b +it goes where, and # return an array of [ LastName, FirstName, MiddleName, Extras ], +like so: # "Dr. John Smith" => [ Smith, John, undef, Dr. ] # "Eucalyptus A. Tree, Esquire" => [ Tree, Eucalyptus, A., Esquire + ] return [ $last, $first, $middle, $rest ]; } my @list = ( "Dr. John Smith", "Eucalyptus A. Tree, Esquire", "Robotic +us", "Lady Aleena, Baker of cookies" ); my @transformed_list = map { [ convert_name_to_L_F_M_Extras($_), $_ ], + } @list; # Now looks like: # @transformed_list = ( # [ [ "Smith", "John", undef, "Dr" ], "Dr. John Smith" +], # [ [ "Tree", "Eucalyptus", "A.", "Esquire" ], "Eucalyptus A. Tr +ee, Esquire" ], # ... # ); sub sort_by_name { # $a and $b contain an arrayref, the first entry has an arrayref w +ith last, first, etc. # Sort by last name: $a->[0][0] cmp $b->[0][0] # If they have the same last name, sort by first name: || $a->[0][1] cmp $b->[0][1] ... etc ... } my @sorted = sort {sort_by_name($a,$b)} @transformed_list; # Now throw away the transformed names and keep only the list of sorte +d names. @sorted = map { $_->[1] } @sorted;

    Now, having said all that, I mainly mentioned it to help you shift your perspective on sorting. I don't know that you'll necessarily be able to use it, and the case I created above is my best guess of how it *might* be useful to you.


    When your only tool is a hammer, all problems look like your thumb.

      roboticus, I decided to give the name transform a try, and so far it looks pretty good.

      #!/usr/bin/perl use strict; use warnings; my @roman_numerals = qw(I II III IV V VI VII VIII IX X); my $roman_numerals_string = join('|',@roman_numerals); sub name_transform { my $name = shift; # If a suffix is so long it needs a comman, let's get it here. my ($base_name,$pre_suffix) = split(', ',$name); # Split the rest of the name by the spaces. my @name = split(' ',$base_name); # Now check the first array item to see if it is a common prefix. # Some are there for fun. my $prefix = $name[0] =~ /(?:Lady|Lord|[MD][rs]|Mrs|Miss|Pres|Gov|Se +n|officer)(?:|\.)/ ? shift @name : ''; # Now check the last item of the array to see if it matches some com +mon suffixes. # More Roman numerals can be aded. my $suffix = $pre_suffix ? $pre_suffix : $name[-1] =~ /(?:Jr|Sr|Esq| +$roman_numerals_string)(?:|\.)/ ? pop @name : ''; # All which should be left is the bare name. Even if only the first +name is left, # it will be treated as the last name and maybe sorted accordingly. my $last_name = pop @name; my $first_name = shift @name // ''; # Every name left should be middle names. my $middle_name = @name ? join(' ',@name) : ''; return [$last_name,$first_name,$middle_name,$prefix,$suffix]; } local $\ = "\n"; my @names = ('President Barack Hussein Obama II','Mrs. Amanda King','D +r. Feelsgood', 'Miss America','Officer Andy','Henry VIII', "Dr. John Smi +th", "Eucalyptus A. Tree, Esquire","Roboticus", "Lady Aleena, +Baker of cookies", 'Aleena Zarahlinda ibn Robert al-Hajnal Chaoshi-Mnemosyni +od I'); for my $person (@names) { my $names = name_transform($person); print join('|',@{$names}); }


      Obama|Barack|Hussein|President|II King|Amanda||Mrs.| Feelsgood|||Dr.| America|||Miss| Andy|Officer||| Henry||||VIII Smith|John||Dr.| Tree|Eucalyptus|A.||Esquire Roboticus|||| Aleena|||Lady|Baker of cookies Chaoshi-Mnemosyniod|Aleena|Zarahlinda ibn Robert al-Hajnal||I

      What do you think?

      Have a cookie and a very nice day!
      Lady Aleena
        What about Chiang Kai-shek? See Chinese name for more fun.
        لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ


        It looks pretty good. I'm glad I was able to be helpful. Sometimes I suspect that I'm informative without being helpful. ;^)


        When your only tool is a hammer, all problems look like your thumb.

Re^3: RFC: I rewrote a custom sort function, but not sure if it is the best way.
by flexvault (Monsignor) on Mar 03, 2013 at 12:27 UTC


      ...I looked at Unicode::Collate. I can not see how it would work with hash values. The writer only uses arrays in the examples...

    I only used 'Unicode::Collate' once, but it changes the sort order of the ASCII characters as well.

    Do you have a sample input file and the required output file. You seem to be doing a lot of work and then using the Perl 'sort' anyway.

    Good Luck...Ed

    "Well done is better than well said." - Benjamin Franklin

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1021487]
[LanX]: I'm not calling for shooting Trump with a golf ball ;-)
[LanX]: this bullshit theater is distracting us from real issues
[choroba]: he will resign
LanX "I put lipstick on a pig,"
[erix]: it does seem to get out of hand -- but then again, it looks like that for months already :)

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (7)
As of 2017-08-18 13:04 GMT
Find Nodes?
    Voting Booth?
    Who is your favorite scientist and why?

    Results (301 votes). Check out past polls.