Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

Re: RFC: I rewrote a custom sort function, but not sure if it is the best way.

by roboticus (Canon)
on Mar 03, 2013 at 02:13 UTC ( #1021481=note: print w/ replies, xml ) Need Help??


in reply to RFC: I rewrote a custom sort function, but not sure if it is the best way.

O provider of cookies:

I have a few notes for you, but I'm going to apologize in advance--I don't have enough time to shorten this and clean it up nicely. So this is one of my occasional hideously long nodes, and is a bit rambly, and I'll leave out a few bits that I meant to say. I'll make use of the readonly tag, for those who don't want to see it all. With that disclaimer, here goes...

  • Your code to move index files to the top can fail because your comparison function isn't "stable".

    Near the end of perldoc -f sort, they have the sentence: A stable sort preserves the input order of elements that compare equal. The reason it's important is that many sorting algorithms don't check the relationship of all items, because if you provide a stable comparison function, it can greatly reduce the number of comparisons it has to make. But if you don't provide a stable comparison function, it'll put things in an incorrect order. The problem is this:

    # When sorting lists of files, I want the index file to always come fi +rst. if ($c =~ /^index\./) { return -1; } elsif ($d =~ /^index\./) { return 1; }

    Below are the results of a few calls to your subroutine. As you can see, if the first argument matches /^index\./, then it tells the sort routine to make that item precede the second item, no matter what the second item is!.

    $ perl PM1021473_3_unstable.pl index.0001 afile article -1 index.0001 index.0005 article -1 index.9999 index.0005 article -1

    The code:

    #!/usr/bin/perl # # PM1021473_3_unstable.pl # use strict; use warnings; use HTML::Entities; while (<DATA>) { s/\s+$//; my @args = split /\s+/,$_; push @args, my_sort(@args); printf "%-10s %-10s %-10s %d\n", @args; } ... your sort routine ... __DATA__ index.0001 afile article index.0001 index.0005 article index.9999 index.0005 article
  • The next thing I'd like to mention is that you're doing a good deal of work to make a single sort routine, with a parameter to tell it what to do. I'd suggest splitting your sort routine into smaller pieces.

    At the very least, I'd split it in two chunks:

    sub my_sort_article { my ($c,$d,$type) = @_; $c = lc($c); $d = lc($d); my $t = my_sort_index($c, $d); return $t if $t; # Written with the help of kent/n in #perl on freenode. for ($c, $d) { s/<.+?>//g; # Strip out any html tags. s/\s*\b(A|a|An|an|The|the)(_|\s)//xi; # Strip off leading articles + (in English). decode_entities($_); } if ( $c =~/^((\d|,|\.)+)(.*)$/ && $d =~ /^((\d|,|\.)+)(.*)$/) { # Get the leading number. (my $e = $c) =~ s/^((\d|,|\.)+)(.*)$/$1/; (my $f = $d) =~ s/^((\d|,|\.)+)(.*)$/$1/; # Take any commas out of the number. s/,//g for ($e,$f); # Get the remaining parts of the string. (my $g = $c) =~ s/^((\d|,|\.)+)(.*)$/$3/; (my $h = $d) =~ s/^((\d|,|\.)+)(.*)$/$3/; # First compare the numbers, then compare the remaining parts of t +he string. $e <=> $f || $g cmp $h } else { $c cmp $d; } } sub my_sort_name { # When I sort by name I prefer lastname firstname. # I have not yet written this to account for Sr., Jr., or Roman nume +rals after the last name. my ($c,$d,$type) = @_; $c = lc($c); $d = lc($d); # ?Did you really want this check for name sorting? my $t = my_sort_index($c, $d); return $t if $t; for ($c,$d) { s/\|.+$//; $_ = join(' ', (reverse split(/(?:_|\s)(?=[^_\s]+$)/, $_,2))) if $ +_ !~ /^_/; s/^_//; s/^(A|a|An|an|The|the)(_|\s)//; } return $c cmp $d; }

    It's no harder to call this way, and updating one won't break the other.

    In fact you can go a good deal further. Usually it's best to build with smaller subroutines. In fact, you could easily pull other interesting functions out that you could re-use. By breaking them into small pieces, and giving them good names, it makes your code easier to read and maintain in addition to giving you more bits of useful code for your toolbox.

    For example, splitting out the leading number and tuning it up could be a great little routine you could use in plenty of locations:

    sub split_out_leading_number { my $s = shift; if ( $s =~/^((\d|,|\.)+)(.*)$/) { my ($leading_number, $rest) = ($1,$3); # Take any commas out of the number. $leading_number =~ s/,//g; return ($leading_number, $rest); } die "split_out_leading_number received bogus input '$s'!\n"; }

    And once you have it in your toolbox, writing code using it is a little easier to read. Just take a look:

    sub my_sort_article { my ($c,$d,$type) = @_; $c = lc($c); $d = lc($d); my $t = my_sort_index($c, $d); return $t if $t; # Written with the help of kent/n in #perl on freenode. for ($c, $d) { s/<.+?>//g; # Strip out any html tags. s/\s*\b(A|a|An|an|The|the)(_|\s)//xi; # Strip off leading articles + (in English). decode_entities($_); } if ( $c =~/^((\d|,|\.)+)(.*)$/ && $d =~ /^((\d|,|\.)+)(.*)$/) { my ($num1, $text1) = split_out_leading_number($c); my ($num2, $text2) = split_out_leading_number($d); # First compare the numbers, then compare the remaining parts of t +he string. $num1 <=> $num2 || $text1 cmp $text2 } else { $c cmp $d; } }

    But you want to be careful with your naming. I don't like the name I came up with, but haven't thought of anything better yet. With good naming, you can remove many comments from your code, as it can become more self-documenting.

  • Next, your subroutine does a *lot* of manipulation, and since it will be called multiple times for every string in the collection, you could be doing a lot more work than necessary. For small lists, that's no big deal. But it can add up quickly. There's a technique called the Schwartzian transform that can greatly reduce data manipulation during sorting.

    For a quicksort, each time you double the size of your list, you have to transform *each* item on your list another time. It can grow fairly quickly:

    List sizeTimes each item
    is manipulated
    Total manipulations
    2 1 2
    4 2 4*2=8
    8 3 8*3=24
    16 4 16*4=64
    . . . . . . . . .
    1024 10 1024*10=10,240
    . . . . . . . . .
    1048576 201048576*20=20,971,520
    Nlog(N)/log(2)N*log(N)/log(2)

    In the Schwartzian transform, you simply build a *new* list that's much easier to sort, transforming all your data items once. Then you sort the new list. After the sort, you pull out the original data items to get your sorted list.

    Suppose that you had a large list of words in mixed case, and wanted to sort them alphabetically, irrespective of case. But you want your resulting list to retain the original case of the words. You could do it like:

    my @list = ( qw( apple Alpha aLBAtross aLbAcOre etc... ) ); my @result = sort {uc($a) cmp uc($b)} @original;

    but it would uppercase all items repeatedly. If your list was large enough that could slow your application down. But suppose your list was more like:

    my @list = ( [qw( APPLE apple )], [qw( ALPHA alpha )], [qw( ALBATROSS aLBAtross )], [qw( ALBACORE aLbAcOrE)], etc... );

    That would be much easier to sort: Just sort the array in the order of the first column:

    my @result = sort { $a->[0] cmp $b->[0] } @list;

    And then your list would look like this:

    my @list = ( [qw( ALBACORE aLbAcOrE)], [qw( ALBATROSS aLBAtross )], [qw( ALPHA alpha )], [qw( APPLE apple )], etc... );

    Retrieving the sorted data is simple enough: It's the second column of the array. Reading a discussion of the Schwartzian transform was one of the things that got me hooked on perl. (In fact, it was this article: Resorting to Sorting. It led me to realize how flexible perl can be. It can make sort work of simple list operations. For example, you can use the map function (perldoc -f map) to take your original list and reshape it. So to do our sort we could do it like this:

    my @list = ( qw( apple Alpha aLBAtross aLbAcOre etc... ) ); # Convert our list into the ( [APPLE apple], ... ) form with map: @result = map { [ uc($_), $_ ] } @list; # Now sort the list: @result = sort {$a->[0] cmp $b->[0]} @result; # Now convert the resulting list back into (aLbAcOrE aLBAtross...) for +mat: @result = map { $_->[1] } @result;

    This is kind of wordy, though. Perl lets us pipeline the operations, though, so we can do it in a single statement:

    my @list = ( qw( apple Alpha aLBAtross aLbAcOre etc... ) ); my @result = map {$_->[1] } # .sgnirts desacreppu eht + lla tou pirts # sort {$a->[0] cmp $b->[0]} # neht dna ,sgnirts desacrep +pu eht no tros # map { [ uc($_), $_ ] } # ,gnirts eht fo noisrev desa +creppu eht dda # @list; # + ,tsil a neviG #

    It takes a little getting used to, but just remember that the list operations (sort, map, grep) all take a list on the right side as input, and output a new list on the left side. So when you chain them together, just remember to read them from right to left. ;^D

    If you wanted to, you could even throw a grep function in there to only retain words fulfilling specific criteria. You might only want words longer than four characters for example.

Update: When I started writing this node, there were no responses. But now davido already hit all the the high points. On the bright side, now I have a little reading material for tomorrow. I've not heard of the Guttman-Rosler transform. (Update 2: Well, Guttman-Rosler is described in the node I referenced, so I'm guessing I just forgot about it....)

Update: s/albatrioss/albatross/ig; fixed a broken document link. (update very late: 20140811)

...roboticus

When your only tool is a hammer, all problems look like your thumb.


Comment on Re: RFC: I rewrote a custom sort function, but not sure if it is the best way.
Select or Download Code
Re^2: RFC: I rewrote a custom sort function, but not sure if it is the best way.
by Lady_Aleena (Chaplain) on Mar 03, 2013 at 07:47 UTC

    This may be disjointed, I am a little tired after the reading and all the other things I did today. With what I have so far, I have 47 instances of my_sort to alter.

    Thank you roboticus for you really long explanation of where I went wrong. I made most of the changes you recommended with a few exceptions.

    • I did not use my_sort_index since I do not know what you had in mind for it. However, I want it in both sort functions since I have directories with files I want sorted by name with their parent index.(pl|shtml|html) file in the mix.
    • I can not figure out how to get the Schwartzian transform into the functions. Where do they go in my_article_sort and my_name_sort? Are they a whole other function?

    davido, I looked at Unicode::Collate. I can not see how it would work with hash values. The writer only uses arrays in the examples. The way I use the sort functions now is as follows.

    for my $movie (sort { my_article_sort($a->{title},$b->{title}) values +%movies} { ... } for my $alpha (sort { my_article_sort($a,$b) } keys %alpha_movies) { . +.. } for my $character (sort { my_name_sort($a->{name},$b->{name}) values % +player_characters} { ... } for my $color (sort { my_article_sort($a->{name},$b->{name}) values %c +olors} { ... } my @files = (map("$data_dir$_",grep(/txt$/,sort { my_article_sort($a,$ +b) } readdir($directory))));

    I looked at both the transform links but got really lost. As stated above, I do not know how they can be added to the two sort functions. Also, I still have problems with map BLOCK (and grep BLOCK). After about six or eight tries to get a map to work, I usually end up doing a for loop to get the desired result whether it be a concatenation, some regex, or math. It has been a long time since I wrote a map, and at the time I had to have someone holding my hand while I was doing it. The map/grep above is about the best I can do at the moment.

    7stud, I have begun work on putting the dispatch table into a function. I am not sure of my approach just yet. At least the functions are no longer longer than the screen height. Oh, I looked at Sort::Maker. I do not understand it one bit.

    The following is where I am at the moment. (Please ignore short_sorts, it is still a mess. I hardly ever use it.)

    Have a cookie and a very nice day!
    Lady Aleena

      Lady_Aleena,

        ...I looked at Unicode::Collate. I can not see how it would work with hash values. The writer only uses arrays in the examples...

      I only used 'Unicode::Collate' once, but it changes the sort order of the ASCII characters as well.

      Do you have a sample input file and the required output file. You seem to be doing a lot of work and then using the Perl 'sort' anyway.

      Good Luck...Ed

      "Well done is better than well said." - Benjamin Franklin

      Lady_Aleena

      Regarding "my_sort_index": Oh, yeah, I totally forgot to write that.

      Well, when writing comparison routines, you first want to determine what all the cases are that you want to handle, and what to do for each case. For ensuring that strings beginning with "index." sort to the top of the list, there are four cases:

      1. Neither string has the desired prefix: For this case, we simply return 0 and let the next bit of code choose the order.
      2. Only the first string has the desired prefix: Here, we return -1 to make the first string rise to the top of the list.
      3. Only the second string has the desired prefix: Here, we return +1 to make the second string rise to the top of the list.
      4. Both strings have the desired prefix: In this case, we don't really care which one is first. The first three cases ensure that the strings with the prefix float to the top. So we'll return 0 to let the next bit of code check the rest of the order.

      So, here's my (untested!) version of the my_sort_index routine:

      sub my_sort_index { my ($c, $d) = @_; # We want any item starting with "index." to sort to the top of # the list. The rest of the ordering is handled by the caller. # So there are four cases we must handle: # # Matches? # $c $d return # No No 0 # Yes No -1 # No Yes +1 # Yes Yes 0 (-1 + +1 == 0) # # In the fourth case, when both strings have the "index." prefix, # we want them to sort via the normal method so that "index.1" wil +l # always sort before "index.2". my $return_value = 0; --$return_value if $c =~ /^index\./; ++$return_value if $d =~ /^index\./; return $return_value; }

      We could enforce the order in the fourth case, but it would make the subroutine a bit less useful. Why? It makes the subroutine harder to use with other subroutines for a sorting job, because it forces an order beyond just making sure that strings prefixed with "index." go first. In other words, if neither string began with index, it would use a different ordering than if both strings began with index. That can be problem.

      Suppose you want all strings with "index." to sort to the top of the list, but you want everything sorted in reverse order beyond that, you'd have to have another version of your code. With the code we have now, though, we could do:

      my @sorted = sort { my_index_sort($a,$b) || $b cmp $a } @list;

      Regarding the Schwartzian transform: I wanted you to be aware of the technique, because it can let you do some pretty interesting things. When I use them, it's generally because I think to myself, "Oh, this would be *so* much easier to sort if everything looked like XYZ". Then I try to figure out an easy way to take what I have and turn it into XYZ.

      In the case of sorting names, for example. As you know, to handle names well, you'll have a good bit of testing and comparison and such to arrange the name(s). and there are quite a few special cases to worry about. In this case, I'd think to myself. "Oh, this would be *so* much easier to sort if the names were split into first, middle and last name with the extra bits at the end. Then I could just sort them normally." Then I figure out how to do that.

      sub convert_name_to_L_F_M_Extras { # Lots of ugly code to split the name into bits, determine which b +it goes where, and # return an array of [ LastName, FirstName, MiddleName, Extras ], +like so: # "Dr. John Smith" => [ Smith, John, undef, Dr. ] # "Eucalyptus A. Tree, Esquire" => [ Tree, Eucalyptus, A., Esquire + ] return [ $last, $first, $middle, $rest ]; } my @list = ( "Dr. John Smith", "Eucalyptus A. Tree, Esquire", "Robotic +us", "Lady Aleena, Baker of cookies" ); my @transformed_list = map { [ convert_name_to_L_F_M_Extras($_), $_ ], + } @list; # Now looks like: # @transformed_list = ( # [ [ "Smith", "John", undef, "Dr" ], "Dr. John Smith" +], # [ [ "Tree", "Eucalyptus", "A.", "Esquire" ], "Eucalyptus A. Tr +ee, Esquire" ], # ... # ); sub sort_by_name { # $a and $b contain an arrayref, the first entry has an arrayref w +ith last, first, etc. # Sort by last name: $a->[0][0] cmp $b->[0][0] # If they have the same last name, sort by first name: || $a->[0][1] cmp $b->[0][1] ... etc ... } my @sorted = sort {sort_by_name($a,$b)} @transformed_list; # Now throw away the transformed names and keep only the list of sorte +d names. @sorted = map { $_->[1] } @sorted;

      Now, having said all that, I mainly mentioned it to help you shift your perspective on sorting. I don't know that you'll necessarily be able to use it, and the case I created above is my best guess of how it *might* be useful to you.

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.

        roboticus, I decided to give the name transform a try, and so far it looks pretty good.

        #!/usr/bin/perl use strict; use warnings; my @roman_numerals = qw(I II III IV V VI VII VIII IX X); my $roman_numerals_string = join('|',@roman_numerals); sub name_transform { my $name = shift; # If a suffix is so long it needs a comman, let's get it here. my ($base_name,$pre_suffix) = split(', ',$name); # Split the rest of the name by the spaces. my @name = split(' ',$base_name); # Now check the first array item to see if it is a common prefix. # Some are there for fun. my $prefix = $name[0] =~ /(?:Lady|Lord|[MD][rs]|Mrs|Miss|Pres|Gov|Se +n|officer)(?:|\.)/ ? shift @name : ''; # Now check the last item of the array to see if it matches some com +mon suffixes. # More Roman numerals can be aded. my $suffix = $pre_suffix ? $pre_suffix : $name[-1] =~ /(?:Jr|Sr|Esq| +$roman_numerals_string)(?:|\.)/ ? pop @name : ''; # All which should be left is the bare name. Even if only the first +name is left, # it will be treated as the last name and maybe sorted accordingly. my $last_name = pop @name; my $first_name = shift @name // ''; # Every name left should be middle names. my $middle_name = @name ? join(' ',@name) : ''; return [$last_name,$first_name,$middle_name,$prefix,$suffix]; } local $\ = "\n"; my @names = ('President Barack Hussein Obama II','Mrs. Amanda King','D +r. Feelsgood', 'Miss America','Officer Andy','Henry VIII', "Dr. John Smi +th", "Eucalyptus A. Tree, Esquire","Roboticus", "Lady Aleena, +Baker of cookies", 'Aleena Zarahlinda ibn Robert al-Hajnal Chaoshi-Mnemosyni +od I'); for my $person (@names) { my $names = name_transform($person); print join('|',@{$names}); }

        returns

        Obama|Barack|Hussein|President|II King|Amanda||Mrs.| Feelsgood|||Dr.| America|||Miss| Andy|Officer||| Henry||||VIII Smith|John||Dr.| Tree|Eucalyptus|A.||Esquire Roboticus|||| Aleena|||Lady|Baker of cookies Chaoshi-Mnemosyniod|Aleena|Zarahlinda ibn Robert al-Hajnal||I

        What do you think?

        Have a cookie and a very nice day!
        Lady Aleena

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1021481]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (5)
As of 2014-09-20 23:47 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (164 votes), past polls