Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

RFC: I rewrote a custom sort function, but not sure if it is the best way.

by Lady_Aleena (Chaplain)
on Mar 02, 2013 at 23:54 UTC ( #1021473=perlquestion: print w/ replies, xml ) Need Help??
Lady_Aleena has asked for the wisdom of the Perl Monks concerning the following question:

I rewrote my custom sort function, but I am not sure I did it the best way possible. A few sets of fresh eyeballs might see something obvious where I may have overcomplicated the function. The section I rewrote is in the the first if in the first else. If you make it as far as the elsif for sorting by name, I could use some pointers of how to account for name suffixes like Jr. or III.

Thank you for looking.

sub my_sort { my ($c,$d,$type) = @_; $c = lc($c); $d = lc($d); # When sorting lists of files, I want the index file to always come +first. if ($c =~ /^index\./) { return -1; } elsif ($d =~ /^index\./) { return 1; } else { if (!$type || $type =~ /article/) { # This is the default sorting method. # Written with the help of kent/n in #perl on freenode. for ($c, $d) { s/<.+?>//g; # Strip out any html tags. s/\s*\b(A|a|An|an|The|the)(_|\s)//xi; # Strip off leading arti +cles (in English). decode_entities($_); } if ( $c =~/^((\d|,|\.)+)(.*)$/ && $d =~ /^((\d|,|\.)+)(.*)$/) { # Get the leading number. (my $e = $c) =~ s/^((\d|,|\.)+)(.*)$/$1/; (my $f = $d) =~ s/^((\d|,|\.)+)(.*)$/$1/; # Take any commas out of the number. s/,//g for ($e,$f); # Get the remaining parts of the string. (my $g = $c) =~ s/^((\d|,|\.)+)(.*)$/$3/; (my $h = $d) =~ s/^((\d|,|\.)+)(.*)$/$3/; # First compare the numbers, then compare the remaining parts +of the string. $e <=> $f || $g cmp $h } else { $c cmp $d; } } elsif ($type =~ /name/) { # When I sort by name I prefer lastname firstname. # I have not yet written this to account for Sr., Jr., or Roman +numerals after the last name. for ($c,$d) { s/\|.+$//; $_ = join(' ', (reverse split(/(?:_|\s)(?=[^_\s]+$)/, $_,2))) +if $_ !~ /^_/; s/^_//; s/^(A|a|An|an|The|the)(_|\s)//; } return $c cmp $d; } else { die(qq($type is not valid in the sort.)); } } }
Have a cookie and a very nice day!
Lady Aleena

Comment on RFC: I rewrote a custom sort function, but not sure if it is the best way.
Select or Download Code
Re: RFC: I rewrote a custom sort function, but not sure if it is the best way.
by davido (Archbishop) on Mar 03, 2013 at 01:24 UTC

    All items matching ^index\. will fail to sort on any component of the token that comes after "index.". That means that there would be no way to really predict how index.b, index.a, index.c would be sorted, except that the merge sort is a "stable" sort.

    Also, cmp is a relatively flawed operator (it's not just Perl; any code sorting using a comparison of ASCII values is equally flawed). You might want to invoke Unicode::Collate's cmp method instead. The cmp operator is predictable, but only sorts "ascii-betically", which nowadays is only useful in accomplishing a predictable order.

    Finally, the amount of work you're doing inside of a sort routine is substantial. If you have 100 items you're sorting, your sort function will be called somewhere around 100 * log2(100) times, or approximately 660 times. If you're sorting 10000 items, it will be called approximately 133000 times. Each regular expression inside the subroutine has some non-zero level of computational complexity. split has linear complexity...and so on. All this begins to add up fast as you multiply it by how many times the subroutine is invoked in the course of sorting your list. You might wish, instead, to pre-compute and cache, using something like the Schwartzian Transform, or the Guttman-Rosler Transform, where you precompute the sort keys just one time each, and then sort based on those keys. To some degree it would require rewriting your routine. But it would be an excellent learning exercise.

    Of course you should only worry about the computational complexity if you intend this solution to be used on lists large enough for it to matter. A few hundred or even a few thousand items probably isn't worth the effort. But if you find yourself initiating a sort and then having time to walk away and pop a bag of popcorn in the microwave, it's time to start exploring the transforms. Or just enjoy the break. ;)


    Dave

Re: RFC: I rewrote a custom sort function, but not sure if it is the best way.
by roboticus (Canon) on Mar 03, 2013 at 02:13 UTC

    O provider of cookies:

    I have a few notes for you, but I'm going to apologize in advance--I don't have enough time to shorten this and clean it up nicely. So this is one of my occasional hideously long nodes, and is a bit rambly, and I'll leave out a few bits that I meant to say. I'll make use of the readonly tag, for those who don't want to see it all. With that disclaimer, here goes...

    • Your code to move index files to the top can fail because your comparison function isn't "stable".

    • The next thing I'd like to mention is that you're doing a good deal of work to make a single sort routine, with a parameter to tell it what to do. I'd suggest splitting your sort routine into smaller pieces.

    • Next, your subroutine does a *lot* of manipulation, and since it will be called multiple times for every string in the collection, you could be doing a lot more work than necessary. For small lists, that's no big deal. But it can add up quickly. There's a technique called the Schwartzian transform that can greatly reduce data manipulation during sorting.

    Update: When I started writing this node, there were no responses. But now davido already hit all the the high points. On the bright side, now I have a little reading material for tomorrow. I've not heard of the Guttman-Rosler transform. (Update 2: Well, Guttman-Rosler is described in the node I referenced, so I'm guessing I just forgot about it....)

    Update: s/albatrioss/albatross/ig; fixed a broken document link. (update very late: 20140811)

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      This may be disjointed, I am a little tired after the reading and all the other things I did today. With what I have so far, I have 47 instances of my_sort to alter.

      Thank you roboticus for you really long explanation of where I went wrong. I made most of the changes you recommended with a few exceptions.

      • I did not use my_sort_index since I do not know what you had in mind for it. However, I want it in both sort functions since I have directories with files I want sorted by name with their parent index.(pl|shtml|html) file in the mix.
      • I can not figure out how to get the Schwartzian transform into the functions. Where do they go in my_article_sort and my_name_sort? Are they a whole other function?

      davido, I looked at Unicode::Collate. I can not see how it would work with hash values. The writer only uses arrays in the examples. The way I use the sort functions now is as follows.

      for my $movie (sort { my_article_sort($a->{title},$b->{title}) values +%movies} { ... } for my $alpha (sort { my_article_sort($a,$b) } keys %alpha_movies) { . +.. } for my $character (sort { my_name_sort($a->{name},$b->{name}) values % +player_characters} { ... } for my $color (sort { my_article_sort($a->{name},$b->{name}) values %c +olors} { ... } my @files = (map("$data_dir$_",grep(/txt$/,sort { my_article_sort($a,$ +b) } readdir($directory))));

      I looked at both the transform links but got really lost. As stated above, I do not know how they can be added to the two sort functions. Also, I still have problems with map BLOCK (and grep BLOCK). After about six or eight tries to get a map to work, I usually end up doing a for loop to get the desired result whether it be a concatenation, some regex, or math. It has been a long time since I wrote a map, and at the time I had to have someone holding my hand while I was doing it. The map/grep above is about the best I can do at the moment.

      7stud, I have begun work on putting the dispatch table into a function. I am not sure of my approach just yet. At least the functions are no longer longer than the screen height. Oh, I looked at Sort::Maker. I do not understand it one bit.

      The following is where I am at the moment. (Please ignore short_sorts, it is still a mess. I hardly ever use it.)

      Have a cookie and a very nice day!
      Lady Aleena

        Lady_Aleena,

          ...I looked at Unicode::Collate. I can not see how it would work with hash values. The writer only uses arrays in the examples...

        I only used 'Unicode::Collate' once, but it changes the sort order of the ASCII characters as well.

        Do you have a sample input file and the required output file. You seem to be doing a lot of work and then using the Perl 'sort' anyway.

        Good Luck...Ed

        "Well done is better than well said." - Benjamin Franklin

        Lady_Aleena

        Regarding "my_sort_index": Oh, yeah, I totally forgot to write that.

        Well, when writing comparison routines, you first want to determine what all the cases are that you want to handle, and what to do for each case. For ensuring that strings beginning with "index." sort to the top of the list, there are four cases:

        1. Neither string has the desired prefix: For this case, we simply return 0 and let the next bit of code choose the order.
        2. Only the first string has the desired prefix: Here, we return -1 to make the first string rise to the top of the list.
        3. Only the second string has the desired prefix: Here, we return +1 to make the second string rise to the top of the list.
        4. Both strings have the desired prefix: In this case, we don't really care which one is first. The first three cases ensure that the strings with the prefix float to the top. So we'll return 0 to let the next bit of code check the rest of the order.

        So, here's my (untested!) version of the my_sort_index routine:

        sub my_sort_index { my ($c, $d) = @_; # We want any item starting with "index." to sort to the top of # the list. The rest of the ordering is handled by the caller. # So there are four cases we must handle: # # Matches? # $c $d return # No No 0 # Yes No -1 # No Yes +1 # Yes Yes 0 (-1 + +1 == 0) # # In the fourth case, when both strings have the "index." prefix, # we want them to sort via the normal method so that "index.1" wil +l # always sort before "index.2". my $return_value = 0; --$return_value if $c =~ /^index\./; ++$return_value if $d =~ /^index\./; return $return_value; }

        We could enforce the order in the fourth case, but it would make the subroutine a bit less useful. Why? It makes the subroutine harder to use with other subroutines for a sorting job, because it forces an order beyond just making sure that strings prefixed with "index." go first. In other words, if neither string began with index, it would use a different ordering than if both strings began with index. That can be problem.

        Suppose you want all strings with "index." to sort to the top of the list, but you want everything sorted in reverse order beyond that, you'd have to have another version of your code. With the code we have now, though, we could do:

        my @sorted = sort { my_index_sort($a,$b) || $b cmp $a } @list;

        Regarding the Schwartzian transform: I wanted you to be aware of the technique, because it can let you do some pretty interesting things. When I use them, it's generally because I think to myself, "Oh, this would be *so* much easier to sort if everything looked like XYZ". Then I try to figure out an easy way to take what I have and turn it into XYZ.

        In the case of sorting names, for example. As you know, to handle names well, you'll have a good bit of testing and comparison and such to arrange the name(s). and there are quite a few special cases to worry about. In this case, I'd think to myself. "Oh, this would be *so* much easier to sort if the names were split into first, middle and last name with the extra bits at the end. Then I could just sort them normally." Then I figure out how to do that.

        sub convert_name_to_L_F_M_Extras { # Lots of ugly code to split the name into bits, determine which b +it goes where, and # return an array of [ LastName, FirstName, MiddleName, Extras ], +like so: # "Dr. John Smith" => [ Smith, John, undef, Dr. ] # "Eucalyptus A. Tree, Esquire" => [ Tree, Eucalyptus, A., Esquire + ] return [ $last, $first, $middle, $rest ]; } my @list = ( "Dr. John Smith", "Eucalyptus A. Tree, Esquire", "Robotic +us", "Lady Aleena, Baker of cookies" ); my @transformed_list = map { [ convert_name_to_L_F_M_Extras($_), $_ ], + } @list; # Now looks like: # @transformed_list = ( # [ [ "Smith", "John", undef, "Dr" ], "Dr. John Smith" +], # [ [ "Tree", "Eucalyptus", "A.", "Esquire" ], "Eucalyptus A. Tr +ee, Esquire" ], # ... # ); sub sort_by_name { # $a and $b contain an arrayref, the first entry has an arrayref w +ith last, first, etc. # Sort by last name: $a->[0][0] cmp $b->[0][0] # If they have the same last name, sort by first name: || $a->[0][1] cmp $b->[0][1] ... etc ... } my @sorted = sort {sort_by_name($a,$b)} @transformed_list; # Now throw away the transformed names and keep only the list of sorte +d names. @sorted = map { $_->[1] } @sorted;

        Now, having said all that, I mainly mentioned it to help you shift your perspective on sorting. I don't know that you'll necessarily be able to use it, and the case I created above is my best guess of how it *might* be useful to you.

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

Re: RFC: I rewrote a custom sort function, but not sure if it is the best way.
by 7stud (Deacon) on Mar 03, 2013 at 02:15 UTC

    Let's start with the premise that a sort routine shouldn't be longer than 10 lines. What can you do to comply with that restriction?

    It looks to me like you are doing a lot of array processing inside the sort routine. Why not process the whole array first, then apply your sort routine? The advantage of processing your array before sorting is that you only touch each element once, where sort touches the elements multiple times.

    The next question is: why do you think you need one sort routine to sort every conceivable array you might conjure up? A better way would be to do something like this:

    use strict; use warnings; use 5.010; my %sort_by = ( article => \&article_sorter, name => \&name_sorter, subject => \&subject_sorter, ); my @results = $sort_by{name} @data

    Once again, by doing that you are moving some of the processing out of the sort routine by checking for the type before calling sort().

    Next comes advanced sorting, i.e. efficient sorting. As davido mentioned, it is a great learning process to learn how to do a Schwartzian Transform or use an Orcish Manouvre in your sort routine. Okay, even though they sound like theorems you might learn about in a Theoretcial Physics class, they are quite easy to grok.

    But if you fail to fully grasp the above means for efficient sorting, not to worry--there is a module that will help you out called Sort::Maker.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1021473]
Approved by davido
Front-paged by Old_Gray_Bear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (8)
As of 2014-10-01 20:06 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    What is your favourite meta-syntactic variable name?














    Results (38 votes), past polls