Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Re^2: best sort

by jpl (Monk)
on Aug 16, 2011 at 13:24 UTC ( #920486=note: print w/ replies, xml ) Need Help??


in reply to Re: best sort
in thread best sort

Calling sort without a comparison function is quite often the wrong thing to do, even on plain text.

There are a couple things that sorting can do for you, but they won't necessarily happen if you supply your own comparison function.

  • You can tell if some string you are interested would come before or after an arbitrary item in the sorted output.
  • All "identical" items will appear together.

Those of us accustomed to dealing solely with 7-bit English characters take advantage of these properties which fall out nicely with the default comparison function. For example, if I am looking for the word "macnulty" and I find "machinery" in the sorted output I know I must look later in the list, and I know that "mable" cannot appear following either. I further know that all occurrences of "machinery" appear together, so, having found "machinery" and something else, I will never again see "machinery" in the list. This is true of all prefixes of strings as well as complete strings.

When one ventures outside of 7-bit code points, as a good citizen of the world must, or into domain-specific applications, like surnames, the "proper" sort order may not preserve these properties. "machinery" may follow both "macnulty" and "mable". I might encounter "mcnulty" between two distinct occurrences of "macnulty" unless someone has been careful to tie-break nominally (an appropriate term in this case) identical surnames using something very like the default comparison function.

The comparison function is a contract, of sorts, between the producer of and the consumers of the sorted output. If the producer and consumers have different expectations, something unseemly is likely to occur. One of my first assignments was to sort white pages listings using rules that were so complex (where does "General John Smith Jr" sort relative to "Doctor John J Smith III"?) that I could be certain that the average white pages user had no clue what the rules actually were. Fortunately, most lists of names were short enough that users could spot the right name without understanding the sort order. Sorting has its place, but indexing on N-grams is proving to be a more user-friendly search mechanism.


Comment on Re^2: best sort
Re^3: best sort
by tchrist (Pilgrim) on Aug 16, 2011 at 15:48 UTC
    There are a couple things that sorting can do for you, but they are won't necessarily happen if you supply your own comparison function.
    • You can tell if some string you are interested would come before or after an arbitrary item in the sorted output.
    • All "identical" items will appear together.
    As you say, it won’t necessarily happen, but it usually does, because you know what your comparison function is doing.
    Tail:
    menorrhoeic pinnisected chimaerid weighable foregone plastochrone Alf gravamen hemen preabdomen metanomen praenomen Carmen acumen tegumen bitumen prolusion Malabar watercress usheress speechcraft unshrew windingly xyzzy
    Length:
    Alf hemen xyzzy acumen Carmen bitumen Malabar tegumen unshrew foregone gravamen usheress chimaerid metanomen praenomen prolusion weighable windingly preabdomen watercress menorrhoeic pinnisected speechcraft plastochrone
    By vowels:
    Alf Malabar gravamen Carmen watercress praenomen plastochrone acumen metanomen preabdomen hemen speechcraft weighable menorrhoeic tegumen chimaerid pinnisected windingly bitumen foregone prolusion unshrew usheress xyzzy
    By consonants:
    bitumen chimaerid acumen Carmen foregone gravamen hemen Alf Malabar menorrhoeic metanomen unshrew plastochrone pinnisected preabdomen prolusion praenomen usheress speechcraft tegumen weighable windingly watercress xyzzy
    Syllable count:
    Alf Carmen foregone hemen speechcraft unshrew acumen bitumen chimaerid gravamen Malabar plastochrone praenomen prolusion tegumen usheress watercress weighable windingly metanomen preabdomen menorrhoeic pinnisected xyzzy
    Vowel count:
    menorrhoeic weighable metanomen praenomen preabdomen chimaerid plastochrone pinnisected foregone prolusion Malabar gravamen speechcraft watercress tegumen usheress windingly acumen bitumen hemen xyzzy Carmen unshrew Alf
    Consonant density:
    xyzzy windingly acumen Alf bitumen Carmen chimaerid foregone gravamen hemen Malabar menorrhoeic metanomen pinnisected plastochrone praenomen preabdomen prolusion speechcraft tegumen unshrew usheress watercress weighable
    Code point summation:
    Alf hemen Carmen xyzzy acumen Malabar bitumen tegumen unshrew gravamen foregone usheress chimaerid weighable metanomen praenomen windingly prolusion preabdomen watercress speechcraft pinnisected menorrhoeic plastochrone
    Of those, about the only one I can’t really pretty much eyeball is the last one, because I know what I am comparing and how. If those were the things I was looking into, I would certainly not want a bare sort for any of them.
      If those were the things I was looking into, I would certainly not want a bare sort for any of them.

      If you anticipate repeated elements, you'd probably want a tie-breaking

      $a cmp $b
      (the bare sort comparison function) on all but the first, because non-identical terms can otherwise compare equal and identical elements may not be adjacent in the sorted output. But I think we are in fundamental agreement: You need to know what you are comparing and how. That may be easier said than done. Knowing that "machenry" is a Scottish name, but "machinery" is not (or is it, and, if not, why not) makes "knowing what you are comparing" non-trivial.
        If those were the things I was looking into, I would certainly not want a bare sort for any of them.
        If you anticipate repeated elements, you'd probably want a tie-breaking
        $a cmp $b
        Well, sure. Here’s some of the code to generate one of those:
        say $_->{PHRASE} for sort {     $b->{TOTAL_VOWELS} <=>  $a->{TOTAL_VOWELS}         ||    $b->{MAX_ANY_VOWEL} <=>  $a->{MAX_ANY_VOWEL}         ||         $b->{NUM_OF_A} <=>  $a->{NUM_OF_A}         ||         $b->{NUM_OF_E} <=>  $a->{NUM_OF_E}         ||         $b->{NUM_OF_I} <=>  $a->{NUM_OF_I}         ||         $b->{NUM_OF_O} <=>  $a->{NUM_OF_O}         ||         $b->{NUM_OF_U} <=>  $a->{NUM_OF_U}         ||         $b->{NUM_OF_Y} <=>  $a->{NUM_OF_Y}         ||         $a->{DICTFOLD} cmp  $b->{DICTFOLD};         ||          $a->{RECNO} <=>  $b->{RECNO}; } @records;
        Look more reasonable?

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://920486]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others pondering the Monastery: (4)
As of 2014-07-13 01:09 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    When choosing user names for websites, I prefer to use:








    Results (244 votes), past polls