
I almost never deal with people’s names; I think doing so is pretty rare. However, except for the Mc/Mac thing, whose solution I didn’t actually include because it requires more explanation than I cared to go into then (you need to supply a preprocess transform to the constructor), all the rest is pretty standard text stuff.

Almost the only time I sort stuff is for output. Otherwise order is probably immaterial. And if I’m sorting for output, I want to be able to look at text in the order that English has used for hundreds of years to sort its text. I do not want to look at things in random code order; down that road lies madness.

You can look at the default UCA sort as a very good approximation to historical order of text. No, it isn’t perfect, but everyone is equally disadvantaged. For example, it doesn’t do the English Mc/Mac trick unassisted.
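For the curious, here is a minimal sketch of what the Mc/Mac assist might look like with Unicode::Collate's preprocess option. The substitution rule and the sample names are my own assumptions for illustration, not a complete treatment of such surnames.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Assumed rule (illustrative only): for sorting purposes, read "Mc"
# followed by a capital letter as if it were spelled "Mac".
my $mac = Unicode::Collate->new(
    preprocess => sub {
        my $name = shift;
        $name =~ s/\bMc(?=\p{Lu})/Mac/g;
        return $name;
    },
);

my @names   = qw(MacDonald McAdam Mabel MacBeth);
my @plain   = Unicode::Collate->new->sort(@names);  # no assist
my @mc_sort = $mac->sort(@names);                   # Mc read as Mac

print "default:    @plain\n";
print "preprocess: @mc_sort\n";
```

The transform only affects the keys used for comparison; the strings you get back are the originals.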

What it does do, though, seems to be the logical thing for text in a way that code point order never is. Explaining it makes it seem harder than it is, but it isn’t hard; it’s what makes sense for text, and it’s how we’ve done it forever.

This is a little bit of a simplification, but it works essentially this way:

1. Primary strength: Compare to see whether the basic letters are the same. Ignore nonletters at this stage; just skip ahead till you find a letter. If the letters aren’t the same for the same relative position, there is an established dictionary order that says what goes first. If you are a user of the Latin alphabet, this will be in the order of the abc’s you learned in school, so Fred comes before freedom, as does free beer. The reason it puts free beer in front of freedom is that the fifth letter in the first string is b, and that comes before the fifth letter in the second string, which is d. See how that works? That’s dictionary order. We aren’t doing a field sort here.
2. Secondary strength: If the letters are the same, then check whether the diacritics are the same. By default we resolve ties by looking at the diacritics reading left to right, but this can be flipped to do so right to left to keep the French less unhappy. (The classic demo is that normal LTR tie‐breaking order sorts cote < coté < côte < côté, whereas the French RTL tie‐breaking order for diacritics sorts cote < côte < coté < côté. Yes, really, and I’m sorry, but it’s truly not my fault. It has to do with their inflectional morphology, which is tail‐based.)
3. Tertiary strength: If the letters and the diacritics are the same, then check whether the case is the same. By default, lowercase precedes uppercase, but this is easy to flip.
4. Quaternary strength: If the letters, the diacritics, and the case are all the same for a given position, now you may go back and reconsider any nonletters, like punctuation and symbols and whitespace.
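The two "easy to flip" knobs in the list above can be tried out with a short sketch like this one. It assumes Unicode::Collate's backwards and upper_before_lower constructor options; the variable names are just for illustration.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode(STDOUT, ':encoding(UTF-8)');

my @words = ('côté', 'coté', 'cote', 'côte');

# Default: diacritic ties are broken reading left to right.
my @ltr = Unicode::Collate->new->sort(@words);

# French-style: compare the secondary (diacritic) level right to left.
my @rtl = Unicode::Collate->new(backwards => 2)->sort(@words);

print "default:   @ltr\n";   # cote coté côte côté
print "backwards: @rtl\n";   # cote côte coté côté

# Tertiary level: lowercase first by default, uppercase first on request.
my @lower_first = Unicode::Collate->new->sort(qw(Fred fred));
my @upper_first = Unicode::Collate->new(upper_before_lower => 1)
                                  ->sort(qw(Fred fred));

print "lower first: @lower_first\n";
print "upper first: @upper_first\n";
```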

You don’t have to do all those if you don’t want. You can for example tell it to use only the primary strength, which only considers basic letters and absolutely nothing else, even case. (That’s how you do an “accent‐insensitive” string comparison, BTW, using your collator object’s eq method.) If you wanted it to ignore case but consider accents for level one ties, just set it to do only the first two stages and skip the rest.
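Here is a small sketch of those two settings, assuming Unicode::Collate's level option and its eq method. The sample words are mine.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

# Level 1: basic letters only, so accents and case are both ignored.
my $primary = Unicode::Collate->new(level => 1);

# Level 2: letters plus diacritics, so case is still ignored.
my $secondary = Unicode::Collate->new(level => 2);

my $accent_insensitive = $primary->eq('résumé', 'RESUME')   ? 1 : 0;
my $case_insensitive   = $secondary->eq('resume', 'RESUME') ? 1 : 0;
my $accents_ignored_l2 = $secondary->eq('résumé', 'resume') ? 1 : 0;

print "level 1, résumé eq RESUME: $accent_insensitive\n";
print "level 2, resume eq RESUME: $case_insensitive\n";
print "level 2, résumé eq resume: $accents_ignored_l2\n";
```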

You can even define more levels than the standard four if you want, and all kinds of things can be tweaked, but the default is usually good enough. This works pretty well for English text, and it still works even if it has non‐Latin mixed in with it here and there, since Greek letters would sort within their own alphabet, &c.

Strings like Joe Bob, Joe‐Bob, Joebob, and Jo E. Bob are all the same at the first two levels, and only diverge after that, with case differences at the third level and nonletters at the fourth. Yes, this really does make it easier to read lists of text. I have processed thousands and thousands of output reports of English text this way, and it really does.
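You can watch those four strings diverge level by level with something like the following sketch. It uses an ASCII hyphen-minus in Joe-Bob for convenience, and the %same hash is just a name I made up for collecting the results.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @names = ('Joe Bob', 'Joe-Bob', 'Joebob', 'Jo E. Bob');

# For each level, list the names that still compare equal to the first one.
my %same;
for my $level (1 .. 4) {
    my $collator = Unicode::Collate->new(level => $level);
    $same{$level} = [ grep { $collator->eq($names[0], $_) } @names ];
    printf "level %d: %s\n", $level, join(' | ', @{ $same{$level} });
}
```

At levels one and two all four should be listed; at level three only the ones whose case matches; at level four only Joe Bob itself.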

If you don’t like how it by default ignores nonletters at the primary comparison stage, that is tweakable. (It is a bit misleading of me to put it that way, since numbers come before letters and such; it isn’t as stupid as I may have made it sound.)
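One way to do that tweak, sketched here on the assumption that Unicode::Collate's variable option is the knob you want, is to ask for 'non-ignorable' handling so that spaces and punctuation count at the primary level. The phrases are my own.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @phrases = ('freedom', 'free will', 'free beer');

# Default handling: the space is skipped until the fourth level,
# so "freedom" (d) lands between "free beer" (b) and "free will" (w).
my @skipped = Unicode::Collate->new->sort(@phrases);

# Non-ignorable: the space now counts at the primary level and
# sorts ahead of any letter.
my @counted = Unicode::Collate->new(variable => 'non-ignorable')
                              ->sort(@phrases);

print "default:       ", join(', ', @skipped), "\n";
print "non-ignorable: ", join(', ', @counted), "\n";
```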

I guess one way to think of the default UCA sort in Perl pseudocode would be:

primary($a)    <=>   primary($b) 
                ||
secondary($a)  <=>   secondary($b) 
                ||
tertiary($a)   <=>   tertiary($b) 
                ||
quaternary($a) <=>   quaternary($b) 

Like the code above, the multilevel sort short circuits as soon as it finds a difference between its operands at some level. You can set it to do only N levels if you want. The pseudocode presumes that those functions are defined to return the right sort of magic number, and it doesn’t account for how the default ignorables get skipped unless you manage to fall through to the fourth level. Remember that letter positions aren’t the same as equivalent indices of arrays.
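If you want something you can actually run, here is one rough stand-in for that pseudocode. It is only a sketch: instead of real primary() through quaternary() functions, it uses four collators limited to levels 1 through 4 (each of which also includes the levels before it), and chained_cmp is a name I invented. The chain should agree with a plain default collator.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

# Stand-ins for primary() .. quaternary(): collators capped at levels 1..4.
my @by_level = map { Unicode::Collate->new(level => $_) } 1 .. 4;

# Return the first nonzero comparison, trying the shallowest level first.
sub chained_cmp {
    my ($x, $y) = @_;
    for my $collator (@by_level) {
        my $result = $collator->cmp($x, $y);
        return $result if $result;
    }
    return 0;
}

# Check the chain against an ordinary default collator on a few strings.
my $default = Unicode::Collate->new;
my @words   = ('cote', 'côté', 'Cote', 'co-te', 'coté', 'free beer', 'freedom');

my $mismatches = 0;
for my $x (@words) {
    for my $y (@words) {
        $mismatches++ if chained_cmp($x, $y) != $default->cmp($x, $y);
    }
}
print "mismatches: $mismatches\n";
```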

I bet I’ve managed to convince everyone that nobody needs anything as complicated as this and that you should all go back to garbage ordering. That certainly isn’t my intent. If you just try the collator object’s default sort on text the next time you have to sort some text, without any tweaking or anything, I think you will be aesthetically pleased with the results. More pleased, in fact, than you are(n’t) by the code point garbage order you get with the unmodified builtin sort.
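To see the difference for yourself, a side-by-side like this is about all it takes. The word list is arbitrary; the only thing assumed is the collator's default settings.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode(STDOUT, ':encoding(UTF-8)');

my @words = ('Zebra', 'apple', 'Éclair', 'banana');

# Builtin sort: plain code point order.
my @codepoint = sort @words;

# Collator with no tweaking at all.
my @collated = Unicode::Collate->new->sort(@words);

print "builtin:  @codepoint\n";
print "collator: @collated\n";
```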

In other words: “Try it, you’ll like it!”

DISCLAIMER:
As always, your mileage may vary — except of course in Europe where this statement is illegal, in which case you didn’t read it anyway so no harm no foul.

In reply to Re^3: best sort by tchrist
in thread best sort by ag4ve
