PerlMonks
I almost never deal with people’s names; I think doing so is pretty rare.
However, except for the Mc/Mac thing, whose solution I didn’t actually include because it requires more explanation than I cared to go into then (you need to supply a preprocess transform to the constructor), all the rest is pretty standard text stuff.
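For the curious, here is a minimal sketch of what such a preprocess transform could look like. It assumes the core Unicode::Collate module with its default table; the regex and the name list are made up for illustration and are not the solution alluded to above. The transform only changes how the sort key is built, so the strings come back unmodified.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Invented sample names, for illustration only.
my @names = qw(Madison McBride MacDonald MacArthur);

# Default collator: no special treatment of "Mc".
my @plain = Unicode::Collate->new->sort(@names);

# Assumed transform (illustrative): before building the sort key,
# read "Mc" followed by a capital letter as if it were "Mac".
my $mac = Unicode::Collate->new(
    preprocess => sub {
        my $str = shift;
        $str =~ s/\bMc(?=\p{Lu})/Mac/g;
        return $str;
    },
);
my @filed = $mac->sort(@names);

print "default:    @plain\n";
print "preprocess: @filed\n";
```

One thing to watch with a transform like this: it also affects the collator’s comparison methods, so McDonald and MacDonald would compare as equal under it.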
Almost the only time I sort stuff is for output. Otherwise order is probably immaterial. And if I’m sorting for output, I want to be able to look at text in the order that English has used for hundreds of years to sort its text. I do not want to look at things in random code point order; down that road lies madness. You can look at the default UCA sort as a very good approximation to the historical order of text. No, it isn’t perfect, but everyone is equally disadvantaged. For example, it doesn’t do the English Mc/Mac trick unassisted. What it does do, though, seems to be the logical thing for text in a way that code point order never is. Explaining it makes it seem harder than it is, but it isn’t hard; it’s what makes sense for text, and it’s how we’ve done it forever. This is a little bit of a simplification, but it works essentially this way: first compare just the basic letters; if that’s a tie, compare the accents; if that’s still a tie, compare case; and if even that is a tie, compare the nonletters.
You don’t have to do all those if you don’t want. You can for example tell it to use only the primary strength, which considers only basic letters and absolutely nothing else, not even case. (That’s how you do an “accent‐insensitive” string comparison, BTW, using your collator object’s eq method.) If you wanted it to ignore case but consider accents for level‐one ties, just set it to do only the first two stages and skip the rest. You can even define more levels than the standard four if you want, and all kinds of things can be tweaked, but the default is usually good enough.
This works out pretty well for English text, and it still works even if there is non‐Latin text mixed in here and there, since Greek letters would sort within their own alphabet, &c. Strings like Joe Bob, Joe‐Bob, Joebob, and Jo E. Bob are all the same at the first two levels, and only diverge after that, with case differences at the third level and nonletters at the fourth. Yes, this really does make it easier to read lists of text. I have processed thousands and thousands of output reports of English text this way, and it really does.
If you don’t like how it by default ignores nonletters at the primary comparison stage, that is tweakable. (It’s a bit misleading of me to put it that way, since numbers come before letters and such; it isn’t as stupid as I may have made it sound.)
I guess one way to think of the default UCA sort in Perl pseudocode would be as a chain of comparisons joined by ||, one per level: primary first, then secondary, then tertiary, then quaternary. Like any such chain, the multilevel sort short‐circuits as soon as it finds a difference between its operands at a given level. You can set it to do only N levels if you want. That presumes that the per‐level comparison functions are defined to return the right sort of magic number. It doesn’t account for how the default ignorables get skipped unless you manage to fall through to the fourth level. Remember that letter positions aren’t the same as equivalent indices of arrays.
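As a rough, runnable stand‐in for that level‐by‐level idea (not the pseudocode from the original post), the sketch below builds one real collator per strength and reports the first level at which two strings differ. It assumes the core Unicode::Collate module with its default table and default handling of nonletters; first_difference is a made‐up helper name, and the sample strings are only illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

binmode(STDOUT, ':encoding(UTF-8)');

# One collator per strength: level => N compares levels 1 through N only.
my @by_level = map { Unicode::Collate->new(level => $_) } 1 .. 4;

# Hypothetical helper: the first level at which two strings differ,
# or 0 if they tie at all four levels.
sub first_difference {
    my ($x, $y) = @_;
    for my $n (1 .. 4) {
        return $n if $by_level[$n - 1]->cmp($x, $y) != 0;
    }
    return 0;
}

my @pairs = (
    [ 'resume',  "r\x{E9}sum\x{E9}" ],  # differ only in accents
    [ 'joebob',  'Joebob'  ],           # differ only in case
    [ 'Joe Bob', 'Joe-Bob' ],           # differ only in nonletters
);
printf "%-8s vs %-8s first differ at level %d\n", @$_, first_difference(@$_)
    for @pairs;

# Primary strength alone gives an accent-insensitive
# (and case-insensitive) equality test via the eq method.
my $same_letters = $by_level[0]->eq('resume', "R\x{E9}sum\x{E9}");
```

Building four collators is wasteful for real work; a single collator with the level you want is the normal way to use it. The point here is only to make the tie‐breaking order visible.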
I bet I’ve managed to convince everyone that nobody needs anything as complicated as this and that you should all go back to garbage ordering. That certainly isn’t my intent. If you just try the collator object’s default sort on text the next time you have to sort some text, without any tweaking or anything, I think you will be aesthetically pleased with the results. More pleased, in fact, than you are(n’t) by the code point garbage order you get with the unmodified builtin sort. In other words: “Try it, you’ll like it!”
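For anyone who wants to try it, a minimal side‐by‐side might look like the following. It assumes only the core Unicode::Collate module with no options set; the word list is invented for illustration.

```perl
use strict;
use warnings;
use Unicode::Collate;

binmode(STDOUT, ':encoding(UTF-8)');

# Invented sample data mixing case and one accented word.
my @words = ('Zebra', 'apple', "\x{E9}clair", 'Banana', 'cherry', 'eclipse');

# Builtin sort: plain code point order.
my @codepoint = sort @words;

# Collator default: letters first, then accents, case, and nonletters as tie-breakers.
my @collated = Unicode::Collate->new->sort(@words);

print "builtin:  @codepoint\n";
print "collator: @collated\n";
```

The builtin order puts every capitalized word ahead of every lowercase one and pushes the accented word to the very end; the collator interleaves them the way a dictionary would.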
In reply to Re^3: best sort
by tchrist