Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Re: best sort

by tchrist (Pilgrim)
on Aug 14, 2011 at 23:12 UTC ( [id://920264]=note: print w/replies, xml ) Need Help??


in reply to best sort

NOTE: The following is a previously unpublished excerpt from the draft manuscript of Programming Perl, 4ᵗʰ edition, slightly modified for this answer.


Calling sort without a comparison function is quite often the wrong thing to do, even on plain text. That’s because if you use a bare sort, you can get really strange results. It’s not just Perl either: almost all programming languages work this way, even the shell command. You might be surprised to find that with this sort of nonsense sort, ‹B› comes before ‹a› not after it, ‹é› comes before ‹d›, and ‹ff› comes after ‹zz>. There’s no end to such silliness, either; see the default sort tables at the end of this posting to see what I mean.

There are situations when a bare sort is appropriate, but fewer than you think. One scenario is when every string you’re sorting contains nothing but the 26 lowercase (or uppercase, but not both) Latin letters from ‹a–z›, without any whitespace or punctuation.

Another occasion when a simple, unadorned sort is appropriate is when you have no other goal but to iterate in an order that is merely repeatable, even if that order should happen to be completely arbitrary. In other words, yes, it’s garbage, but it’s the same garbage this time as it was last time. That’s because the default sort resorts to an unmediated cmp operator, which has the “predictable garbage” characteristics I just mentioned.

The last situation is much less frequent than the first two. It requires that the things you’re sorting be special‐purpose, dedicated binary keys whose bit sequences have with excruciating care been arranged to sort in some prescribed fashion. This is also the strategy for any reasonable use of the cmp operator.

So what’s wrong with sort, anyway?

I know, I know. I can hear everyone saying, “But it’s called sort, so how could that ever be wrong?” Sure it’s called sort, but you still have to know how to use it to get useful results out. Probably the most surprising thing about sort is that it does not by default do an alphabetic, an alphanumeric, or a numeric sort. What it actually does is something else altogether, and that something else is of surprisingly limited usefullness.

Imagine you have an array of records. It does you virtually no good to write:

@sorted_recs = sort @recs;
Because Perl’s cmp operator does only a bit comparison not an alphabetic one, it does nearly as little good to write your record sort this way:
@srecs = sort {     $b->{AGE}      <=>  $b->{AGE}                    ||     $a->{SURNAME}  cmp  $b->{SURNAME} } @recs;
The problem is that that cmp for the record’s SURNAME field is not an alphabetic comparison. It’s merely a code point comparison. That means it works like C’s strcmp function or Java’s String.compareTo method. Although commonly referred to as a “lexicographic” comparison, this is a gross misnomer: it’s about as far away from the way real lexicographers sort dictionary entries as you can get without flipping a coin.

Fortunately, you don’t have to come up with your own algorithm for dictionary sorting, because Perl provides a standard class to do this for you Unicode::Collate. Don’t let the name throw you, because while it was first invented for Unicode, it works great on regular ASCII text, too, and does a better job at making lexicographers happy than a plain old sort ever manages.

If you have code that purports to sort text that looks like this:

@sorted_lines = sort @lines;
Then all you have to get a dictionary sort is write this instead:
use Unicode::Collate; @sorted_lines = Unicode::Collate::->new->sort(@lines);
For structured records, though, like those above with ages and surnames in them, you have to be a bit fancier. One way to fix it would be to use the class’s own cmp operator instead of the built‐in one.
use Unicode::Collate; my $collator = Unicode::Collate::->new(); @srecs = sort {     $b->{AGE}  <=>  $b->{AGE}           ||     $collator->cmp( $a->{SURNAME}, $b->{SURNAME} ) } @recs;
However, that makes a fairly expensive method call for every possible comparison. Because Perl’s adaptive merge sort algorithm usually runs in O(n · log n) time given n items, and because each comparison requires two different computed keys, that can be a lot of duplicate effort. Our sorting class therefore provide a convenient getSortKey method that calculates a special binary key, which you can cache and later pass to the normal cmp operator on your own. This trick lets you use cmp yet get a truly alphabetic sort out of it for a change.

Here is a simple but sufficient example of how to do that:

use Unicode::Collate; my $collator = Unicode::Collate::->new(); # first calculate the magic sort key for each text field, and cache it for my $rec (@recs) {     $rec->{SURNAME_key} = $collator->getSortKey( $rec->{SURNAME} ); }  # now sort the records as before, but for the surname field, # use the cached sort key instead @srecs = sort {     $b->{AGE}          <=>  $b->{AGE}                       ||     $a->{SURNAME_key}  cmp  $b->{SURNAME_key} } @recs;
That’s what I meant about very carefully preparing a mediated sort key that contains the precomputed binary key.

English Card Catalogue Sorts

The simple code just demonstrated assumes you want to sort names the same way you do regular text. That isn’t a good assumption, however. Many countries, languages, institutions, and sometimes even librarians have their own notions about how a card catalogue or a phonebook ought to be sorted.

For example, in the English language, surnames with Scottish patronymics starting with ‹Mc› or ‹Mac›, like MacKinley and McKinley, not only count as completely identical synonyms for sorting purposes, they go before any other surname that begins with ‹M›, and so precede surnames like Mables or Machado. Yes, really. That means that the following names are sorted correctly — for English:

Lewis, C.S. McKinley, Bill MacKinley, Ron Mables, Martha Machado, José Macon, Bacon
Yes, it’s true. Check out your local large English‐language bookseller or library — presuming you can find one. If you do, best make sure to blow the dust off first.

Sorting Spanish Names

It’s a good thing those names are following English rules for sorting names. If this were Spanish, we would have to deal with double‐barrelled surnames, where the patronym sorts before the matronym, which in turn sorts before any given names. That mean’s that if Señor Machado’s full name were, like the poet’s, Antonio Cipriano José María y Francisco de Santa Ana Machado y Ruiz, then you would have to sort him with the other Machados but then consider Ruiz before Antonio if there were any other Machados. Similarly, the poet Federico del Sagrado Corazón de Jesús García Lorca sorts before the writer Gabriel José de la Concordia García Márquez.

On the other hand, if your records are not full multifield hashes but only simple text that don’t happen to be surnames, your task is a lot simpler, since now all you have to is get the cmp operator to behave sensibly. That you can do easily enough this way:

use Unicode::Collate; @sorted_text = Unicode::Collate::->new->sort(@text);

Sorting Text, Not Binary

Imagine you had this list of German‐language authors:
@germans = qw{ Böll Born Böhme Bodmer Brandis Böttcher Borchert Bobrowski };
If you just sorted them with an unmediated sort operator, you would get this utter nonsense:
Bobrowski Bodmer Borchert Born Brandis Brant Böhme Böll Böttcher
Or maybe this equally nonsensical answer:
Bobrowski Bodmer Borchert Born Böll Brandis Brant Böhme Böttcher
Or even this still completely nonsensical answer:
Bobrowski Bodmer Borchert Born Böhme Böll Brandis Brant Böttcher
The crucial point to all that is that it’s text not binary, so not only can you never judge what it holds for its bit patterns just by eyeballing it, more importantly, it has special rules to make it sort alphabetically (some might say sanely), an ordering no naïve code‐point sort will never come even close to getting right, especially on Unicode. The correct ordering is of course:
Bobrowski Bodmer Böhme Böll Borchert Born Böttcher Brandis Brant
And that is precisely what
use Unicode::Collate; @sorted_germans = Unicode::Collate::->new->sort(@german_names);
gives you, a correctly sorted list of those Germans’ names.

Sorting German Names

Hold on, though. Correct in what language? In English, yes, the order given is now correct. But considering that these authors wrote in the German language, it is quite conceivable that you should be following the rules for ordering German names in German, not in English. That produces this ordering:

Bobrowski Bodmer Böhme Böll Böttcher Borchert Born Brandis Brant
So how come Böttcher now came before Borchert? Because Böttcher is supposed to be the same as Boettcher. That’s because in a German phonebook or other German list of German names, things like <ö> and <oe> are considered synonyms, which is not at all how it works in English. To get the German phonebook sort, you merely have to modify your constructor this way:
use Unicode::Collate::Locale; @sorted_germans = Unicode::Collate::Locale:: ->new(locale => "de_phonebook") ->sort(@german_names);

Isn’t this fun?

Be glad you’re not sorting names. Sorting names is hard.


Default Sort Tables

Here are most of the Latin letters, ordered using the default sort:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j 
k l m n o p q r s t u v w x y z ª º À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ 
ҠӠԠՠ֠ؠ٠ڠ۠ܠݠޠߠà á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö 
ø ù ú û ü ý þ ÿ Ā ā Ă ă Ą ą Ć ć Ĉ ĉ Ċ ċ Č č Ď ď Đ đ Ē ē Ĕ ĕ Ė ė Ę ę Ě ě 
Ĝ ĝ Ğ ğ Ġ ġ Ģ ģ Ĥ ĥ Ħ ħ Ĩ ĩ Ī ī Ĭ ĭ Į į İ ı IJ ij Ĵ ĵ Ķ ķ ĸ Ĺ ĺ Ļ ļ Ľ ľ Ŀ 
ŀ Ł ł Ń ń Ņ ņ Ň ň Ŋ ŋ Ō ō Ŏ ŏ Ő ő Œ œ Ŕ ŕ Ŗ ŗ Ř ř Ś ś Ŝ ŝ Ş ş Š š Ţ ţ Ť 
ť Ŧ ŧ Ũ ũ Ū ū Ŭ ŭ Ů ů Ű ű Ų ų Ŵ ŵ Ŷ ŷ Ÿ Ź ź Ż ż Ž ž ſ ƀ Ɓ Ƃ ƃ Ƈ ƈ Ɖ Ɗ Ƌ 
ƌ ƍ Ǝ Ə Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ ƙ ƚ ƛ Ɯ Ɲ ƞ Ƥ ƥ Ʀ ƫ Ƭ ƭ Ʈ Ư ư Ʊ Ʋ Ƴ ƴ Ƶ ƶ Ʒ Ƹ 
ƹ ƺ ƾ ƿ DŽ Dž dž LJ Lj lj NJ Nj nj Ǎ ǎ Ǐ ǐ Ǒ ǒ Ǔ ǔ Ǖ ǖ Ǘ ǘ Ǚ ǚ Ǜ ǜ ǝ Ǟ ǟ Ǡ ǡ Ǣ ǣ 
Ǥ ǥ Ǧ ǧ Ǩ ǩ Ǫ ǫ Ǭ ǭ Ǯ ǯ ǰ DZ Dz dz Ǵ ǵ Ƿ Ǹ ǹ Ǻ ǻ Ǽ ǽ Ǿ ǿ Ȁ ȁ Ȃ ȃ Ȅ ȅ Ȇ ȇ Ȉ 
ȉ Ȋ ȋ Ȍ ȍ Ȏ ȏ Ȑ ȑ Ȓ ȓ Ȕ ȕ Ȗ ȗ Ș ș Ț ț Ȝ ȝ Ȟ ȟ Ƞ ȡ Ȥ ȥ Ȧ ȧ Ȩ ȩ Ȫ ȫ Ȭ ȭ Ȯ 
ȯ Ȱ ȱ Ȳ ȳ ȴ ȵ ȶ ȷ Ⱥ Ȼ ȼ Ƚ Ⱦ ɐ ɑ ɒ ɓ ɕ ɖ ɗ ɘ ə ɚ ɛ ɜ ɝ ɞ ɟ ɠ ɡ ɢ ɣ ɤ ɥ ɦ 
ɧ ɨ ɩ ɪ ɫ ɬ ɭ ɮ ɯ ɰ ɱ ɲ ɳ ɴ ɶ ɹ ɺ ɻ ɼ ɽ ɾ ɿ ʀ ʁ ʂ ʃ ʄ ʅ ʆ ʇ ʈ ʉ ʊ ʋ ʌ ʍ 
ʎ ʏ ʐ ʑ ʒ ʓ ʙ ʚ ʛ ʜ ʝ ʞ ʟ ʠ ʣ ʤ ʥ ʦ ʧ ʨ ʩ ʪ ʫ ˡ ˢ ˣ ᴀ ᴁ ᴂ ᴃ ᴄ ᴅ ᴆ ᴇ ᴈ ᴉ 
ᴊ ᴋ ᴌ ᴍ ᴎ ᴏ ᴑ ᴓ ᴔ ᴘ ᴙ ᴚ ᴛ ᴜ ᴝ ᴞ ᴟ ᴠ ᴡ ᴢ ᴣ ᴬ ᴭ ᴮ ᴯ ᴰ ᴱ ᴲ ᴳ ᴴ ᴵ ᴶ ᴷ ᴸ ᴹ ᴺ 
ᴻ ᴼ ᴾ ᴿ ᵀ ᵁ ᵂ ᵃ ᵄ ᵅ ᵆ ᵇ ᵈ ᵉ ᵊ ᵋ ᵌ ᵍ ᵎ ᵏ ᵐ ᵑ ᵒ ᵖ ᵗ ᵘ ᵙ ᵚ ᵛ ᵢ ᵣ ᵤ ᵥ ᵫ ᵬ ᵭ 
ᵮ ᵯ ᵰ ᵱ ᵲ ᵳ ᵴ ᵵ ᵶ Ḁ ḁ Ḃ ḃ Ḅ ḅ Ḇ ḇ Ḉ ḉ Ḋ ḋ Ḍ ḍ Ḏ ḏ Ḑ ḑ Ḓ ḓ Ḕ ḕ Ḗ ḗ Ḙ ḙ Ḛ 
ḛ Ḝ ḝ Ḟ ḟ Ḡ ḡ Ḣ ḣ Ḥ ḥ Ḧ ḧ Ḩ ḩ Ḫ ḫ Ḭ ḭ Ḯ ḯ Ḱ ḱ Ḳ ḳ Ḵ ḵ Ḷ ḷ Ḹ ḹ Ḻ ḻ Ḽ ḽ Ḿ 
ḿ Ṁ ṁ Ṃ ṃ Ṅ ṅ Ṇ ṇ Ṉ ṉ Ṋ ṋ Ṍ ṍ Ṏ ṏ Ṑ ṑ Ṓ ṓ Ṕ ṕ Ṗ ṗ Ṙ ṙ Ṛ ṛ Ṝ ṝ Ṟ ṟ Ṡ ṡ Ṣ 
ṣ Ṥ ṥ Ṧ ṧ Ṩ ṩ Ṫ ṫ Ṭ ṭ Ṯ ṯ Ṱ ṱ Ṳ ṳ Ṵ ṵ Ṷ ṷ Ṹ ṹ Ṻ ṻ Ṽ ṽ Ṿ ṿ Ẁ ẁ Ẃ ẃ Ẅ ẅ Ẇ 
ẇ Ẉ ẉ Ẋ ẋ Ẍ ẍ Ẏ ẏ Ẑ ẑ Ẓ ẓ Ẕ ẕ ẖ ẗ ẘ ẙ ẚ ẛ ẞ ẟ Ạ ạ Ả ả Ấ ấ Ầ ầ Ẩ ẩ Ẫ ẫ Ậ 
ậ Ắ ắ Ằ ằ Ẳ ẳ Ẵ ẵ Ặ ặ Ẹ ẹ Ẻ ẻ Ẽ ẽ Ế ế Ề ề Ể ể Ễ ễ Ệ ệ Ỉ ỉ Ị ị Ọ ọ Ỏ ỏ Ố 
ố Ồ ồ Ổ ổ Ỗ ỗ Ộ ộ Ớ ớ Ờ ờ Ở ở Ỡ ỡ Ợ ợ Ụ ụ Ủ ủ Ứ ứ Ừ ừ Ử ử Ữ ữ Ự ự Ỳ ỳ Ỵ 
ỵ Ỷ ỷ Ỹ ỹ K Å Ⅎ ⅎ Ⅰ Ⅱ Ⅲ Ⅳ Ⅴ Ⅵ Ⅶ Ⅷ Ⅸ Ⅹ Ⅺ Ⅻ Ⅼ Ⅽ Ⅾ Ⅿ ⅰ ⅱ ⅲ ⅳ ⅴ ⅵ ⅶ ⅷ ⅸ ⅹ ⅺ 
ⅻ ⅼ ⅽ ⅾ ⅿ ff fi fl ffi ffl ſt st A B C D E F G H I J K L M N O P 
Q R S T U V W X Y Z a b c d e f g h i j k l m n 
o p q r s t u v w x y z
As you can see, those letters are scattered all over the place. Sure, it’s not completely random, but it’s not useful either, because it is full of arbitrary placement that makes no alphabetical sense. That’s because it is not an alphabetic sort at all. However, with the special kind of sort I’ve just shown you above, the ones that call the sort method from the Unicode::Collate class, you do get an alphabetic sort. Using that method, the Latin letters I just showed you now come out in alphabetical order, which is like this:
a a A A ª ᵃ ᴬ á Á à À ă Ă ắ Ắ ằ Ằ ẵ Ẵ ẳ Ẳ â Â ấ Ấ ầ Ầ ẫ Ẫ ẩ Ẩ ǎ Ǎ å Å 
Å ǻ Ǻ ä Ä ǟ Ǟ ã Ã ȧ Ȧ ǡ Ǡ ą Ą ā Ā ả Ả ȁ Ȁ ȃ Ȃ ạ Ạ ặ Ặ ậ Ậ ḁ Ḁ æ Æ ᴭ ǽ Ǽ 
ǣ Ǣ ẚ ᴀ Ⱥ ᴁ ᴂ ᵆ ɐ ᵄ ɑ ᵅ ɒ b b B B ᵇ ᴮ ḃ Ḃ ḅ Ḅ ḇ Ḇ ʙ ƀ ᴯ ᴃ ᵬ ɓ Ɓ ƃ Ƃ c 
c ⅽ C C Ⅽ ć Ć ĉ Ĉ č Č ċ Ċ ç Ç ḉ Ḉ ᴄ ȼ Ȼ ƈ Ƈ ɕ d d ⅾ D D Ⅾ ᵈ ᴰ ď Ď ḋ 
Ḋ ḑ Ḑ ḍ Ḍ ḓ Ḓ ḏ Ḏ đ Đ ð Ð dz ʣ Dz DZ dž Dž DŽ ʥ ʤ ᴅ ᴆ ᵭ ɖ Ɖ ɗ Ɗ ƌ Ƌ ȡ ẟ e e E 
E ᵉ ᴱ é É è È ĕ Ĕ ê Ê ế Ế ề Ề ễ Ễ ể Ể ě Ě ë Ë ẽ Ẽ ė Ė ȩ Ȩ ḝ Ḝ ę Ę ē Ē ḗ 
Ḗ ḕ Ḕ ẻ Ẻ ȅ Ȅ ȇ Ȇ ẹ Ẹ ệ Ệ ḙ Ḙ ḛ Ḛ ᴇ ǝ Ǝ ᴲ ə Ə ᵊ ɛ Ɛ ᵋ ɘ ɚ ɜ ᴈ ᵌ ɝ ɞ ʚ ɤ 
f f F F ḟ Ḟ ff ffi ffl fi fl ʩ ᵮ ƒ Ƒ ⅎ Ⅎ g g G G ᵍ ᴳ ǵ Ǵ ğ Ğ ĝ Ĝ ǧ Ǧ ġ Ġ ģ 
Ģ ḡ Ḡ ɡ ɢ ǥ Ǥ ɠ Ɠ ʛ ɣ Ɣ h h H H ᴴ ĥ Ĥ ȟ Ȟ ḧ Ḧ ḣ Ḣ ḩ Ḩ ḥ Ḥ ḫ Ḫ ẖ ħ Ħ ʜ 
ƕ ɦ ɧ i i ⅰ I I Ⅰ ᵢ ᴵ í Í ì Ì ĭ Ĭ î Î ǐ Ǐ ï Ï ḯ Ḯ ĩ Ĩ İ į Į ī Ī ỉ Ỉ ȉ 
Ȉ ȋ Ȋ ị Ị ḭ Ḭ ⅱ Ⅱ ⅲ Ⅲ ij IJ ⅳ Ⅳ ⅸ Ⅸ ı ɪ ᴉ ᵎ ɨ Ɨ ɩ Ɩ j j J J ᴶ ĵ Ĵ ǰ ȷ ᴊ 
ʝ ɟ ʄ k k K K K ᵏ ᴷ ḱ Ḱ ǩ Ǩ ķ Ķ ḳ Ḳ ḵ Ḵ ᴋ ƙ Ƙ ʞ l l ⅼ L L Ⅼ ˡ ᴸ ĺ Ĺ 
ľ Ľ ļ Ļ ḷ Ḷ ḹ Ḹ ḽ Ḽ ḻ Ḻ ł Ł ŀ Ŀ lj Lj LJ ʪ ʫ ʟ ᴌ ƚ Ƚ ɫ ɬ ɭ ȴ ɮ ƛ ʎ m m ⅿ M 
M Ⅿ ᵐ ᴹ ḿ Ḿ ṁ Ṁ ṃ Ṃ ᴍ ᵯ ɱ n n N N ᴺ ń Ń ǹ Ǹ ň Ň ñ Ñ ṅ Ṅ ņ Ņ ṇ Ṇ ṋ Ṋ ṉ 
Ṉ nj Nj NJ ɴ ᴻ ᴎ ᵰ ɲ Ɲ ƞ Ƞ ɳ ȵ ŋ Ŋ ᵑ o o O O º ᵒ ᴼ ó Ó ò Ò ŏ Ŏ ô Ô ố Ố ồ 
Ồ ỗ Ỗ ổ Ổ ǒ Ǒ ö Ö ȫ Ȫ ő Ő õ Õ ṍ Ṍ ṏ Ṏ ȭ Ȭ ȯ Ȯ ȱ Ȱ ø Ø ǿ Ǿ ǫ Ǫ ǭ Ǭ ō Ō ṓ 
Ṓ ṑ Ṑ ỏ Ỏ ȍ Ȍ ȏ Ȏ ớ Ớ ờ Ờ ỡ Ỡ ở Ở ợ Ợ ọ Ọ ộ Ộ œ Œ ᴏ ᴑ ɶ ᴔ ᴓ p p P P ᵖ 
ᴾ ṕ Ṕ ṗ Ṗ ᴘ ᵱ ƥ Ƥ q q Q Q ʠ ĸ r r R R ᵣ ᴿ ŕ Ŕ ř Ř ṙ Ṙ ŗ Ŗ ȑ Ȑ ȓ Ȓ ṛ 
Ṛ ṝ Ṝ ṟ Ṟ ʀ Ʀ ᴙ ᵲ ɹ ᴚ ɺ ɻ ɼ ɽ ɾ ᵳ ɿ ʁ s s S S ˢ ś Ś ṥ Ṥ ŝ Ŝ š Š ṧ Ṧ ṡ 
Ṡ ş Ş ṣ Ṣ ṩ Ṩ ș Ș ſ ẛ ß ẞ st ſt ᵴ ʂ ʃ ʅ ʆ t t T T ᵗ ᵀ ť Ť ẗ ṫ Ṫ ţ Ţ ṭ Ṭ 
ț Ț ṱ Ṱ ṯ Ṯ ʨ ƾ ʦ ʧ ᴛ ŧ Ŧ Ⱦ ᵵ ƫ ƭ Ƭ ʈ Ʈ ȶ ʇ u u U U ᵘ ᵤ ᵁ ú Ú ù Ù ŭ Ŭ 
û Û ǔ Ǔ ů Ů ü Ü ǘ Ǘ ǜ Ǜ ǚ Ǚ ǖ Ǖ ű Ű ũ Ũ ṹ Ṹ ų Ų ū Ū ṻ Ṻ ủ Ủ ȕ Ȕ ȗ Ȗ ư Ư 
ứ Ứ ừ Ừ ữ Ữ ử Ử ự Ự ụ Ụ ṳ Ṳ ṷ Ṷ ṵ Ṵ ᴜ ᴝ ᵙ ᴞ ᵫ ʉ ɥ ɯ Ɯ ᵚ ᴟ ɰ ʊ Ʊ v v ⅴ V 
V Ⅴ ᵛ ᵥ ṽ Ṽ ṿ Ṿ ⅵ Ⅵ ⅶ Ⅶ ⅷ Ⅷ ᴠ ʋ Ʋ ʌ w w W W ᵂ ẃ Ẃ ẁ Ẁ ŵ Ŵ ẘ ẅ Ẅ ẇ Ẇ ẉ 
Ẉ ᴡ ʍ x x ⅹ X X Ⅹ ˣ ẍ Ẍ ẋ Ẋ ⅺ Ⅺ ⅻ Ⅻ y y Y Y ý Ý ỳ Ỳ ŷ Ŷ ẙ ÿ Ÿ ỹ Ỹ ẏ 
Ẏ ȳ Ȳ ỷ Ỷ ỵ Ỵ ʏ ƴ Ƴ z z Z Z ź Ź ẑ Ẑ ž Ž ż Ż ẓ Ẓ ẕ Ẕ ƍ ᴢ ƶ Ƶ ᵶ ȥ Ȥ ʐ ʑ 
ʒ Ʒ ǯ Ǯ ᴣ ƹ Ƹ ƺ ʓ ȝ Ȝ þ Þ ƿ Ƿ
Isn’t that much nicer?


Romani Ite Domum

In case you’re wondering what that last row of distinctly un‐Roman Latin letters might possibly be, they’re called respectively ezh ʒ, yogh ȝ, thorn þ, and wynn ƿ. They had to go somewhere, so they ended up getting stuck after ‹z›.

Some are still used in certain non‐English (but still Latin) alphabets today, such as Icelandic, and even though you probably won’t bump into them in contemporary English texts, you might see some if you’re reading the original texts of famous medieval English poems like Beowulf, Sir Gawain and the Green Knight, or Brut.

The last of those, Brut, was written by a fellow named Laȝamon, a name whose third letter is a yogh. Famous though he was, I wouldn’t suggest changing your name to ‹Laȝamon› in his honor, as I doubt the phone company would be amused.

Replies are listed 'Best First'.
Re^2: best sort
by BrowserUk (Patriarch) on Aug 15, 2011 at 01:05 UTC

    Calling sort without a comparison function is nearly always the wrong thing to do. However, I can think of two exceptions to this.

    The first exception is when you have no other goal but to iterate in an order that is merely repeatable, even if that order should happen to completely arbitrary.

    Do you have any statistics -- real or made up -- about the percentage of sorts performed by programs in Perl (or any or all programming languages) that do not fit the criteria of your first "exception"?

    Because after ~30 years of programming in which my gut feel is that 90% or there about of my programs have included a sort for some purpose or another, I can remember just 2 occasions when language dictionary ordering was a goal. One was an internal phone directory. The other was the 6 million records of West African Examination Council University entrance examination results.

    At ~300 people, the first was sufficiently short that whether 'Mac...' came before, after or adjacent to 'Mc...' was irrelevant.

    In the latter, which encompasses names from 5 countries: Ghana, Liberia, Nigeria, Sierra Leone, and the Gambia, with all their combined mixes of ethnic, religious, tribal and imperial languages, cultures and backgrounds, meant that it was impossible to get them to agree on a single correct collation. Hence, the reports produced for each country, and even different areas within the same country, all had to be ordered differently according to a custom collation.

    That was a good many years ago, but I guarantee that there isn't anything on CPAN that would have handled any single one of the dozen or so collation sequences required.

    My personal gut-feel assessment is that for 95% (and probably more) uses of sort in programs, the actual collation ordering used is immaterial. What's yours?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I almost never deal with people’s names; I think doing so is pretty rare. However, except for the Mc/Mac thing, whose solution I didn’t actually include because it requires more explanation than I cared to go into then (you need to supply a preprocess transform to the constructor), all the rest is pretty standard text stuff.

      Almost the only time I sort stuff is for output. Otherwise order is probably immaterial. And if I’m sorting for output, I want to be able to look at text in the order that English has used for hundreds of years to sort its text. I do not want to look at things in random code order; down that road lies madness.

      You can look at the the default UCA sort as a very good approximation to historical order of text. No, it isn’t perfect, but everyone is equally disadvantaged. For example, it doesn’t do the English Mc/Mac trick unassisted.

      What is does do, though, seems to be the logical thing for text in a way that code point order never is. Explaining it makes it seem harder than it is, but it isn’t hard; it’s what makes sense for text, and it’s how we’ve done it forever.

      This is a little bit of a simplification, but it works essentially this way:

      1. Primary strength
      Compare to see whether the basic letters are the same. Ignore nonletters at this stage; just skip ahead till you find a letter. If the letters aren’t the same for the same relative position, there is an established dictionary order about what goes first. If you are a user of the Latin alphabet, this will be in the order of the abc’s you learned in school, so Fred comes before freedom, as does free beer. The reason it put free beer in front of freedom is because the fifth letter in the first string is b, and that comes before the fifth letter in the second string, which is d. See how that works? That’s dictionary order. We aren’t doing a field sort here.
      2. Secondary strength
      If the letters are the same, then check whether the diacritics are the same. By default we resolve ties by looking at the diacritics reading left to right, but this can be flipped to do so right to left to keep the French less unhappy. (The classic demo is that normal LTR tie‐breaking order sorts cote < coté < côte < côté, whereas the French RTL tie‐breaking order for diacritics sorts cote < côte < coté < côté. Yes, really, and I’m sorry, but it’s truly not my fault. It has to do with their inflectional morphology, which is tail‐based.)
      3. Tertiary strength
      If the letters and the diacritics are the same, then check whether the case is the same. By default, lowercase precedes uppercase, but this is easy to flip.
      4. Quaternary strength
      If the letters, the diacritics, and the case are all the same for a given position, now you may go back and reconsider any nonletters, like punctuation and symbols and whitespace.

      You don’t have to do all those if you don’t want. You can for example tell it to use only the primary strength, which only considers basic letters and absolutely nothing else, even case. (That’s how you do an “accent‐insensitive” string comparison, BTW, using your collator object’s eq method.) If you wanted it to ignore case but consider accents for level one ties, just set it to do only the first two stages and skip the rest.

      You can even define more levels that the standard four if you want, and all kinds of things can be tweaked, but the default is usually good enough. This works how pretty well for English text, and it still works even if it has non‐Latin mixed in with it here and there, since Greek letters would sort within their own alphabet, &c.

      Strings like Joe Bob, Joe‐Bob, Joebob, and Jo E. Bob are all the same at the first two levels, and only diverge after that, with case differences at the third level and nonletters at the fourth. Yes, this really does make it easier to read lists of text. I have processed thousands and thousands of output reports of English text this way, and it really does.

      If you don’t like how it by default ignores nonletters (which is a bit misleading of me to put it that way, since numbers come before letters and such; it isn’t as stupid as I may have made it sound) at the primary comparison stage, that is tweakable.

      I guess one way to think of the default UCA sort in Perl pseudocode would be:

      primary($a)    <=>   primary($b)  || secondary($a)  <=>   secondary($b)  || tertiary($a)   <=>   tertiary($b)  || quaternary($a) <=>   quaternary($b) 
      Like the code above, the multilevel sort short circuits when it finds a difference between its operands at the level. You can set it to do only N levels if you want. That presumes that those functions are defined to return the right sort of magic number. It doesn’t account for how the default ignorables get skipped unless you manage to fall through to the fourth level. Remember that letter positions aren’t the same as equivalent indices of arrays.

      I bet I’ve managed to convince everyone that nobody needs anything as complicated as this and that you should all go back to garbage ordering. That certainly isn’t my intent. If you just try the collator object’s default sort on text the next time you have to sort some text, without any tweaking or anything, I think you will be aesthetically pleased with the results. More pleased, in fact, than you are(n’t) by the code point garbage order you get with the unmodified builtin sort.

      In other words: “Try it, you’ll like it!”

      DISCLAIMER:
      As always, your mileage may vary — except of course in Europe where this statement is illegal, in which case you didn’t read it anyway so no harm no foul.
      I thought about it a little more, and decided to revise my answer a bit so I could provide three scenarios where a bare sort might be called for, and also to give a shallower ramp‐up before plunging right into it all.

        I think that the changes are an improvement. But only a slight one.

        Essentially, I think by overstating the case, you are diluting its impact.

        You use emotive terms: "garbage ordering". You make it sound as if you believe that 99% of all uses of sort should use your module and pour scorn on the idea that anyone might not. All this despite that people have been successfully -- enough for virtually all requirements -- been sorting data, including text, for 30 years without it.

        For that, (IMO) very small -- perhaps 1% or 2% -- of cases where people might actually be offended by the relative ordering of accented characters, your module may be valuable. The rest of the time it is barely more than an expensive nice-to-have.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: best sort
by ag4ve (Monk) on Aug 15, 2011 at 02:51 UTC

    thanks for the input. i think the main point of it was: normalize the data so that it is easier to sort and go from there - don't try to reinvent the wheel, but make sure you allow it to rotate as smoothly as possible.

    @BrowserUk - i think @tchrist did point to edge cases. however, it did more thoroughly present the point - i don't know that i would have understood what he was getting at if he would have just presented his point in the first sentence (which was his point).

    so, i will go and see how to best accomodate sort with my hash data :)

Re^2: best sort
by jpl (Monk) on Aug 16, 2011 at 13:24 UTC
    Calling sort without a comparison function is quite often the wrong thing to do, even on plain text.

    There are a couple things that sorting can do for you, but they won't necessarily happen if you supply your own comparison function.

    • You can tell if some string you are interested would come before or after an arbitrary item in the sorted output.
    • All "identical" items will appear together.

    Those of us accustomed to dealing solely with 7-bit English characters take advantage of these properties which fall out nicely with the default comparison function. For example, if I am looking for the word "macnulty" and I find "machinery" in the sorted output I know I must look later in the list, and I know that "mable" cannot appear following either. I further know that all occurrences of "machinery" appear together, so, having found "machinery" and something else, I will never again see "machinery" in the list. This is true of all prefixes of strings as well as complete strings.

    When one ventures outside of 7-bit code points, as a good citizen of the world must, or into domain-specific applications, like surnames, the "proper" sort order may not preserve these properties. "machinery" may follow both "macnulty" and "mable". I might encounter "mcnulty" between two distinct occurrences of "macnulty" unless someone has been careful to tie-break nominally (an appropriate term in this case) identical surnames using something very like the default comparison function.

    The comparison function is a contract, of sorts, between the producer of and the consumers of the sorted output. If the producer and consumers have different expectations, something unseemly is likely to occur. One of my first assignments was to sort white pages listings using rules that were so complex (where does "General John Smith Jr" sort relative to "Doctor John J Smith III"?) that I could be certain that the average white pages user had no clue what the rules actually were. Fortunately, most lists of names were short enough that users could spot the right name without understanding the sort order. Sorting has its place, but indexing on N-grams is proving to be a more user-friendly search mechanism.

      There are a couple things that sorting can do for you, but they are won't necessarily happen if you supply your own comparison function.
      • You can tell if some string you are interested would come before or after an arbitrary item in the sorted output.
      • All "identical" items will appear together.
      As you say, it won’t necessarily happen, but it usually does, because you know what your comparison function is doing.
      Tail:
      menorrhoeic pinnisected chimaerid weighable foregone plastochrone Alf gravamen hemen preabdomen metanomen praenomen Carmen acumen tegumen bitumen prolusion Malabar watercress usheress speechcraft unshrew windingly xyzzy
      Length:
      Alf hemen xyzzy acumen Carmen bitumen Malabar tegumen unshrew foregone gravamen usheress chimaerid metanomen praenomen prolusion weighable windingly preabdomen watercress menorrhoeic pinnisected speechcraft plastochrone
      By vowels:
      Alf Malabar gravamen Carmen watercress praenomen plastochrone acumen metanomen preabdomen hemen speechcraft weighable menorrhoeic tegumen chimaerid pinnisected windingly bitumen foregone prolusion unshrew usheress xyzzy
      By consonants:
      bitumen chimaerid acumen Carmen foregone gravamen hemen Alf Malabar menorrhoeic metanomen unshrew plastochrone pinnisected preabdomen prolusion praenomen usheress speechcraft tegumen weighable windingly watercress xyzzy
      Syllable count:
      Alf Carmen foregone hemen speechcraft unshrew acumen bitumen chimaerid gravamen Malabar plastochrone praenomen prolusion tegumen usheress watercress weighable windingly metanomen preabdomen menorrhoeic pinnisected xyzzy
      Vowel count:
      menorrhoeic weighable metanomen praenomen preabdomen chimaerid plastochrone pinnisected foregone prolusion Malabar gravamen speechcraft watercress tegumen usheress windingly acumen bitumen hemen xyzzy Carmen unshrew Alf
      Consonant density:
      xyzzy windingly acumen Alf bitumen Carmen chimaerid foregone gravamen hemen Malabar menorrhoeic metanomen pinnisected plastochrone praenomen preabdomen prolusion speechcraft tegumen unshrew usheress watercress weighable
      Code point summation:
      Alf hemen Carmen xyzzy acumen Malabar bitumen tegumen unshrew gravamen foregone usheress chimaerid weighable metanomen praenomen windingly prolusion preabdomen watercress speechcraft pinnisected menorrhoeic plastochrone
      Of those, about the only one I can’t really pretty much eyeball is the last one, because I know what I am comparing and how. If those were the things I was looking into, I would certainly not want a bare sort for any of them.
        If those were the things I was looking into, I would certainly not want a bare sort for any of them.

        If you anticipate repeated elements, you'd probably want a tie-breaking

        $a cmp $b
        (the bare sort comparison function) on all but the first, because non-identical terms can otherwise compare equal and identical elements may not be adjacent in the sorted output. But I think we are in fundamental agreement: You need to know what you are comparing and how. That may be easier said than done. Knowing that "machenry" is a Scottish name, but "machinery" is not (or is it, and, if not, why not) makes "knowing what you are comparing" non-trivial.
Re^2: best sort
by BrowserUk (Patriarch) on Aug 21, 2011 at 11:05 UTC

    It is arguable that the most experienced body when it comes to dealing with the problems of different character sets, diacriticals and their influence on collation ordering is the European Union.

    Here are the official EU collation sequences for its member countries:

    • Bulgarian: АБВ Г Д ЕЖЗИЙ К Л МН ОП РС Т УФХЦЧШЩЪЬЮЯ
    • Serbia & Montenegro: AÁ BCČ DĎ EÉEFG H IÍ JK L MNŇ OÓ P RŘSŠ TŤ UÚUŮVWXY ZŽ
    • Denmark: A BC D E FG H I JK L MN O PQR S T U VWXY ZÆØÅ
    • Germany: A BC D E FG H I JK L MN O PQR S T U VWXY Z
    • Estonia: A BC D E FG H I JK L MN O PQR SŠZŽT U VWÕÄÖÜXY
    • Greece: Α ΒΓ Δ ΕΖ ΗΘ Ι ΚΛ ΜΝΞ ΟΠ ΡΣ Τ ΥΦΧΨCEGIPRSU
    • United Kingdom: A BC D E FG H I JK L MN O PQR S T U VWXY Z
    • Spain: A BC D E FG H I JK L MN O PQR S T U VWXY Z
    • France: A BC D E FG H I JK L MN O PQR S T U VWXY Z
    • Ireland: AÁ BC D E FG H IÍ L MN OÓ PQR S T UÚ VWXY Z
    • Italy: A BC D E FG H I JK L MN O PQR S T U VWXY Z
    • Lithuania: AĄ BCČ D EĘĖFG H IĮYJK L MN O P R SŠ T UŲŪ V ZŽ
    • Latvia: AĀ BCČ D EĒ FGĢ H IĪ JKĶLĻ MNŅ OO P R SŠ T UŪ V ZŽ
    • Hungary: AÁ BCCsDDzDzsEÉ FG H IÍ JK L MNNyOÓÖŐPQR SSz TTyUÚÜŰV ZZs
    • Malta: A BĊ D E FĠGGħHĦ IIEJK L MN O PQR S T U VWX ŻZ
    • Holland: A BC D E FG H I JK L MN O PQR S T U VWXY Z
    • Poland: AA BCĊ D EĘ FG H I JK LŁ MNŃ OÓ P R SŚ T U W Y ZŻŹ
    • Portugal: A BC D E FG H I JK L MN O PQR S T U VWXY Z
    • Romania: AĂÂBC D E FG H IÎ JK L MN O P R SŞ TŢ U V XY Z
    • Slovak Republic: AÁÄBCČ DĎDzDžEÉ FG HChIÍ JK LĹĽMNŇ OÓÔ PQRŔSŠ TŤ UÚ VWXYÝZŽ
    • Slovenia: A BCČ D E FG H I JK L MN O P R SŠ T U V ZŽ
    • Sweden: A BC D E FG H I JK L MN O PQR S T U VWXY Z ÅÄÖ
    • Finland: A BC D E FG H I JK L MN O PQR S T U VWXY Z ÄÖ

    As best I can tell, your invented ordering would be incorrect for every country excepting possibly the UK. In particular note how Denmark, Sweden and Finland order those characters with diacriticals at the end.

    Over thirty years ago (or more, its not clear), realising that there is no way to resolve the disparate expectations of all the member countries, the EU took the pragmatic approach to solving this problem.

    Using Accented and Other Special Characters in Searching

    The EU Inventories contain data in all Community languages except Greek and many of these languages contain accented characters in their alphabet.

    All words containing accented characters are displayed as such in both WinSPIRS and WebSPIRS. For the former, you may need to choose a font other than the default font if it does not support the ISO 8859-1 (Latin alphabet No. 1) character set (known elsewhere in this database compendium as ISO Latin-1) for display/printing. All words containing accented or foreign characters (as well as a to z and A to Z) are converted to their upper case equivalents and then indexed as such. The collating sequence chosen for all indices in all languages is that for ISO Latin-1 except that all terms beginning with a numeric character appear at end. This has been done to provide ease and consistency in a multi-lingual and multi-database (i.e. when two or more databases from different languages are selected for retrieval) environment.

    The actual collating sequence or character order in all indices is:

    -, ., A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, À, Á, Â, Ã, Ä, Å, Æ, Ç, È, É, Ê, Ë, Ì, Í, Î, Ï, Ñ, Ò, Ó, Ô, Õ, Ö, Ø, Ù, Ú, Û, Ü, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Several of those sequences are demonstrably wrong: they make a claim for a national collating sequence that is specifically counter to the ordering specifically prescribed by monolingual language books in that language. For example, ninuco would collate before niño in Spain, not after it.

      I refuse to take seriously some Stone Age directive to pretend everything is ISO 9959-1. That has no relevance to Unicode whatsoever. It is stupid and wrong, and indeed offensive. Your lists are useless to the point of risibility. We don't live in a Latin-1 world anymore, and indeed never did.

      What an utter waste of time.

        I refuse to take seriously some Stone Age directive to pretend everything is ISO 9959-1.

        And that is exactly what is wrong with Unicode. It was formulated by American Companies, for American Companies in their typical "We'll right the world's wrongs and they'll see the superiority of our edicts" mentality. You doubt this, look up HAN unification

        Well guess what! Whilst you're making shit up and trying foist your woefully incomplete -- you've still to answer how you're going to work the cyrillic, Indus, HAN and all the other non-roman scripts into your magical world of -- "unified collation ordering" upon the world, some of us having been getting on with the pragmatic process of getting things to work in the real world. For 30 years or more.

        Your lists are useless to the point of risibility. We don't live in a Latin-1 world anymore, and indeed never did.

        What an utter waste of time.

        Those lists and quotes are taken directly from the EU's website. Live, current, mandated by EU law and working across 27 countries, 500 million people, and 25% of the global economy.

        And you're missing the point. The text doesn't have to be restricted to or stored as Latin-1. It is just collated and indexed that way.

        Jump up and down all you like, we'll see whether your ideas still persist say 30 years from now. Like I said waaay back there, history will tell.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://920264]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chilling in the Monastery: (5)
As of 2024-03-29 13:08 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found