Re^2: sorting Chinese characters

I found and tried Unicode::Collate::Locale->new(locale => 'en-US')->sort() instead of the regular Perl sort(), and that has improved the ratio of reachable to unreachable keys in the index. Now only 126 keys are unreachable (compared to 188 earlier) out of a total of ~166k entries.
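For reference, a minimal sketch of that swap. The sample keys are made up for illustration (the real index is not shown in this thread); the only thing taken from the post is the Unicode::Collate::Locale call itself.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical sample keys: a few Latin strings plus three Han characters,
# written as escapes so the file encoding does not matter.
my @keys = ('b', 'A', "\x{4E8C}", 'a', "\x{4E00}", 'B', "\x{4E09}");

# Plain sort() compares strings code point by code point,
# so every uppercase ASCII letter sorts before every lowercase one.
my @by_codepoint = sort @keys;

# Unicode::Collate::Locale applies the Unicode Collation Algorithm
# with the tailoring (if any) for the requested locale.
my $collator = Unicode::Collate::Locale->new(locale => 'en-US');
my @by_uca   = $collator->sort(@keys);

print "codepoint: @by_codepoint\n";
print "collated:  @by_uca\n";
```

The two orders differ even on this tiny list, which is why the side that reads the index has to compare keys the same way as the side that wrote it.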

It's good, but not good enough. I could live with ~0.1% droppage from the index, except that one of the dropped keys is '⼀', which is the Chinese character meaning 'one'. That's a noddy mistake in a Chinese-English dictionary application, so I still need to fix it.

The problem is that I think I'm almost at the "this must be a subtle difference in the Unicode tables of Perl and C#" stage. Do you have some specific experience sorting Unicode, or were you just suggesting I look at this package of libraries?

   larryk                                          
perl -le "s,,reverse killer,e,y,rifle,lycra,,print"


Re^3: sorting Chinese characters by choroba (Cardinal) on Feb 02, 2013 at 09:31 UTC
Why are you using an English locale for Chinese? It would not be surprising if Unicode support were weaker in C# than in Perl, but I have no experience with C#, and it is not mentioned in "Unicode Good, Bad, & Ugly".

I have no experience with sorting Chinese. I have studied articles like "What's wrong with sort and how to fix it", though, and at work I deal mostly with Czech, which fortunately uses Latin letters (plus some less common diacritics like ř or ů), but whose "official" sorting algorithm is unfortunately practically unimplementable (e.g. numbers should be sorted as pronounced).

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re^4: sorting Chinese characters by larryk (Friar) on Feb 12, 2013 at 05:31 UTC
I have three indexes and only one is Chinese. If I needed a 'display sort' (i.e. I was going to show the index keys to the user) then I'd specialise my indexing class and use a native locale for the key type. However, in this case, the keys only need to be internally consistent between index creation (Perl) and consumption (C#). I was expecting that Unicode support in the two languages would be sufficiently mature that I could rely on the defined standard.

As it turns out, when I re-sorted the index in a debugging session in C# and then diffed the Perl index vs. the C# index, there were fewer differences than unreachable keys. A large block of mis-sorted entries was disrupting the binary search for proximate entries (including 'one'), and then there were two or three Chinese characters which sorted differently between the two languages in a few places, which caused the rest of the dead-ends.

By excluding these few entries during index creation, I now have a 100% match. If I get some time it'll be interesting to find out why those few characters are really being sorted in a different order. Might be a bug in either C# or Perl.

Anyway, thanks for your help.

   larryk
perl -le "s,,reverse killer,e,y,rifle,lycra,,print"
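The "unreachable key" effect described above can be reproduced on a toy index. This is only a sketch under stated assumptions: unreachable_keys() is a hypothetical helper, the four keys are made up, and a case-insensitive versus code point comparison stands in for the real Perl versus C# collation difference, which this thread never pins down.

```perl
use strict;
use warnings;

# Return the keys that a binary search over @$index cannot find when the
# search compares with $cmp (a code ref returning <0, 0 or >0).
sub unreachable_keys {
    my ($index, $cmp) = @_;
    my @missing;
    for my $key (@$index) {
        my ($lo, $hi, $found) = (0, $#$index, 0);
        while ($lo <= $hi) {
            my $mid = int(($lo + $hi) / 2);
            my $c   = $cmp->($key, $index->[$mid]);
            if    ($c == 0) { $found = 1; last }
            elsif ($c < 0)  { $hi = $mid - 1 }
            else            { $lo = $mid + 1 }
        }
        push @missing, $key unless $found;
    }
    return @missing;
}

# Hypothetical index, written in case-insensitive order ...
my @index = sort { lc($a) cmp lc($b) } qw(apple Banana cherry Date);

# ... but searched by a consumer that compares raw code points.
my @lost = unreachable_keys(\@index, sub { $_[0] cmp $_[1] });
print "unreachable: @lost\n";    # prints "unreachable: apple Date"

# With the same comparator on both sides, every key is reachable.
my @none = unreachable_keys(\@index, sub { lc($_[0]) cmp lc($_[1]) });
print "unreachable with matching comparator: ", scalar(@none), "\n";
```

Running a check like this right after index creation, with a comparator that mimics the consumer, would list the problem keys without needing a separate diff of the two sorted indexes.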


	PerlMonks