sorting Chinese characters

larryk has asked for the wisdom of the Perl Monks concerning the following question:

Hi! It has been some time since my last post but I haven't forgotten about you guys :)

Bit of an oddball query, this one... but I bet someone knows one or two ways to do it.

I'm creating search indexes with Perl for my Chinese dictionary mobile app and am hitting a problem where the (Perl) sorted keys - Chinese characters - are not in exactly the same order as the binary search on-device is expecting them to be.

In real terms, there are 165910 index records and 188 of them are unreachable on-device because the sort order is slightly different between Perl's standard string sort and C#'s String.Compare function (with "en-US" culture).

I've played around for weeks with the culture settings and this 188 unreachable number is the optimal result I have achieved. So 2 questions:

1. (long shot) has anyone seen this issue before and knows the magic incantation to get Perl and C# to agree 100% on sort order for UTF-8?

2. (failing that) how do I get Perl to sort by Unicode code point (i.e. the raw underlying \u{xxxx} value)? I can probably force C# to do it that way as an exception for this index.

Any help much appreciated.

   larryk                                          
perl -le "s,,reverse killer,e,y,rifle,lycra,,print"

Re: sorting Chinese characters by choroba (Cardinal) on Feb 01, 2013 at 10:28 UTC
Which modules and functions from Task::Unicode have you used?
لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re^2: sorting Chinese characters by larryk (Friar) on Feb 02, 2013 at 03:28 UTC
I found and tried `Unicode::Collate::Locale->new(locale => 'en-US')->sort()` instead of regular Perl `sort()` and that has improved the ratio of reachable to unreachable keys in the index. Now there are only 126 unreachable (compared to 188 earlier) out of a total of ~166k entries. It's good, but not good enough. I could live with 0.1% droppage from the index except that one of the dropped keys is '⼀', which is the Chinese character meaning 'one'. That's a noddy mistake in a Chinese-English dictionary application, so I still need to fix it. Problem is that I think I'm almost at the "this must be a subtle difference in the Unicode tables of Perl and C#" stage, so perhaps you have some specific experience sorting Unicode, or were you just suggesting I look at this package of libraries?

   larryk
perl -le "s,,reverse killer,e,y,rifle,lycra,,print"
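To see which keys actually move between the two orderings, a minimal sketch along these lines might help. It assumes the keys are already decoded character strings, and `keys_that_move` is a hypothetical helper name; the collator call is the same one quoted above. The demo uses ASCII keys only so the result is easy to check by eye.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Return the keys (in built-in sort order) whose position differs
# between Perl's built-in sort and the collator's sort.
# Hypothetical helper, for illustration only.
sub keys_that_move {
    my @keys     = @_;
    my @plain    = sort @keys;
    my @collated = Unicode::Collate::Locale->new(locale => 'en-US')->sort(@keys);
    my @moved    = map  { $plain[$_] }
                   grep { $plain[$_] ne $collated[$_] } 0 .. $#plain;
    return @moved;
}

# Built-in sort gives "A B a b"; the collator puts lowercase first,
# giving "a A b B", so every key changes position.
print join(' ', keys_that_move(qw(b A a B))), "\n";   # prints "A B a b"
```

Running the same helper over the real index keys should produce a list that can be compared against the keys the device cannot reach.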
Re^3: sorting Chinese characters by choroba (Cardinal) on Feb 02, 2013 at 09:31 UTC
Why are you using English locale for Chinese? It would not be surprising if the Unicode support was weaker in C# than in Perl, but I have no experience with C# and it is not mentioned in Unicode Good, Bad, & Ugly. I have no experience with sorting Chinese. I have studied articles like What's wrong with sort and how to fix it, though, and at work I am dealing mostly with Czech, which fortunately uses Latin letters (plus some less common diacritics like ř or ů), but whose "official" sorting algorithm is unfortunately practically unimplementable (e.g. numbers should be sorted as pronounced).
لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re^4: sorting Chinese characters by larryk (Friar) on Feb 12, 2013 at 05:31 UTC
Re: sorting Chinese characters by Anonymous Monk on Feb 01, 2013 at 10:41 UTC
Re: best sort, Unicode::Collate, Unicode::ICU::Collator, sort unicode, sort chinese, Sorting according to locale collation, Sort::ArbBiLex, Sorting (GRT) and locale issues, Sort::Key::Multi
Re: sorting Chinese characters by punch_card_don (Curate) on Feb 01, 2013 at 13:39 UTC
If it were me, I think I'd step back from the problem a moment and ask myself if trying to synchronize two independent sorting mechanisms is really what I want. Sounds like a potential source of ongoing headaches. Would assigning a unique identifier that defines order on both systems work in the given situation?
Time flies like an arrow. Fruit flies like a banana.
Re^2: sorting Chinese characters by larryk (Friar) on Feb 02, 2013 at 04:00 UTC
The key in the index is unique, and the Unicode standard is supposed to define the order. I was rather hoping I could depend on that, since it is supposed to be a standard. Looks like my options are as follows. I could use the ordinal value of the characters and do a numeric sort instead of a textual sort. That's what I was suggesting in point 2 above. The only other option is to ditch my Perl index creation and rewrite it in C# so that the index creation and the index usage are both using exactly the same sorting library. Not sure I want to go there yet, though. Any other options?

   larryk
perl -le "s,,reverse killer,e,y,rifle,lycra,,print"
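For the ordinal option, here is a rough sketch, not tested against the actual index. On decoded character strings and without `use locale`, Perl's `cmp` compares code point by code point. An ordinal comparison in C# (for example `String.CompareOrdinal`) works on UTF-16 code units instead, which as far as I know gives a different order only when characters above U+FFFF are involved. The names `sort_by_codepoint`, `utf16_key` and `sort_by_utf16` are made up for illustration.

```perl
use strict;
use warnings;

# Code point order: plain cmp on decoded character strings.
sub sort_by_codepoint {
    my @keys   = @_;
    my @sorted = sort { $a cmp $b } @keys;
    return @sorted;
}

# Build a byte string of big-endian UTF-16 code units for one key,
# splitting code points above U+FFFF into surrogate pairs.
sub utf16_key {
    my ($str) = @_;
    my @units;
    for my $cp (map { ord($_) } split(//, $str)) {
        if ($cp > 0xFFFF) {
            my $v = $cp - 0x10000;
            push @units, 0xD800 + ($v >> 10), 0xDC00 + ($v & 0x3FF);
        }
        else {
            push @units, $cp;
        }
    }
    return pack('n*', @units);
}

# UTF-16 code unit order (assumed to match an ordinal comparison in C#).
sub sort_by_utf16 {
    my @keys   = @_;
    my @sorted = map  { $_->[0] }
                 sort { $a->[1] cmp $b->[1] }
                 map  { [ $_, utf16_key($_) ] } @keys;
    return @sorted;
}

my @demo = ("\x{20000}", "\x{FF21}", "\x{4E00}", "\x{2F00}");
print join(' ', map { sprintf('U+%04X', ord($_)) } sort_by_codepoint(@demo)), "\n";
# prints: U+2F00 U+4E00 U+FF21 U+20000
print join(' ', map { sprintf('U+%04X', ord($_)) } sort_by_utf16(@demo)), "\n";
# prints: U+2F00 U+4E00 U+20000 U+FF21
```

The two orders differ only for U+20000, which is stored as a surrogate pair starting with 0xD840 and so sorts before U+FF21 in UTF-16. If every key in the index is below U+10000, the two functions should return the same list.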

