Re: sorting Chinese characters

by choroba (Chancellor)
on Feb 01, 2013 at 10:28 UTC (#1016506)

in reply to sorting Chinese characters

Which modules and functions from Task::Unicode have you used?
لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

Replies are listed 'Best First'.
Re^2: sorting Chinese characters
by larryk (Friar) on Feb 02, 2013 at 03:28 UTC
    I found and tried Unicode::Collate::Locale->new(locale => 'en-US')->sort() instead of regular Perl sort() and that has improved the ratio of reachable to unreachable keys in the index. Now there are only 126 unreachable (compared to 188, earlier) out of a total of ~166k entries.
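
    For readers following along, the switch described above can be sketched roughly as below. The key list is made up for illustration and is not taken from the real index; the only calls used are the built-in sort and the Unicode::Collate::Locale constructor and sort method named in the post.

    ```perl
    use strict;
    use warnings;
    use utf8;
    use Unicode::Collate::Locale;

    # Hypothetical keys, chosen only to show that the two orderings differ.
    my @keys = ('b', 'a', 'B', 'A', 'é', 'e');

    # Built-in sort compares strings code point by code point.
    my @by_codepoint = sort @keys;

    # Collation follows the Unicode Collation Algorithm for the given locale.
    my $collator     = Unicode::Collate::Locale->new(locale => 'en-US');
    my @by_collation = $collator->sort(@keys);

    binmode STDOUT, ':encoding(UTF-8)';
    print "code point: @by_codepoint\n";   # A B a b e é
    print "collation:  @by_collation\n";   # a A b B e é
    ```

    Any consumer that searches the index has to compare keys with the same rules that were used to sort it, whichever of the two is chosen.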

    It's good, but not good enough. I could live with 0.1% of keys dropping out of the index, except that one of the dropped keys is '⼀', which is the Chinese character meaning 'one'. That's a noddy mistake in a Chinese-English dictionary application, so I still need to fix it.

    Problem is that I think I'm almost at the "this must be a subtle difference in the Unicode tables of Perl and C#" stage. So do you have some specific experience sorting Unicode, or were you just suggesting I look at this package of libraries?

    perl -le "s,,reverse killer,e,y,rifle,lycra,,print"
      Why are you using English locale for Chinese?

      It would not be surprising if the Unicode support was weaker in C# than in Perl, but I have no experience with C# and it is not mentioned in Unicode Good, Bad, & Ugly.

      I have no experience with sorting Chinese. I have studied articles like What's wrong with sort and how to fix it, though, and at work I am dealing mostly with Czech, which fortunately uses Latin letters (plus some less common diacritics like ř or ů), but whose "official" sorting algorithm is unfortunately practically unimplementable (e.g. numbers should be sorted as pronounced).

      لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
        I have three indexes and only one is Chinese. If I needed a 'display sort' (i.e. I was going to show the index keys to the user) then I'd specialise my indexing class and use a native locale for the key type. However, in this case, the keys only need to be internally consistent between index creation (Perl) and consumption (C#). I was expecting that Unicode support between the two languages would be sufficiently mature that I could rely on the defined standard.
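
        One way to get that kind of internal consistency, which is not what was done in this thread, is to avoid collation on both sides and sort by raw code units. The sketch below assumes the C# consumer can be made to use ordinal comparison (StringComparer.Ordinal), which compares UTF-16 code units. Perl's built-in sort compares code points, and the two orders differ once characters outside the BMP are involved, so the keys are compared by their UTF-16BE bytes. The three keys are hypothetical.

        ```perl
        use strict;
        use warnings;
        use Encode qw(encode);

        # Hypothetical keys: U+4E00 and U+FF21 are in the BMP, U+20000 is not.
        my @keys = ("\x{20000}", "\x{FF21}", "\x{4E00}");

        # Plain Perl sort: code point order.
        my @codepoint_order = sort @keys;

        # Sort by UTF-16BE bytes: byte order of big-endian code units
        # is the same as numeric order of the code units themselves.
        my @utf16_order =
            map  { $_->[1] }
            sort { $a->[0] cmp $b->[0] }
            map  { [ encode('UTF-16BE', $_), $_ ] } @keys;

        my $show = sub { join ' ', map { sprintf 'U+%04X', ord } @_ };
        print "code point: ", $show->(@codepoint_order), "\n";  # U+4E00 U+FF21 U+20000
        print "UTF-16:     ", $show->(@utf16_order), "\n";      # U+4E00 U+20000 U+FF21
        ```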

        As it turns out, when I re-sorted the index in a debugging session in C# and then diffed the Perl index against the C# index, there were fewer differences than unreachable keys. A large block of mis-sorted entries was disrupting the binary search for nearby entries (including 'one'), and two or three Chinese characters were sorting differently between the two languages in a few places, which caused the rest of the dead-ends.
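
        The effect described above, where a few mis-sorted entries make other keys unreachable, can be reproduced with a toy example. This is only a sketch: the word list and both comparators are stand-ins (case-insensitive order for the producer, code point order for the consumer), not the real Perl and C# orderings.

        ```perl
        use strict;
        use warnings;

        # Binary search with a caller-supplied comparator; returns index or -1.
        sub bsearch {
            my ($cmp, $list, $target) = @_;
            my ($lo, $hi) = (0, $#$list);
            while ($lo <= $hi) {
                my $mid = int(($lo + $hi) / 2);
                my $c   = $cmp->($list->[$mid], $target);
                return $mid if $c == 0;
                if ($c < 0) { $lo = $mid + 1 } else { $hi = $mid - 1 }
            }
            return -1;
        }

        # Producer side (hypothetical): index sorted case-insensitively.
        my @index = sort { lc($a) cmp lc($b) or $a cmp $b }
                    qw(apple Banana cherry Date elder);

        # Consumer side (hypothetical): searches using code point order.
        my $consumer_cmp = sub { $_[0] cmp $_[1] };

        # Keys present in the index that the consumer's search cannot find.
        my @unreachable = grep { bsearch($consumer_cmp, \@index, $_) < 0 } @index;
        print "unreachable: @unreachable\n";   # Banana Date
        ```

        Every key is in the list, yet two of the five cannot be found, because the search discards the half of the list that actually contains them.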

        By excluding these few entries during index creation, I now have 100% match.

        If I get some time it'll be interesting to find out why those few characters are really being sorted in a different order. Might be a bug in either C# or Perl.

        Anyway, thanks for your help.

        perl -le "s,,reverse killer,e,y,rifle,lycra,,print"
