PerlMonks  

Re^2: sorting Chinese characters

by larryk (Friar)
on Feb 02, 2013 at 03:28 UTC


in reply to Re: sorting Chinese characters
in thread sorting Chinese characters

I found and tried Unicode::Collate::Locale->new(locale => 'en-US')->sort() instead of regular Perl sort() and that has improved the ratio of reachable to unreachable keys in the index. Now there are only 126 unreachable (compared to 188, earlier) out of a total of ~166k entries.
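A minimal sketch of the swap described above, using three made-up ASCII keys rather than the real index. The key list and variable names are illustrative assumptions; only the `Unicode::Collate::Locale` call is taken from the post. It shows that plain `sort` orders by code point while the collator follows the Unicode Collation Algorithm, so the two can disagree even on simple input.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Hypothetical sample keys; the real index holds ~166k entries.
my @keys = ('b', 'a', 'B');

# Plain sort compares strings by code point, so 'B' (U+0042)
# comes before 'a' (U+0061).
my @by_codepoint = sort @keys;

# The collator orders by collation weights: letters first,
# with lowercase before uppercase as a tie-breaker.
my $collator     = Unicode::Collate::Locale->new(locale => 'en-US');
my @by_collation = $collator->sort(@keys);

print "code point: @by_codepoint\n";   # code point: B a b
print "collation:  @by_collation\n";   # collation:  a b B
```

Whichever order the index is written in, the consumer has to search it with a comparison that agrees with that order for every pair of keys.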

It's good, but not good enough. I could live with 0.1% droppage from the index, except that one of the dropped keys is '⼀', which is the Chinese character meaning 'one'. That's a noddy mistake in a Chinese-English dictionary application, so I still need to fix it.

Problem is that I think I'm almost at the "this must be a subtle difference in the Unicode tables of Perl and C#" stage, so perhaps you have some specific experience sorting Unicode, or were you just suggesting I look at this package of libraries?

   larryk                                          
perl -le "s,,reverse killer,e,y,rifle,lycra,,print"

Replies are listed 'Best First'.
Re^3: sorting Chinese characters
by choroba (Cardinal) on Feb 02, 2013 at 09:31 UTC
    Why are you using English locale for Chinese?

    It would not be surprising if the Unicode support was weaker in C# than in Perl, but I have no experience with C# and it is not mentioned in Unicode Good, Bad, & Ugly.

    I have no experience with sorting Chinese. I have studied articles like What's wrong with sort and how to fix it, though, and at work I am dealing mostly with Czech, which fortunately uses Latin letters (plus some less common diacritics like ř or ů), but whose "official" sorting algorithm is unfortunately practically unimplementable (e.g. numbers should be sorted as pronounced).

    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      I have three indexes and only one is Chinese. If I needed a 'display sort' (i.e. I was going to show the index keys to the user) then I'd specialise my indexing class and use a native locale for the key type. However, in this case, the keys only need to be internally consistent between index creation (Perl) and consumption (C#). I was expecting that Unicode support between the two languages would be sufficiently mature that I could rely on the defined standard.
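      When keys only need to be internally consistent, one option is to avoid collation tables altogether and sort by raw code values. The sketch below is an assumption, not what the thread actually did: it sorts keys by their UTF-16 code units, which is what an ordinal comparison in .NET (for example `String.CompareOrdinal`) uses. The helper name `sort_by_utf16_units` and the sample keys are hypothetical. Plain Perl `sort` compares by code point, which matches UTF-16 order only while all keys stay outside the supplementary planes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sort keys by their UTF-16 code units. Comparing the big-endian
# byte strings with cmp is equivalent to comparing 16-bit units.
sub sort_by_utf16_units {
    my @keys = @_;
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { [ encode('UTF-16BE', $_), $_ ] } @keys;
}

# Hypothetical keys: a fullwidth 'A' (U+FF21) and a CJK Extension B
# character (U+20000), which needs a surrogate pair in UTF-16.
my @keys = ("\x{FF21}", "\x{20000}");

my @by_codepoint = sort @keys;
my @by_utf16     = sort_by_utf16_units(@keys);

print join(' ', map { sprintf 'U+%04X', ord } @by_codepoint), "\n";  # U+FF21 U+20000
print join(' ', map { sprintf 'U+%04X', ord } @by_utf16), "\n";      # U+20000 U+FF21
```

      The two orders differ here because the surrogate pair for U+20000 starts with 0xD840, which is lower than 0xFF21.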

      As it turns out, when I re-sorted the index in a debugging session in C# and then diffed the Perl index vs. the C# index, there were fewer differences than unreachable keys. A large block of mis-sorted entries was disrupting the binary search for proximate entries (including 'one'), and two or three Chinese characters were sorting differently between the two languages in a few places, which caused the rest of the dead-ends.

      By excluding these few entries during index creation, I now have 100% match.
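      The effect described above, where entries sorted in one order become dead-ends for a binary search that uses another, can be reproduced with a small sketch. The function names `bsearch` and `unreachable_keys` and the four-key index are hypothetical, chosen only for illustration; the real check was done by diffing the Perl and C# indexes.

```perl
use strict;
use warnings;

# Binary search over @$index using comparator $cmp.
# Returns the position of $key, or -1 if the search dead-ends.
sub bsearch {
    my ($index, $key, $cmp) = @_;
    my ($lo, $hi) = (0, $#$index);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $c   = $cmp->($index->[$mid], $key);
        return $mid if $c == 0;
        if ($c < 0) { $lo = $mid + 1 } else { $hi = $mid - 1 }
    }
    return -1;
}

# Keys that are present in the index but cannot be found
# when the consumer searches with $cmp.
sub unreachable_keys {
    my ($index, $cmp) = @_;
    return grep { bsearch($index, $_, $cmp) < 0 } @$index;
}

# Index written in a collation-style order (lowercase before
# uppercase), then searched with a code point comparison.
my @index     = ('a', 'b', 'B', 'c');
my $codepoint = sub { $_[0] cmp $_[1] };

my @lost = unreachable_keys(\@index, $codepoint);
print "unreachable: @lost\n";   # unreachable: B
```

      Running the same check over a full index before shipping it would list exactly the keys the consumer cannot reach.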

      If I get some time it'll be interesting to find out why those few characters are really being sorted in a different order. Might be a bug in either C# or Perl.

      Anyway, thanks for your help.

         larryk                                          
      perl -le "s,,reverse killer,e,y,rifle,lycra,,print"