PerlMonks  

sorting Chinese characters

by larryk (Friar)
on Feb 01, 2013 at 10:22 UTC (#1016504=perlquestion)
larryk has asked for the wisdom of the Perl Monks concerning the following question:

Hi! It has been some time since my last post but I haven't forgotten about you guys :)

Bit of an oddball query, this one... but I bet someone knows one or two ways to do it.

I'm creating search indexes with Perl for my Chinese dictionary mobile app and am hitting a problem where the (Perl) sorted keys - Chinese characters - are not in exactly the same order as the binary search on-device is expecting them to be.

In real terms, there are 165910 index records and 188 of them are unreachable on-device because the sort order is slightly different between Perl's standard string sort and C#'s String.Compare function (with "en-US" culture).

I've played around for weeks with the culture settings and this 188 unreachable number is the optimal result I have achieved. So 2 questions:

1. (long shot) has anyone seen this issue before so knows the magic incantation to get Perl and C# to agree 100% on sort order for UTF-8?

2. (failing that) how do I get Perl to sort by Unicode code point (i.e. the raw underlying \u{xxxx} value), because I can probably force C# to do it that way as an exception for this index?
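
On question 2, a minimal sketch, assuming the keys are decoded character strings (not raw UTF-8 bytes) and that `use locale` is not in effect: Perl's built-in `cmp` then compares strings code point by code point, so a plain `sort` already yields code point order. The sample keys are hypothetical and are not taken from the real index.

```perl
use strict;
use warnings;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical sample keys; the real index holds roughly 166k of them.
my @keys = ("\x{4E8C}", "\x{4E00}", "\x{3400}", "\x{4E09}");

# With no "use locale" in scope, cmp compares decoded strings
# code point by code point.
my @by_code_point = sort { $a cmp $b } @keys;

# Print each key next to the code points it is made of.
for my $key (@by_code_point) {
    my $hex = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $key;
    print "$hex  $key\n";
}
```

If the keys are read from a file, they would need to be decoded first (for example by opening the file with an `:encoding(UTF-8)` layer) for this to hold.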

Any help much appreciated.

   larryk                                          
perl -le "s,,reverse killer,e,y,rifle,lycra,,print"

Re: sorting Chinese characters
by choroba (Abbot) on Feb 01, 2013 at 10:28 UTC
    Which modules and functions from Task::Unicode have you used?
    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      I found and tried Unicode::Collate::Locale->new(locale => 'en-US')->sort() instead of regular Perl sort() and that has improved the ratio of reachable to unreachable keys in the index. Now there are only 126 unreachable (compared to 188, earlier) out of a total of ~166k entries.
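
A rough sketch of the comparison described above, listing the positions where the collator and the built-in `sort` disagree. The sample keys are hypothetical. As far as I know, `Unicode::Collate::Locale` has no English-specific tailoring, so `en-US` should fall back to the default collation table, but treat that as an assumption.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical sample keys standing in for the real index.
my @keys = ("\x{3400}", "\x{4E8C}", "\x{2F00}", "\x{4E00}");

my $collator = Unicode::Collate::Locale->new(locale => 'en-US');

my @by_code_point = sort @keys;              # built-in string sort
my @by_collator   = $collator->sort(@keys);  # Unicode Collation Algorithm

# Report every position where the two orders disagree.
for my $i (0 .. $#keys) {
    next if $by_code_point[$i] eq $by_collator[$i];
    printf "position %d: sort has U+%04X, collator has U+%04X\n",
        $i, ord($by_code_point[$i]), ord($by_collator[$i]);
}
```

Running the same diff against the full key list would show which entries move between the two orderings.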

      It's good, but not good enough. I could live with 0.1% droppage from the index except that one of the dropped keys is '⼀' which is the Chinese character meaning 'one'. That's a noddy mistake in a Chinese-English dictionary application so I need to still fix it.
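
Before blaming the sort itself, it may help to confirm which code points the unreachable keys actually contain, because some CJK characters have near-identical lookalikes: U+2F00 KANGXI RADICAL ONE renders much like U+4E00, the unified ideograph for 'one'. The sketch below names each code point in a key and flags keys that change under compatibility normalization (NFKC). `describe_key` is a hypothetical helper, and this does not claim to be the cause of the mismatch.

```perl
use strict;
use warnings;
use charnames ();
use Unicode::Normalize qw(NFKC);
binmode STDOUT, ':encoding(UTF-8)';

# Describe every code point in a key, and note whether compatibility
# normalization (NFKC) would turn the key into a different string.
sub describe_key {
    my ($key) = @_;
    my @parts = map {
        my $cp = ord($_);
        sprintf('U+%04X %s', $cp, charnames::viacode($cp) // '(unnamed)');
    } split //, $key;
    my $note = NFKC($key) eq $key ? '' : ' [changes under NFKC]';
    return join(', ', @parts) . $note;
}

# Two hypothetical keys that look almost the same on screen.
print describe_key($_), "\n" for "\x{4E00}", "\x{2F00}";
```

Any key flagged this way is a compatibility character, which a collation-based sort and a code point sort can place in very different positions.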

      Problem is that I think I'm almost at the "this must be a subtle difference in the Unicode tables of Perl and C#" stage, so perhaps you have some specific experience sorting Unicode, or were you just suggesting I look at this package of libraries?

         larryk                                          
      perl -le "s,,reverse killer,e,y,rifle,lycra,,print"
      
        Why are you using English locale for Chinese?

        It would not be surprising if the Unicode support was weaker in C# than in Perl, but I have no experience with C# and it is not mentioned in Unicode Good, Bad, & Ugly.

        I have no experience with sorting Chinese. I have studied articles like What's wrong with sort and how to fix it, though, and at work I am dealing mostly with Czech, which fortunately uses Latin letters (plus some less common diacritics like ř or ů), but whose "official" sorting algorithm is unfortunately practically unimplementable (e.g. numbers should be sorted as pronounced).

        لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: sorting Chinese characters
by Anonymous Monk on Feb 01, 2013 at 10:41 UTC
Re: sorting Chinese characters
by punch_card_don (Curate) on Feb 01, 2013 at 13:39 UTC
    If it were me, I think I'd step back from the problem a moment and ask myself if trying to synchronize two independent sorting mechanisms is really what I want. Sounds like a potential source of on-going headaches.

    Would assigning a unique identifier that defines order on both systems work in the given situation?




    Time flies like an arrow. Fruit flies like a banana.
      The key in the index is unique, and the Unicode standard is supposed to define the order. I was rather hoping I could depend on that, since it is supposed to be a standard.

      Looks like my options are:

      • I could use the ordinal value of the characters and do a numeric sort instead of a textual sort. That's what I was suggesting in point 2 above.
      • The only other option is to ditch my Perl index creation and rewrite it in C# so that the index creation and the index usage are both using exactly the same sorting library. Not sure I want to go there yet, though.
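
One caveat on the first option, sketched under assumptions: C# strings are UTF-16, and an ordinal comparison there (for example `String.CompareOrdinal`) compares 16-bit code units, not code points. The two orders differ only when a key contains characters above U+FFFF, which UTF-16 stores as surrogate pairs. Whether the app would use an ordinal comparison is an assumption, and `cmp_utf16` is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Compare two decoded strings by their UTF-16 code units. Big-endian
# byte order makes a bytewise cmp equivalent to a code unit comparison.
sub cmp_utf16 {
    my ($x, $y) = @_;
    return encode('UTF-16BE', $x) cmp encode('UTF-16BE', $y);
}

# Hypothetical keys: U+FF11 is in the BMP, U+20000 needs a surrogate pair.
my @keys = ("\x{20000}", "\x{4E00}", "\x{FF11}");

my @by_code_point = sort { $a cmp $b } @keys;
my @by_code_unit  = sort { cmp_utf16($a, $b) } @keys;

for my $pair (['code point', \@by_code_point], ['UTF-16 unit', \@by_code_unit]) {
    my ($label, $list) = @$pair;
    print "$label: ", join(' ', map { sprintf 'U+%04X', ord($_) } @$list), "\n";
}
```

If the index contains no characters beyond the Basic Multilingual Plane, the two orders coincide and the plain code point sort is enough.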
      Any other options?
         larryk                                          
      perl -le "s,,reverse killer,e,y,rifle,lycra,,print"
      

Node Type: perlquestion [id://1016504]
Approved by Ratazong