
Re^2: Sorting Vietnamese text

by pdenisowski (Acolyte)
on Dec 22, 2013 at 20:07 UTC (#1068114)


in reply to Re: Sorting Vietnamese text
in thread Sorting Vietnamese text

Giving the output:

unsorted ả ậ ă ạ ẫ a ẩ

sorted a ả ạ ă ẩ ẫ ậ

Thanks, but there are two issues: (1) that's still not the correct sort order ( should come before ), and (2) I actually get a different "sorted" list when I run the same exact code. This is the problem that I have: it seems the sort algorithms ignore the tone marks.


Re^3: Sorting Vietnamese text
by farang (Hermit) on Dec 22, 2013 at 23:48 UTC

    (1) that's still not the correct sort order ( should come before )
    I've no idea, but this page indicates the opposite. You may have to create a custom sort to override the default if desired.
    (2) I actually get a different "sorted" list when I run the same exact code. This is the problem that I have - it seems the sort algorithms ignore the tone marks.
    Perhaps it has to do with normalization. I still get the same sort order when using it.
    #!/usr/bin/env perl
    use v5.14;
    use warnings;
    use utf8::all;                  # UTF-8 source literals and UTF-8 I/O
    use Unicode::Collate::Locale;
    use Unicode::Normalize;
    
    # Collator tailored for Vietnamese
    my $Collator = Unicode::Collate::Locale->new(locale =>'vi');
    my @unsorted = ('', 'ả', '', '', 'ậ', 'ă', 'ạ', 'ẫ', 'a', 'ẩ' );
    # Decompose every string to NFD before sorting
    @unsorted = map { NFD($_) } @unsorted;
    my @sorted = $Collator->sort(@unsorted);
    
    # Recompose to NFC for display
    say NFC("unsorted\n@unsorted");
    say NFC("sorted\n@sorted");
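For the "custom sort to override the default" idea, here is a minimal sketch rather than a tested solution. It compares base letters with the `vi` collator at level 1, then breaks ties with a tone ranking you supply. The ranking in `%TONE_RANK` is purely hypothetical (it is not taken from any dictionary mentioned in this thread), and `tone_key` and `sort_with_tone_order` are names made up for the example.

```perl
#!/usr/bin/env perl
use v5.14;
use warnings;
use utf8;
use Unicode::Collate::Locale;
use Unicode::Normalize qw(NFD);

binmode STDOUT, ':encoding(UTF-8)';

# HYPOTHETICAL tone order; change the ranks to match the dictionary you trust.
my %TONE_RANK = (
    "\x{0301}" => 1,   # acute (sắc)
    "\x{0300}" => 2,   # grave (huyền)
    "\x{0309}" => 3,   # hook above (hỏi)
    "\x{0303}" => 4,   # tilde (ngã)
    "\x{0323}" => 5,   # dot below (nặng)
);

# Level 1 compares base letters only; tone marks do not affect this key.
my $base_collator = Unicode::Collate::Locale->new(locale => 'vi', level => 1);

# Build one digit per base character: 0 for no tone, else the tone's rank.
# Non-tone marks (breve, circumflex, horn) are skipped here because the
# level-1 key already accounts for them.
sub tone_key {
    my ($str) = @_;
    my @ranks;
    for my $ch (split //, NFD($str)) {
        if ($ch =~ /\p{Mn}/) {
            $ranks[-1] = $TONE_RANK{$ch} if @ranks && exists $TONE_RANK{$ch};
        }
        else {
            push @ranks, 0;
        }
    }
    return join '', @ranks;
}

# Sort by base letters first, then by the custom tone ranking.
sub sort_with_tone_order {
    my @words = @_;
    return map  { $_->[0] }
           sort { $a->[1] cmp $b->[1] or $a->[2] cmp $b->[2] }
           map  { [ $_, $base_collator->getSortKey($_), tone_key($_) ] }
           @words;
}

say join ' ', sort_with_tone_order(qw(ạ ả a á ă));
```

Keeping the tone order in a plain hash means switching between an older and a newer dictionary convention only requires renumbering five entries.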
    

      Thanks! I'll give it a try.

      When I learned Vietnamese, the order of the tones in every dictionary (all my older ones) was

      a ả ạ

      Some of my newer dictionaries use the order you mention above, but after twenty years of doing it one way, it's a little hard to change :)

      There are also some differences in how initial consonant clusters are handled: does "thu" come before "tu"? (In my older dictionaries "th" and "tr" are considered single "letters", kind of like c and ch in Spanish.) I figured I would let this slide for now ...

      Thanks again!

      Sorry, still doesn't work -- see below. For example, I would expect "ỳ " (note space) to come before "ỳ ạch"

      This is what I've been struggling with for a LONG time :)

      unsorted
       ỷ : (1) to be fat (said of a pig); (2) to depend on
       ỳ : inertia, state of inactivity, stay out, inert, sluggish
       ỳ ạch : to toil, labor with difficulty
       ỷ eo : reproach someone with something
       ỷ lại : to depend, rely on others
       ỷ thế : count on ones power, ones position, ones influence
       yu nhau : to love each other, be in love
       yu qu : precious, valuable
      

      sorted
       ỷ : (1) to be fat (said of a pig); (2) to depend on
       ỳ ạch : to toil, labor with difficulty
       ỷ eo : reproach someone with something
       yu nhau : to love each other, be in love
       yu qu : precious, valuable
       ỳ : inertia, state of inactivity, stay out, inert, sluggish
       ỷ lại : to depend, rely on others
       ỷ thế : count on ones power, ones position, ones influence

        Are you absolutely certain your text is Unicode (UTF-8)? It's not TCVN (CP1258, ISO-2022-VN or EUC-VN), is it?

        I'm sorry if this is an "Is the power cord plugged in?" kind of question, but it just doesn't make sense that you're getting different output than farang got.

        Jim
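One quick way to answer the encoding question is to read the file as raw bytes and see whether they decode as strict UTF-8. This is only a sketch: `is_valid_utf8` is a helper name invented here, and the file path is whatever you pass on the command line. A file in a legacy Vietnamese encoding will normally fail the check, but passing it does not prove the script itself decodes its input correctly.

```perl
#!/usr/bin/env perl
use v5.14;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Returns 1 if $bytes decodes cleanly as strict UTF-8, 0 otherwise.
# LEAVE_SRC keeps decode() from modifying the caller's buffer.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Hypothetical usage: pass the dictionary file as the first argument.
if (@ARGV) {
    my $path = shift @ARGV;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> } // '';
    close $fh;
    say is_valid_utf8($bytes) ? "$path: valid UTF-8" : "$path: not valid UTF-8";
}
```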

        sorted
         ỷ : (1) to be fat (said of a pig); (2) to depend on
         ỳ ạch : to toil, labor with difficulty
         ỷ eo : reproach someone with something
         yu nhau : to love each other, be in love
         yu qu : precious, valuable
         ỳ : inertia, state of inactivity, stay out, inert, sluggish
         ỷ lại : to depend, rely on others
         ỷ thế : count on ones power, ones position, ones influence
        
        Okay, I also get that output when using the entire lines as written. However, cutting those lines short at or before the colon ':' gives this:
        sorted
        ỳ :
        ỷ :
        ỳ ạch :
        ỷ eo :
        yu nhau :
        yu qu :
        ỷ lại :
        ỷ thế :
        
        What seems to be going on is that due to the complicated rules for ordering in Vietnamese based on syllables, having the English translation after the Vietnamese is messing up the sorting.
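A possible contributing factor, offered as an assumption and not something established in this thread: Unicode::Collate uses `variable => 'shifted'` by default, so spaces and punctuation are skipped at the primary level. "ỳ : inertia" is then compared roughly as "y" followed by "inertia", which lands after "ỳ ạch". The sketch below compares the default with `variable => 'non-ignorable'` on two of the lines from above.

```perl
#!/usr/bin/env perl
use v5.14;
use warnings;
use utf8;
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

my @lines = (
    'ỳ ạch : to toil, labor with difficulty',
    'ỳ : inertia, state of inactivity, stay out, inert, sluggish',
);

# Default settings: spaces and punctuation only act as a late tie-breaker.
my $default = Unicode::Collate::Locale->new(locale => 'vi');

# Same locale, but spaces and punctuation get ordinary primary weights.
my $strict = Unicode::Collate::Locale->new(
    locale   => 'vi',
    variable => 'non-ignorable',
);

say "default:";
say "  $_" for $default->sort(@lines);
say "non-ignorable:";
say "  $_" for $strict->sort(@lines);
```

Note that with 'non-ignorable' the gloss text still takes part in the comparison, so it is not a full substitute for sorting on the Vietnamese part alone.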

        I'd suggest trying to separate them into a hash if possible (split on the colon, maybe) so the sort can be based only on the Vietnamese.
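A rough sketch of that idea, assuming every line has the form "headword : gloss" and that the first colon is the separator. `sort_by_headword` is a name invented for this example. It keeps array records instead of a hash so that two entries with the same headword are both preserved.

```perl
#!/usr/bin/env perl
use v5.14;
use warnings;
use utf8;
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

my $collator = Unicode::Collate::Locale->new(locale => 'vi');

# Sort "headword : gloss" lines by the headword's collation key only.
# Lines whose headwords compare equal fall back to a plain comparison
# of the gloss text.
sub sort_by_headword {
    my @lines = @_;
    return map  { $_->[0] }
           sort { $a->[1] cmp $b->[1] or $a->[2] cmp $b->[2] }
           map  {
               my ($headword, $gloss) = split /\s*:\s*/, $_, 2;
               $headword =~ s/^\s+//;    # drop leading indentation
               [ $_, $collator->getSortKey($headword), $gloss // '' ];
           } @lines;
}

my @entries = (
    'ỷ : (1) to be fat (said of a pig); (2) to depend on',
    'ỳ : inertia, state of inactivity, stay out, inert, sluggish',
    'ỳ ạch : to toil, labor with difficulty',
    'ỷ lại : to depend, rely on others',
    'ỷ thế : count on ones power, ones position, ones influence',
);

say for sort_by_headword(@entries);
```

Computing each sort key once with `getSortKey` also avoids re-running the collation for every comparison, which matters for a full dictionary.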
