http://www.perlmonks.org?node_id=1068106


in reply to Sorting Vietnamese text

Update: Sorry, there are some errors in the code below. In particular, the constructor for the collator should be:

my $Collator = Unicode::Collate::Locale->new(locale =>'vi');
Then the sort method will work as intended. Try it with actual Vietnamese words.

Unicode::Collate::Locale ought to help. The example code below is not in code tags because of a display bug with UTF-8 text.

#!/usr/bin/env perl
use v5.14;
use warnings;
use utf8::all;

use Unicode::Collate::Locale;
my $Collator = Unicode::Collate::Locale->new('vi');  # wrong: needs locale => 'vi' (see the update above)

my @unsorted = qw(
                  a..7
                  ả..3
                  à..9
                  ạ..5
                  ã..4
                  á..1
                  ă..6
                  à..2
                  á..8
                 );

my @sorted = $Collator->sort(@unsorted);

say "unsorted\n@unsorted";
say "sorted\n@sorted";
Output is as follows.
unsorted
a..7 ả..3 à..9 ạ..5 ã..4 á..1 ă..6 à..2 á..8
sorted
á..1 à..2 ả..3 ã..4 ạ..5 ă..6 a..7 á..8 à..9

Update #2: The code below is actually a correct example.

#!/usr/bin/env perl
use v5.14;
use warnings;
use utf8::all;
use Unicode::Collate::Locale;

my $Collator = Unicode::Collate::Locale->new(locale =>'vi');

my @unsorted = ('á', 'ả', 'ã', 'à', 'ậ', 'ă', 'ạ', 'ẫ', 'a', 'ẩ' );
my @sorted = $Collator->sort(@unsorted);

say "unsorted\n@unsorted";
say "sorted\n@sorted";
Giving the output:
unsorted
á ả ã à ậ ă ạ ẫ a ẩ
sorted
a à ả ã á ạ ă ẩ ẫ ậ
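One way to check that the 'vi' tailoring is really in effect (my understanding is that the first version above, with new('vi'), quietly fell back to the default collation) is to ask the collator which locale it ended up with. This is only a small sketch; it uses core `use utf8` and `use open` instead of utf8::all so that it runs without extra modules, and the comparison of 'a' and 'ă' is just an illustrative probe.

```perl
#!/usr/bin/env perl
use v5.14;
use warnings;
use utf8;
use open qw(:std :encoding(UTF-8));
use Unicode::Collate::Locale;

my $Collator = Unicode::Collate::Locale->new(locale => 'vi');

# getlocale reports the locale the collator actually ended up using.
say 'locale in use: ', $Collator->getlocale;

# The module version is handy when two machines disagree on the result.
say 'Unicode::Collate::Locale version: ', Unicode::Collate::Locale->VERSION;

# cmp compares two strings under the collator's rules (negative, zero or positive).
say "cmp('a', 'ă') = ", $Collator->cmp('a', 'ă');
```

If the first line does not print the locale you asked for, the tailoring was not applied, and the sort falls back to generic Unicode ordering.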

Re^2: Sorting Vietnamese text
by pdenisowski (Acolyte) on Dec 22, 2013 at 20:07 UTC

    Giving the output:

    unsorted á ả ã à ậ ă ạ ẫ a ẩ

    sorted a à ả ã á ạ ă ẩ ẫ ậ

    Thanks, but there are two issues: (1) that's still not the correct sort order (á should come before à), and (2) I actually get a different "sorted" list when I run the exact same code. This is the problem that I have: it seems the sort algorithms ignore the tone marks.

      (1) that's still not the correct sort order (á should come before à)
      I've no idea, but this page indicates the opposite. You may have to create a custom sort to override the default if desired.
      (2) I actually get a different "sorted" list when I run the same exact code. This is the problem that I have - it seems the sort algorithms ignore the tone marks.
      Perhaps it has to do with normalization. I still get the same sort order when I normalize the input to NFD before sorting:
      #!/usr/bin/env perl
      use v5.14;
      use warnings;
      use utf8::all;
      use Unicode::Collate::Locale;
      use Unicode::Normalize;
      
      my $Collator = Unicode::Collate::Locale->new(locale =>'vi');
      my @unsorted = ('á', 'ả', 'ã', 'à', 'ậ', 'ă', 'ạ', 'ẫ', 'a', 'ẩ' );
      @unsorted = map { NFD($_) } @unsorted;
      my @sorted = $Collator->sort(@unsorted);
      
      say NFC("unsorted\n@unsorted");
      say NFC("sorted\n@sorted");
      

        Sorry, still doesn't work -- see below. For example, I would expect "ỳ " (note space) to come before "ỳ ạch"

        This is what I've been struggling with for a LONG time :)

        unsorted
         ỷ : (1) to be fat (said of a pig); (2) to depend on
         ỳ : inertia, state of inactivity, stay out, inert, sluggish
         ỳ ạch : to toil, labor with difficulty
         ỷ eo : reproach someone with something
         ỷ lại : to depend, rely on others
         ỷ thế : count on one’s power, one’s position, one’s influence
         yêu nhau : to love each other, be in love
         yêu quí : precious, valuable
        

        sorted
         ỷ : (1) to be fat (said of a pig); (2) to depend on
         ỳ ạch : to toil, labor with difficulty
         ỷ eo : reproach someone with something
         yêu nhau : to love each other, be in love
         yêu quí : precious, valuable
         ỳ : inertia, state of inactivity, stay out, inert, sluggish
         ỷ lại : to depend, rely on others
         ỷ thế : count on one’s power, one’s position, one’s influence
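A possible explanation for the order above, offered as a guess rather than a diagnosis: the whole line, definition included, is being collated, and under the default `variable => 'shifted'` setting spaces and punctuation are skipped at the primary level while tone marks only count at a later level. That would make "ỷ : (1) to be fat" compare roughly like "y1tobefat" and "ỳ : inertia" like "yinertia", which matches the list. The sketch below sorts on the headword only and switches the collator to `variable => 'non-ignorable'` so that the space in "ỳ ạch" counts. The " : " separator is taken from the sample data; the core `use utf8` and `use open` lines stand in for utf8::all.

```perl
#!/usr/bin/env perl
use v5.14;
use warnings;
use utf8;
use open qw(:std :encoding(UTF-8));
use Unicode::Collate::Locale;

# 'non-ignorable' makes spaces and punctuation count at the primary level
# instead of being skipped, as they are under the default 'shifted' setting.
my $Collator = Unicode::Collate::Locale->new(
    locale   => 'vi',
    variable => 'non-ignorable',
);

my @entries = (
    'ỷ : (1) to be fat (said of a pig); (2) to depend on',
    'ỳ : inertia, state of inactivity, stay out, inert, sluggish',
    'ỳ ạch : to toil, labor with difficulty',
    'ỷ eo : reproach someone with something',
    'ỷ lại : to depend, rely on others',
    'ỷ thế : count on one’s power, one’s position, one’s influence',
    'yêu nhau : to love each other, be in love',
    'yêu quí : precious, valuable',
);

# Build a sort key from the headword only (the text before " : "),
# then order the full lines by that key.
my @sorted =
    map  { $_->[1] }
    sort { $a->[0] cmp $b->[0] }
    map  { [ $Collator->getSortKey( (split /\s+:\s+/, $_, 2)[0] ), $_ ] }
    @entries;

say for @sorted;
```

With this, a bare headword such as "ỳ" sorts ahead of any longer entry that starts with it, and the definitions no longer influence the order. The relative order of the tones themselves is still whatever the 'vi' tailoring says.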
        Thanks! I'll give it a try.

        When I learned Vietnamese, the order of the tones in every dictionary (all my older ones) was

        a á à ả ã ạ

        Some of my newer dictionaries use the order you mention above, but after twenty years of doing it one way, it's a little hard to change :)

        There are also some differences in how initial consonant clusters are handled: does "thu" come before "tu"? (In my older dictionaries "th" and "tr" are considered single "letters", kind of like c and ch in Spanish.) I figured I would let this slide for now ...

        Thanks again!
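For the older dictionary tone order (a á à ả ã ạ), one option is a custom comparison layered on top of the 'vi' collator. The following is only a sketch under some assumptions: `tone_key` and `sort_old_order` are made-up helper names, the tone ranks are read straight off the order quoted above, and tones are used purely as a tiebreaker once the tone-free spellings compare equal, which is a simplification of real dictionary practice. It does not touch the th/tr question.

```perl
#!/usr/bin/env perl
use v5.14;
use warnings;
use utf8;
use open qw(:std :encoding(UTF-8));
use Unicode::Normalize qw(NFD NFC);
use Unicode::Collate::Locale;

my $Collator = Unicode::Collate::Locale->new(locale => 'vi');

# Rank of each combining tone mark in the older order: a á à ả ã ạ.
my %tone_rank = (
    "\x{0301}" => 1,    # acute
    "\x{0300}" => 2,    # grave
    "\x{0309}" => 3,    # hook above
    "\x{0303}" => 4,    # tilde
    "\x{0323}" => 5,    # dot below
);
my $tone_class = qr/[\x{0300}\x{0301}\x{0303}\x{0309}\x{0323}]/;

# Hypothetical helper: returns [tone-free spelling, tone digits, original word].
sub tone_key {
    my ($word) = @_;
    my $nfd = NFD($word);
    # One digit per letter: the rank of its tone mark, or 0 if it has none.
    my $tones = join '',
        map { $_ =~ /($tone_class)/ ? $tone_rank{$1} : 0 } $nfd =~ /(\X)/g;
    # Strip the tone marks but keep breve, circumflex and horn on the letters.
    (my $base = $nfd) =~ s/$tone_class//g;
    return [ NFC($base), $tones, $word ];
}

# Hypothetical helper: letters first (per the 'vi' collator), then tones.
sub sort_old_order {
    return map  { $_->[2] }
           sort { $Collator->cmp($a->[0], $b->[0]) || $a->[1] cmp $b->[1] }
           map  { tone_key($_) } @_;
}

my @sorted = sort_old_order('ạ', 'ă', 'ã', 'ả', 'à', 'á', 'a');
say "@sorted";
```

The design choice here is to leave letter ordering (a, ă, â and so on) to Unicode::Collate::Locale and override only the tone ranking, so changing the `%tone_rank` table is enough to switch to a different tone order.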