PerlMonks  

Re: Sorting Vietnamese text

by farang (Hermit)
on Dec 22, 2013 at 18:27 UTC (#1068106)


in reply to Sorting Vietnamese text

Update: Sorry, some errors in the code below. In particular, the constructor for the collator should be this.

my $Collator = Unicode::Collate::Locale->new(locale =>'vi');
Then the sort method will work as intended. Try it with actual Vietnamese words.
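
One way to check that the named-argument constructor really picked up the Vietnamese tailoring is to ask the collator which locale it ended up with. The sketch below is only an illustration: it relies on the getlocale method, which the Unicode::Collate::Locale documentation describes as returning 'default' when no tailoring is applied. The variable name $active is introduced here and is not from the original post.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Named-argument form, as in the corrected constructor above.
my $Collator = Unicode::Collate::Locale->new(locale => 'vi');

# getlocale reports the tailoring actually in use; the module
# documents 'default' as the value when no tailoring applies.
my $active = $Collator->getlocale;
print "active locale: $active\n";
warn "Vietnamese tailoring was not loaded\n" if $active eq 'default';
```

If this prints 'default', the sort falls back to the untailored Unicode order, which may explain results that differ from the ones shown in this thread.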

Unicode::Collate::Locale ought to help. Example code below not using code tags due to display bug with utf8 text.

#!/usr/bin/env perl
use v5.14;
use warnings;
use utf8::all;

use Unicode::Collate::Locale;
my $Collator = Unicode::Collate::Locale->new('vi');

my @unsorted = qw(
                  a..7
                  ả..3
                  ..9
                  ạ..5
                  ..4
                  ..1
                  ă..6
                  ..2
                  ..8
                 );

my @sorted = $Collator->sort(@unsorted);

say "unsorted\n@unsorted";
say "sorted\n@sorted";
Output is as follows.
unsorted
a..7 ả..3 ..9 ạ..5 ..4 ..1 ă..6 ..2 ..8
sorted
..1 ..2 ả..3 ..4 ạ..5 ă..6 a..7 ..8 ..9

Update #2: The code below actually is a correct example.

#!/usr/bin/env perl
use v5.14;
use warnings;
use utf8::all;
use Unicode::Collate::Locale;

my $Collator = Unicode::Collate::Locale->new(locale =>'vi');

my @unsorted = ('', 'ả', '', '', 'ậ', 'ă', 'ạ', 'ẫ', 'a', 'ẩ' );
my @sorted = $Collator->sort(@unsorted);

say "unsorted\n@unsorted";
say "sorted\n@sorted";
Giving the output:
unsorted
 ả   ậ ă ạ ẫ a ẩ
sorted
a  ả   ạ ă ẩ ẫ ậ


Re^2: Sorting Vietnamese text
by pdenisowski (Acolyte) on Dec 22, 2013 at 20:07 UTC

    Giving the output:

    unsorted ả ậ ă ạ ẫ a ẩ

    sorted a ả ạ ă ẩ ẫ ậ

    Thanks, but there are two issues : (1) that's still not the correct sort order ( should come before ), and (2) I actually get a different "sorted" list when I run the same exact code. This is the problem that I have - it seems the sort algorithms ignore the tone marks.

      (1) that's still not the correct sort order ( should come before )
      I've no idea, but this page indicates the opposite. You may have to create a custom sort to override the default if desired.
      (2) I actually get a different "sorted" list when I run the same exact code. This is the problem that I have - it seems the sort algorithms ignore the tone marks.
      Perhaps it has to do with normalization. I still get the same sort order when I normalize the input to NFD before sorting.
      #!/usr/bin/env perl
      use v5.14;
      use warnings;
      use utf8::all;
      use Unicode::Collate::Locale;
      use Unicode::Normalize;
      
      my $Collator = Unicode::Collate::Locale->new(locale =>'vi');
      my @unsorted = ('', 'ả', '', '', 'ậ', 'ă', 'ạ', 'ẫ', 'a', 'ẩ' );
      @unsorted = map { NFD($_) } @unsorted;
      my @sorted = $Collator->sort(@unsorted);
      
      say NFC("unsorted\n@unsorted");
      say NFC("sorted\n@sorted");
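
As a rough check of the normalization idea, the sketch below compares the composed (NFC) and decomposed (NFD) forms of one letter with the collator's eq method. It assumes Unicode::Normalize is installed, in which case Unicode::Collate normalizes its input by default, so both forms should collate identically. The variable names are illustrative only.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use Unicode::Collate::Locale;
use Unicode::Normalize qw(NFC NFD);

my $Collator = Unicode::Collate::Locale->new(locale => 'vi');

my $composed   = NFC('ậ');   # one precomposed code point
my $decomposed = NFD('ậ');   # base letter plus two combining marks

# The two strings differ in length as code point sequences ...
printf "NFC length: %d, NFD length: %d\n",
    length($composed), length($decomposed);

# ... but the collator treats them as equal.
my $same = $Collator->eq($composed, $decomposed) ? 1 : 0;
print "collate as equal: $same\n";
```

If the two forms compare equal on the machine that gives a different sorted list, the explicit NFD step is probably not the cause, and the difference may lie elsewhere (for example in how the source file or input is decoded, or in the module version).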
      

        Thanks! I'll give it a try.

        When I learned Vietnamese, the order of the tones in every dictionary (all my older ones) was

        a ả ạ

        Some of my newer dictionaries use the order you mention above, but after twenty years of doing it one way, it's a little hard to change :)
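
A custom tone order like the one described above could be sketched as a two-step comparison: base letters first, via the 'vi' collator restricted to level 1, then a tone rank taken from a user-supplied list. Everything named here (@tone_order, tone_rank, by_old_order) is hypothetical, and the order used (no mark, sắc, huyền, hỏi, ngã, nặng) is an assumption, since the full sequence did not survive in the post. The sketch only handles single syllables.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use open qw(:std :encoding(UTF-8));
use Unicode::Collate::Locale;
use Unicode::Normalize qw(NFD);

# Level 1 compares base letters only, so tone marks are ignored here.
my $Base = Unicode::Collate::Locale->new(locale => 'vi', level => 1);

# ASSUMED tone order: acute, grave, hook above, tilde, dot below.
my @tone_order = ("\x{0301}", "\x{0300}", "\x{0309}", "\x{0303}", "\x{0323}");
my %tone_rank;
@tone_rank{@tone_order} = (1 .. @tone_order);

# Rank of the first tone mark found in a syllable; 0 when unmarked.
sub tone_rank {
    my ($syllable) = @_;
    my ($mark) = NFD($syllable) =~ /([\x{0300}\x{0301}\x{0303}\x{0309}\x{0323}])/;
    return defined $mark ? $tone_rank{$mark} : 0;
}

# Base letters decide first; the tone rank only breaks ties.
sub by_old_order {
    my ($x, $y) = @_;
    return $Base->cmp($x, $y) || tone_rank($x) <=> tone_rank($y);
}

my @sorted = sort { by_old_order($a, $b) } qw(ạ à a ả á ã);
print "@sorted\n";
```

Changing the order of the entries in @tone_order is enough to try a different dictionary convention. Words of more than one syllable would need the comparison applied syllable by syllable.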

        There are also some differences in how initial consonant clusters are handled: does "thu" come before "tu"? (In my older dictionaries "th" and "tr" are considered single "letters", kind of like c and ch in Spanish.) I figured I would let this slide for now ...

        Thanks again!

        Sorry, still doesn't work -- see below. For example, I would expect "ỳ " (note space) to come before "ỳ ạch".

        This is what I've been struggling with for a LONG time :)

        unsorted
         ỷ : (1) to be fat (said of a pig); (2) to depend on
         ỳ : inertia, state of inactivity, stay out, inert, sluggish
         ỳ ạch : to toil, labor with difficulty
         ỷ eo : reproach someone with something
         ỷ lại : to depend, rely on others
         ỷ thế : count on ones power, ones position, ones influence
         yu nhau : to love each other, be in love
         yu qu : precious, valuable
        

        sorted
         ỷ : (1) to be fat (said of a pig); (2) to depend on
         ỳ ạch : to toil, labor with difficulty
         ỷ eo : reproach someone with something
         yu nhau : to love each other, be in love
         yu qu : precious, valuable
         ỳ : inertia, state of inactivity, stay out, inert, sluggish
         ỷ lại : to depend, rely on others
         ỷ thế : count on ones power, ones position, ones influence
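
The sorted list above appears consistent with the collator being handed each whole line, definition included. Tone marks only break ties once the base letters of the entire string are equal, and spaces and punctuation are largely ignored at the first levels, so "ỳ : inertia ..." would be placed by the "i" of "inertia" rather than by its headword. A possible workaround, sketched below under that assumption, is to sort on the headword alone and compare it one syllable at a time. The helper cmp_headwords and the " : " separator convention are illustrative, not from the thread's code.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use open qw(:std :encoding(UTF-8));
use Unicode::Collate::Locale;

my $Collator = Unicode::Collate::Locale->new(locale => 'vi');

# Compare two headwords one syllable at a time; when every shared
# syllable ties, the headword with fewer syllables sorts first.
sub cmp_headwords {
    my ($x, $y) = @_;
    my @x = split ' ', $x;
    my @y = split ' ', $y;
    while (@x && @y) {
        my $order = $Collator->cmp(shift @x, shift @y);
        return $order if $order;
    }
    return @x <=> @y;
}

my @entries = (
    'ỳ ạch : to toil, labor with difficulty',
    'ỷ lại : to depend, rely on others',
    'ỳ : inertia, state of inactivity',
);

# Pair each entry with its headword (the text before the first colon),
# sort on the headword only, then unwrap the original line.
my @sorted =
    map  { $_->[1] }
    sort { cmp_headwords($a->[0], $b->[0]) }
    map  { [ (split /\s*:\s*/, $_, 2)[0], $_ ] } @entries;

print "$_\n" for @sorted;
```

With this approach the one-syllable entry "ỳ" is placed before "ỳ ạch", because the first syllables tie and the shorter headword wins. Where the "ỷ" entries land relative to the "ỳ" ones still depends on the tone order of the collator in use.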
