PerlMonks  

Re: Sorting Vietnamese text

by farang (Chaplain)
on Dec 22, 2013 at 18:27 UTC


in reply to Sorting Vietnamese text

Update: Sorry, there are some errors in the first code example below. In particular, the constructor for the collator should be:

my $Collator = Unicode::Collate::Locale->new(locale =>'vi');
Then the sort method will work as intended. Try it with actual Vietnamese words.
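
A quick way to confirm that the corrected constructor really picked up the Vietnamese tailoring is to ask the collator which locale it accepted. This is a minimal sketch, assuming the getlocale method of Unicode::Collate::Locale, which reports 'default' when no locale-specific rules were applied; the variable name $accepted is only for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Named argument, as in the corrected constructor above.
my $Collator = Unicode::Collate::Locale->new(locale => 'vi');

# getlocale reports the tailoring actually in use;
# 'default' would mean no locale-specific rules were loaded.
my $accepted = $Collator->getlocale;
print "accepted locale: $accepted\n";
```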

Unicode::Collate::Locale ought to help. The example code below is not in code tags because of a display bug with UTF-8 text.

#!/usr/bin/env perl
use v5.14;
use warnings;
use utf8::all;

use Unicode::Collate::Locale;
my $Collator = Unicode::Collate::Locale->new('vi');  # incorrect: should be new(locale => 'vi'), see the update above

my @unsorted = qw(
                  a..7
                  ả..3
                  ..9
                  ạ..5
                  ..4
                  ..1
                  ă..6
                  ..2
                  ..8
                 );

my @sorted = $Collator->sort(@unsorted);

say "unsorted\n@unsorted";
say "sorted\n@sorted";
Output is as follows.
unsorted
a..7 ả..3 ..9 ạ..5 ..4 ..1 ă..6 ..2 ..8
sorted
..1 ..2 ả..3 ..4 ạ..5 ă..6 a..7 ..8 ..9

Update #2: The code below actually is a correct example.

#!/usr/bin/env perl
use v5.14;
use warnings;
use utf8::all;
use Unicode::Collate::Locale;

my $Collator = Unicode::Collate::Locale->new(locale =>'vi');

my @unsorted = ('', 'ả', '', '', 'ậ', 'ă', 'ạ', 'ẫ', 'a', 'ẩ' );
my @sorted = $Collator->sort(@unsorted);

say "unsorted\n@unsorted";
say "sorted\n@sorted";
Giving the output:
unsorted
 ả   ậ ă ạ ẫ a ẩ
sorted
a  ả   ạ ă ẩ ẫ ậ


Re^2: Sorting Vietnamese text
by pdenisowski (Acolyte) on Dec 22, 2013 at 20:07 UTC

    Giving the output:

    unsorted ả ậ ă ạ ẫ a ẩ

    sorted a ả ạ ă ẩ ẫ ậ

    Thanks, but there are two issues: (1) that's still not the correct sort order ( should come before ), and (2) I actually get a different "sorted" list when I run the same exact code. This is the problem that I have - it seems the sort algorithms ignore the tone marks.
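
One possible reason tone marks can appear to be ignored: in Unicode collation, diacritics are a secondary difference, so they only break ties between strings whose base letters are otherwise identical. The sketch below illustrates this with plain Unicode::Collate and its level option. It shows the general mechanism only and is not a diagnosis of the exact output in this thread; the variable names are chosen for illustration.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode(STDOUT, ':encoding(UTF-8)');

my $primary_only = Unicode::Collate->new(level => 1);  # base letters only
my $default      = Unicode::Collate->new;              # all levels

# At level 1 the tone mark is invisible: 'a' and 'ả' compare equal.
my $equal_at_level1 = $primary_only->eq('a', 'ả') ? 1 : 0;

# At the default level the unmarked letter sorts first.
my $plain_vs_hook = $default->cmp('a', 'ả');

# A later base-letter difference outranks an earlier tone difference:
# 'ảb' still sorts before 'ac' because b < c is a primary difference.
my $marked_vs_plain = $default->cmp('ảb', 'ac');

print "eq at level 1: $equal_at_level1\n";
print "cmp(a, ả)   = $plain_vs_hook\n";
print "cmp(ảb, ac) = $marked_vs_plain\n";
```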

      (1) that's still not the correct sort order ( should come before )
      I've no idea, but this page indicates the opposite. You may have to create a custom sort to override the default if desired.
      (2) I actually get a different "sorted" list when I run the same exact code. This is the problem that I have - it seems the sort algorithms ignore the tone marks.
      Perhaps it has to do with normalization. I still get the same sort order when using it.
      #!/usr/bin/env perl
      use v5.14;
      use warnings;
      use utf8::all;
      use Unicode::Collate::Locale;
      use Unicode::Normalize;
      
      my $Collator = Unicode::Collate::Locale->new(locale =>'vi');
      my @unsorted = ('', 'ả', '', '', 'ậ', 'ă', 'ạ', 'ẫ', 'a', 'ẩ' );
      @unsorted = map { NFD($_) } @unsorted;
      my @sorted = $Collator->sort(@unsorted);
      
      say NFC("unsorted\n@unsorted");
      say NFC("sorted\n@sorted");
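
      For context on why the explicit NFD step may not change anything: Unicode::Collate normalizes its input to NFD by default, so precomposed and decomposed spellings of the same letter should already compare as equal. A minimal sketch of that check, with variable names chosen for illustration:

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;
use Unicode::Normalize qw(NFC NFD);

my $Collator = Unicode::Collate::Locale->new(locale => 'vi');

my $composed   = NFC("\x{1EA3}");   # LATIN SMALL LETTER A WITH HOOK ABOVE, one code point
my $decomposed = NFD("\x{1EA3}");   # 'a' followed by U+0309 COMBINING HOOK ABOVE

my $len_composed   = length $composed;
my $len_decomposed = length $decomposed;

# The collator normalizes both sides before building sort keys.
my $same = $Collator->eq($composed, $decomposed) ? 1 : 0;

print "lengths: $len_composed vs $len_decomposed, equal under collation: $same\n";
```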
      

        Sorry, still doesn't work -- see below. For example, I would expect "ỳ " (note space) to come before "ỳ ạch"

        This is what I've been struggling with for a LONG time :)

        unsorted
         ỷ : (1) to be fat (said of a pig); (2) to depend on
         ỳ : inertia, state of inactivity, stay out, inert, sluggish
         ỳ ạch : to toil, labor with difficulty
         ỷ eo : reproach someone with something
         ỷ lại : to depend, rely on others
         ỷ thế : count on ones power, ones position, ones influence
         yu nhau : to love each other, be in love
         yu qu : precious, valuable
        

        sorted
         ỷ : (1) to be fat (said of a pig); (2) to depend on
         ỳ ạch : to toil, labor with difficulty
         ỷ eo : reproach someone with something
         yu nhau : to love each other, be in love
         yu qu : precious, valuable
         ỳ : inertia, state of inactivity, stay out, inert, sluggish
         ỷ lại : to depend, rely on others
         ỷ thế : count on ones power, ones position, ones influence
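
        If the goal is for a bare headword such as "ỳ" to come directly before every entry that begins with it, one option is to compare headwords syllable by syllable rather than as whole strings. The following is a sketch under assumptions not stated in the thread: entries have the form "headword : definition", syllables are separated by whitespace, the sample entries are shortened, and the helper names split_entry and by_syllable are hypothetical.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate::Locale;

binmode(STDOUT, ':encoding(UTF-8)');

my $Collator = Unicode::Collate::Locale->new(locale => 'vi');

# Split "headword : definition" and return the headword's syllables.
sub split_entry {
    my ($entry) = @_;
    my ($headword) = split /\s*:\s*/, $entry, 2;
    return [ split ' ', $headword ];
}

# Compare two syllable lists position by position; a list that is
# a prefix of a longer one sorts first.
sub by_syllable {
    my ($x, $y) = @_;
    my $shared = @$x < @$y ? scalar(@$x) : scalar(@$y);
    for my $i (0 .. $shared - 1) {
        my $order = $Collator->cmp($x->[$i], $y->[$i]);
        return $order if $order;
    }
    return @$x <=> @$y;
}

my @entries = (
    'ỷ lại : to depend, rely on others',
    'ỳ ạch : to toil, labor with difficulty',
    'ỷ : to depend on',
    'ỳ : inertia',
);

# Schwartzian transform: split each entry once, then sort on the parts.
my @sorted =
    map  { $_->[1] }
    sort { by_syllable($a->[0], $b->[0]) }
    map  { [ split_entry($_), $_ ] } @entries;

print "$_\n" for @sorted;
```

        Because the first syllable is compared on its own, tone included, all entries sharing a first syllable stay together, and the one-syllable headword leads its group.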
        Thanks! I'll give it a try.

        When I learned Vietnamese, the order of the tones in every dictionary (all my older ones) was

        a ả ạ

        Some of my newer dictionaries use the order you mention above, but after twenty years of doing it one way, it's a little hard to change :)
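
        A custom tone order can be sketched without touching the collator's tailoring: strip the tone mark from each syllable, sort by the toneless form, and break ties with a rank table. The order in @tone_marks below (level, acute, grave, hook above, tilde, dot below) is an assumption chosen for illustration, and the helper name split_tone is hypothetical. The sketch handles single syllables only.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate::Locale;
use Unicode::Normalize qw(NFC NFD);

binmode(STDOUT, ':encoding(UTF-8)');

# ASSUMPTION: acute, grave, hook above, tilde, dot below, in that order.
# Reorder this list to match the dictionary convention you prefer.
my @tone_marks = ("\x{0301}", "\x{0300}", "\x{0309}", "\x{0303}", "\x{0323}");
my %tone_rank  = ('' => 0);                  # no mark (level tone) sorts first
@tone_rank{@tone_marks} = 1 .. scalar @tone_marks;

my $Collator = Unicode::Collate::Locale->new(locale => 'vi');

# Return the syllable without its tone mark, plus the rank of that mark.
# Vowel-quality marks (circumflex, breve, horn) are left in place.
sub split_tone {
    my ($syllable) = @_;
    my $decomposed = NFD($syllable);
    my $tone = '';
    $tone = $1 if $decomposed =~ s/([\x{0300}\x{0301}\x{0303}\x{0309}\x{0323}])//;
    return (NFC($decomposed), $tone_rank{$tone});
}

my @unsorted = ('ạ', 'a', 'ả', 'á', 'à', 'ã');

# Sort by toneless form first, then by the custom tone rank.
my @sorted =
    map  { $_->[0] }
    sort { $Collator->cmp($a->[1], $b->[1]) || $a->[2] <=> $b->[2] }
    map  { [ $_, split_tone($_) ] } @unsorted;

print "@sorted\n";
```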

        There are also some differences in how initial consonant clusters are handled: does "thu" come before "tu"? In my older dictionaries "th" and "tr" are considered single "letters", kind of like "c" and "ch" in Spanish. I figured I would let this slide for now ...
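
        Treating "th" and "tr" as single letters can be sketched by splitting each word into letters with the two-character ones matched first, then comparing positions in an explicit alphabet. The alphabet below is hypothetical and covers unaccented letters only, and the helper names letter_ranks and compare_ranks are made up for this example.

```perl
use strict;
use warnings;
use v5.10;

# HYPOTHETICAL alphabet: unaccented letters only, with "th" and "tr"
# listed as letters in their own right after "t".
my @alphabet = qw(a b c d e g h i k l m n o p q r s t th tr u v x y);
my %rank;
@rank{@alphabet} = 0 .. $#alphabet;

# Break a word into letters, preferring the two-character letters,
# and map each to its alphabet position (unknown letters sort last).
sub letter_ranks {
    my ($word) = @_;
    my @letters = lc($word) =~ /(th|tr|.)/gs;
    return [ map { $rank{$_} // scalar(@alphabet) } @letters ];
}

# Compare two rank lists position by position; a prefix sorts first.
sub compare_ranks {
    my ($x, $y) = @_;
    my $shared = @$x < @$y ? scalar(@$x) : scalar(@$y);
    for my $i (0 .. $shared - 1) {
        my $order = $x->[$i] <=> $y->[$i];
        return $order if $order;
    }
    return @$x <=> @$y;
}

my @words  = qw(thu tu tra to);
my @sorted =
    map  { $_->[0] }
    sort { compare_ranks($a->[1], $b->[1]) }
    map  { [ $_, letter_ranks($_) ] } @words;

say "@sorted";
```

        With this treatment "tu" sorts before "thu", whereas a plain letter-by-letter comparison puts "thu" first because "h" precedes "u".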

        Thanks again!
