Re^3: Sorting Vietnamese text

by farang (Hermit)
on Dec 22, 2013 at 23:48 UTC ( #1068127=note )


in reply to Re^2: Sorting Vietnamese text
in thread Sorting Vietnamese text

(1) that's still not the correct sort order ( should come before )
I've no idea, but this page indicates the opposite. You may have to create a custom sort to override the default if desired.
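A minimal sketch of what such a custom sort could look like, assuming single syllables and a tone order you supply yourself. The order in `@tone_order` is only a placeholder, and `split_tone` and `by_tone_order` are hypothetical helper names, not part of any module:

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFC NFD);
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

my $collator = Unicode::Collate::Locale->new(locale => 'vi');

# Placeholder tone order (no tone, grave, hook above, tilde, acute, dot below).
# Rearrange this list to match whichever dictionary you follow.
my @tone_order = ("", "\x{0300}", "\x{0309}", "\x{0303}", "\x{0301}", "\x{0323}");
my %tone_rank;
@tone_rank{@tone_order} = 0 .. $#tone_order;

# Split a syllable into its toneless form and the rank of its tone mark.
sub split_tone {
    my ($syllable) = @_;
    my $nfd  = NFD($syllable);
    my $tone = $nfd =~ s/([\x{0300}\x{0301}\x{0303}\x{0309}\x{0323}])// ? $1 : "";
    return (NFC($nfd), $tone_rank{$tone});
}

# Compare the letters first (via the 'vi' collator), then the tone rank.
sub by_tone_order {
    my ($x, $y) = @_;
    my ($base_x, $tone_x) = split_tone($x);
    my ($base_y, $tone_y) = split_tone($y);
    return $collator->cmp($base_x, $base_y) || $tone_x <=> $tone_y;
}

my @sorted = sort { by_tone_order($a, $b) } ('ạ', 'ả', 'a', 'à');
print "@sorted\n";
```

The letters themselves are still compared by the collator; only the ranking of the tone marks is overridden.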
(2) I actually get a different "sorted" list when I run the same exact code. This is the problem that I have - it seems the sort algorithms ignore the tone marks.
Perhaps it has to do with normalization. I still get the same sort order when using it.
#!/usr/bin/env perl
use v5.14;
use warnings;
use utf8::all;
use Unicode::Collate::Locale;
use Unicode::Normalize;

# Collator using the Vietnamese ('vi') tailoring
my $Collator = Unicode::Collate::Locale->new(locale => 'vi');
my @unsorted = ('', 'ả', '', '', 'ậ', 'ă', 'ạ', 'ẫ', 'a', 'ẩ' );
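# Decompose to NFD before collating, to rule out normalization differences
@unsorted = map { NFD($_) } @unsorted;
my @sorted = $Collator->sort(@unsorted);

# Recompose to NFC for display
say NFC("unsorted\n@unsorted");
say NFC("sorted\n@sorted");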
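As a side check on the normalization question, here is a small sketch that compares the NFC and NFD forms of one character. As far as I know, Unicode::Collate normalizes its input to NFD by default, which would explain why normalizing beforehand makes no difference to the sort order:

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate::Locale;
use Unicode::Normalize qw(NFC NFD);

my $collator = Unicode::Collate::Locale->new(locale => 'vi');

my $char = 'ậ';
my ($nfc, $nfd) = (NFC($char), NFD($char));

# The two forms differ in length (code points) but should collate as equal.
printf "NFC length: %d, NFD length: %d\n", length $nfc, length $nfd;
print "collate as equal: ", ($collator->eq($nfc, $nfd) ? "yes" : "no"), "\n";
```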


Re^4: Sorting Vietnamese text
by pdenisowski (Acolyte) on Dec 23, 2013 at 00:08 UTC
    Thanks! I'll give it a try.

    When I learned Vietnamese, the order of the tones in every dictionary (all my older ones) was

    a ả ạ

    Some of my newer dictionaries use the order you mention above, but after twenty years of doing it one way, it's a little hard to change :)

    There are also some differences in how initial consonant clusters are handled: does "thu" come before "tu"? (In my older dictionaries "th" and "tr" are considered single "letters", kind of like c and ch in Spanish.) I figured I would let this slide for now ...

    Thanks again!

Re^4: Sorting Vietnamese text
by pdenisowski (Acolyte) on Dec 23, 2013 at 02:37 UTC
    Sorry, still doesn't work -- see below. For example, I would expect "ỳ " (note space) to come before "ỳ ạch"

    This is what I've been struggling with for a LONG time :)

    unsorted
     ỷ : (1) to be fat (said of a pig); (2) to depend on
     ỳ : inertia, state of inactivity, stay out, inert, sluggish
     ỳ ạch : to toil, labor with difficulty
     ỷ eo : reproach someone with something
     ỷ lại : to depend, rely on others
     ỷ thế : count on ones power, ones position, ones influence
     yu nhau : to love each other, be in love
     yu qu : precious, valuable
    

    sorted
     ỷ : (1) to be fat (said of a pig); (2) to depend on
     ỳ ạch : to toil, labor with difficulty
     ỷ eo : reproach someone with something
     yu nhau : to love each other, be in love
     yu qu : precious, valuable
     ỳ : inertia, state of inactivity, stay out, inert, sluggish
     ỷ lại : to depend, rely on others
     ỷ thế : count on ones power, ones position, ones influence

      Are you absolutely certain your text is Unicode (UTF-8)? It's not TCVN (CP1258, ISO-2022-VN or EUC-VN), is it?

      I'm sorry if this is an "Is the power cord plugged in?" kind of question, but it just doesn't make sense that you're getting different output than farang got.

      Jim
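One rough way to answer the encoding question is to try a strict UTF-8 decode of the raw bytes. This is only a sketch: `looks_like_utf8` is a hypothetical helper name, and the second sample is a made-up byte string standing in for text in a single-byte legacy encoding.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $bytes decodes cleanly as strict UTF-8.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

# "ỳ ạch" encoded as UTF-8 bytes
my $utf8_bytes = Encode::encode('UTF-8', "\x{1EF3} \x{1EA1}ch");

# A lone 0xE0 byte followed by ASCII: malformed as UTF-8
my $legacy_bytes = "\xE0 ch";

print "utf8 sample:   ", (looks_like_utf8($utf8_bytes)   ? "ok" : "not UTF-8"), "\n";
print "legacy sample: ", (looks_like_utf8($legacy_bytes) ? "ok" : "not UTF-8"), "\n";
```

Pure ASCII passes this check too, so it only tells you something when the file actually contains accented characters.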

      sorted
       ỷ : (1) to be fat (said of a pig); (2) to depend on
       ỳ ạch : to toil, labor with difficulty
       ỷ eo : reproach someone with something
       yu nhau : to love each other, be in love
       yu qu : precious, valuable
       ỳ : inertia, state of inactivity, stay out, inert, sluggish
       ỷ lại : to depend, rely on others
       ỷ thế : count on ones power, ones position, ones influence
      
      Okay, I also get that output when using the entire lines as written. However, cutting those lines short at or before the colon ':' gives this.
      sorted
      ỳ :
      ỷ :
      ỳ ạch :
      ỷ eo :
      yu nhau :
      yu qu :
      ỷ lại :
      ỷ thế :
      
      What seems to be going on is that due to the complicated rules for ordering in Vietnamese based on syllables, having the English translation after the Vietnamese is messing up the sorting.

      I'd suggest trying to separate them into a hash if possible (split on the colon, maybe) so the sort can be based only on the Vietnamese.
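Something along these lines, as a sketch only. It assumes each line has the form "headword : definition" and that headwords are unique (a hash would silently drop duplicates); the variable names are just for illustration.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

my $collator = Unicode::Collate::Locale->new(locale => 'vi');

my @entries = (
    ' ỷ thế : count on ones power, ones position, ones influence',
    ' ỳ ạch : to toil, labor with difficulty',
    ' ỷ lại : to depend, rely on others',
    ' ỳ : inertia, state of inactivity, stay out, inert, sluggish',
);

# Split each line at the first colon only, so colons inside the
# English definition are left alone.
my %definition_for;
for my $line (@entries) {
    my ($headword, $definition) = split /\s*:\s*/, $line, 2;
    $headword =~ s/^\s+//;
    $definition_for{$headword} = $definition;
}

# Sort on the Vietnamese headwords alone.
my @headwords = $collator->sort(keys %definition_for);
print "$_ : $definition_for{$_}\n" for @headwords;
```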

        Thanks again, but what I'm looking for is that every word starting with ỳ comes before any word starting with ỷ, so even that sort order isn't quite right. Also, why are all the entries with ỷ not together? Instead of
        sorted
        ỳ :
        ỷ :
        ỳ ạch :
        ỷ eo :
        yu nhau :
        yu qu :
        ỷ lại :
        ỷ thế :
        

        it should be

        sorted
        ỳ :
        ỳ ạch :
        ỷ :
        ỷ eo :
        ỷ lại :
        ỷ thế :
        yu nhau :
        yu qu :
        
        This is how all paper dictionaries do it, regardless of which order they use for the tone marks. I'm beginning to wonder if I'm the only person who's ever cared about this before :)

        By the way, the reason I'm doing this is that I'm planning to release a large (>50,000 words) Vietnamese-English dictionary (as a single UTF-8 file) under the CC license (essentially free to use for any purpose) and I'd like to make it available in "properly" sorted order. I've done similar projects for Chinese, Esperanto, and Interlingua already (see www.denisowski.org), but those are a lot easier to sort :)

        Any other ideas? Thanks again for the help!
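One idea, offered as a sketch rather than a tested solution: compare headwords one syllable at a time, so that the tone of the first syllable is settled before the second syllable is even looked at. The helper name `by_syllables` is hypothetical, and the headwords are copied as they appear in this thread.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

my $collator = Unicode::Collate::Locale->new(locale => 'vi');

# Compare two headwords syllable by syllable; the first differing
# syllable (letters and tone together) decides the order.
sub by_syllables {
    my ($x, $y) = @_;
    my @x = split ' ', $x;
    my @y = split ' ', $y;
    while (@x && @y) {
        my $result = $collator->cmp(shift @x, shift @y);
        return $result if $result;
    }
    # All shared syllables are equal: the shorter headword goes first.
    return @x <=> @y;
}

my @headwords = ('ỷ', 'ỳ', 'ỳ ạch', 'ỷ eo', 'ỷ lại', 'ỷ thế', 'yu nhau', 'yu qu');
my @sorted    = sort { by_syllables($a, $b) } @headwords;

print "$_\n" for @sorted;
```

The design choice here is to let the collator see only one syllable at a time, so a later syllable can never outrank the tone mark of an earlier one. Whether the tone order within a syllable matches a given paper dictionary still depends on the collator's own rules.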
