PerlMonks  

Re^2: Sorting Vietnamese text

by pdenisowski (Acolyte)
on Dec 22, 2013 at 20:03 UTC ( [id://1068113]=note )


in reply to Re: Sorting Vietnamese text
in thread Sorting Vietnamese text

Here's an example of a short list of words and definitions that I want to sort in the order described above. This list:

ầm : loud, noisy

ãm : to carry in the arms

ấm chè : teapot

ám số : password, code

should be:

ám số : password, code

ấm chè : teapot

ầm : loud, noisy

ãm : to carry in the arms

Replies are listed 'Best First'.
Re^3: Sorting Vietnamese text
by pdenisowski (Acolyte) on Dec 22, 2013 at 20:11 UTC
    Correction: should be

    ám số : password, code

    ãm : to carry in the arms

    ấm chè : teapot

    ầm : loud, noisy

      I think that getting Unicode::Collate to work would be the best approach, but here's a hand-rolled one that seems to work the way you want it:

      use utf8;
      use 5.014;
      use warnings;
      use List::Util qw/min/;
      binmode STDOUT, ':encoding(UTF-8)';
      
      my %order;
      {
          my $source = join '', 'aáàảãạăắ',
                       'ằẳẵặâấầẩẫậbcdđeéèẻẽẹêếềểễ',
                       'ệfghiíìỉĩịjklmnoóòỏõọôốồổ',
                       'ỗộơớờởỡợpqrstuúùủũụưứừửữự',
                       'vwxyýỳỷỹỵz';
          my $cnt = 0;
          $order{$_} = ++$cnt for split //, $source;
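          # Compare two strings character by character using %order;
          # characters missing from the table (spaces, capitals) rank as 0.
          sub vcmp($$) {
              my ($x, $y) = @_;
              for my $i (0 .. min(length($x), length($y)) - 1) {
                  my $cmp = ($order{ substr $x, $i, 1 } // 0)
                            <=> ($order{ substr $y, $i, 1 } // 0);
                  return $cmp if $cmp != 0;
              }
              # Equal over the shared prefix: the shorter string sorts first.
              return length($x) <=> length($y);
          }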
      }
      
      say for sort { vcmp($a, $b) } ('ầm', 'ãm', 'ấm chè', 'ám số');
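
For comparison, the Unicode::Collate route mentioned above might look roughly like the sketch below. It assumes the Unicode::Collate::Locale module with its 'vi' (Vietnamese) tailoring is installed. The tone order then comes from the CLDR data rather than from the hand-written alphabet string, so it may not match the order asked for in this thread; check the output against your expected list before relying on it.

```perl
use utf8;
use 5.014;
use warnings;
use Unicode::Collate::Locale;
binmode STDOUT, ':encoding(UTF-8)';

# Collator tailored for Vietnamese. The ordering is whatever the
# locale data defines, not the hand-rolled table from this thread.
my $collator = Unicode::Collate::Locale->new(locale => 'vi');

my @words  = ('ầm', 'ãm', 'ấm chè', 'ám số');
my @sorted = $collator->sort(@words);

say for @sorted;
```

If the locale order turns out to differ from the dictionary order you need, the hand-rolled comparison is the simpler thing to adjust.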
      
      
        Thanks! That looks like it might be working correctly.

        Now to ask a dumb question - how do I replace the hard-coded list here

        say for sort { vcmp($a, $b) } ('ầm', 'ãm', 'ấm chè', 'ám số');
        
        with an array like this:

        
        ỷ : (1) to be fat (said of a pig); (2) to depend on
        ỳ : inertia, state of inactivity, stay out, inert, sluggish
        ỳ ạch : to toil, labor with difficulty
        ỷ eo : reproach someone with something
        ỷ lại : to depend, rely on others
        ỷ thế : count on one’s power, one’s position, one’s influence
        yêu nhau : to love each other, be in love
        yêu quí : precious, valuable
        

        read in from a text file using something like

        open(IN,"test.txt");
        @unsorted=<IN>;
        

        Sorry, I know that's a dumb question, but I keep getting errors saying "Global symbol XYZ requires explicit package name ..." when I try to do this.

        Thanks again!
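
A likely cause of the "Global symbol ... requires explicit package name" error: `use 5.014` turns on strict, so @unsorted has to be declared with `my`. Below is a minimal sketch of reading the lines and sorting them with the same comparison. The helper name read_sorted is made up for illustration, the alphabet table is copied from the reply above, and the demo reads from an in-memory string rather than test.txt so that it runs on its own. It also assumes the input is UTF-8 encoded.

```perl
use utf8;
use 5.014;
use warnings;
use List::Util qw/min/;
use Encode qw/encode/;
binmode STDOUT, ':encoding(UTF-8)';

# Alphabet table from the reply above (lowercase letters only).
my %order;
{
    my $source = join '', 'aáàảãạăắ',
                 'ằẳẵặâấầẩẫậbcdđeéèẻẽẹêếềểễ',
                 'ệfghiíìỉĩịjklmnoóòỏõọôốồổ',
                 'ỗộơớờởỡợpqrstuúùủũụưứừửữự',
                 'vwxyýỳỷỹỵz';
    my $cnt = 0;
    $order{$_} = ++$cnt for split //, $source;
}

# Character-by-character comparison; unknown characters rank as 0.
sub vcmp {
    my ($x, $y) = @_;
    for my $i (0 .. min(length($x), length($y)) - 1) {
        my $cmp = ($order{ substr $x, $i, 1 } // 0)
              <=> ($order{ substr $y, $i, 1 } // 0);
        return $cmp if $cmp != 0;
    }
    return length($x) <=> length($y);
}

# Hypothetical helper: read every line from $file (a file name, or a
# reference to a byte string) and return the lines sorted with vcmp.
sub read_sorted {
    my ($file) = @_;
    open(my $in, '<:encoding(UTF-8)', $file)
        or die "Cannot open input: $!";
    chomp(my @unsorted = <$in>);   # declared with 'my', as strict requires
    close $in;
    return sort { vcmp($a, $b) } @unsorted;
}

# Demo data standing in for test.txt, encoded to bytes so it can be
# opened as an in-memory file.
my $demo = encode('UTF-8', join '',
    "ầm : loud, noisy\n",
    "ãm : to carry in the arms\n",
    "ấm chè : teapot\n",
    "ám số : password, code\n");

say for read_sorted(\$demo);
```

To sort the real file, call `read_sorted('test.txt')` instead of passing the demo string. Note that the table has no capital letters, so capitalized words would need either extra entries or an `lc` on each character first.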
