PerlMonks  

Re^4: Sorting Vietnamese text

by moritz (Cardinal)
on Dec 22, 2013 at 21:04 UTC (#1068122)


in reply to Re^3: Sorting Vietnamese text
in thread Sorting Vietnamese text

I think that getting Unicode::Collate to work would be the best approach, but here's a hand-rolled comparison that seems to sort the way you want:

use utf8;
use 5.014;
use warnings;
use List::Util qw/min/;
binmode STDOUT, ':encoding(UTF-8)';

my %order;
{
    my $source = join '', 'aáàảãạăắ',
                 'ằẳẵặâấầẩẫậbcdđeéèẻẽẹêếềểễ',
                 'ệfghiíìỉĩịjklmnoóòỏõọôốồổ',
                 'ỗộơớờởỡợpqrstuúùủũụưứừửữự',
                 'vwxyýỳỷỹỵz';
    my $cnt = 0;
    $order{$_} = ++$cnt for split //, $source;
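    # Compare two strings character by character using %order;
    # characters missing from the table (spaces, punctuation) rank as 0.
    sub vcmp($$) {
        my ($x, $y) = @_;
        for my $i (0 .. min(length($x), length($y)) - 1) {
            my $cmp = ($order{ substr $x, $i, 1 } // 0)
                      <=> ($order{ substr $y, $i, 1 } // 0);
            return $cmp if $cmp != 0;
        }
        # Equal up to the shorter length: the shorter string sorts first.
        return length($x) <=> length($y);
    }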
}

say for sort { vcmp($a, $b) } ('ầm', 'ãm', 'ấm chè', 'ám số');
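
For comparison, here is a minimal sketch of the Unicode::Collate route mentioned above. It assumes Unicode::Collate::Locale (part of the Unicode::Collate distribution) is installed and that its 'vi' tailoring is acceptable. The tone-mark order it produces follows that tailoring and may not match the order in the hand-rolled table, so check it against the ordering you actually need.

```perl
use utf8;
use 5.014;
use warnings;
use Unicode::Collate::Locale;
binmode STDOUT, ':encoding(UTF-8)';

# 'vi' asks for the Vietnamese tailoring that ships with Unicode::Collate.
my $collator = Unicode::Collate::Locale->new(locale => 'vi');

# Same sample words as in the hand-rolled example.
my @words = ('ầm', 'ãm', 'ấm chè', 'ám số');

# sort() is a method on the collator object and returns the sorted list.
say for $collator->sort(@words);
```

The collator handles upper case and characters outside the table, which the hand-rolled version does not.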

Re^5: Sorting Vietnamese text
by pdenisowski (Acolyte) on Dec 23, 2013 at 16:05 UTC
    Thanks! That looks like it might be working correctly.

    Now to ask a dumb question - how do I replace the hard-coded list here

    say for sort { vcmp($a, $b) } ('ầm', 'ãm', 'ấm chè', 'ám số');
    
    with an array like this:

    
    ỷ : (1) to be fat (said of a pig); (2) to depend on
    ỳ : inertia, state of inactivity, stay out, inert, sluggish
    ỳ ạch : to toil, labor with difficulty
    ỷ eo : reproach someone with something
    ỷ lại : to depend, rely on others
    ỷ thế : count on one's power, one's position, one's influence
    yêu nhau : to love each other, be in love
    yêu quý : precious, valuable
    

    read in from a text file using something like

    open(IN,"test.txt");
    @unsorted=<IN>;
    

    Sorry, I know that's a dumb question, but I keep getting errors saying "Global symbol XYZ requires explicit package name ..." when I try to do this.

    Thanks again!
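
The "Global symbol ... requires explicit package name" error comes from strict mode, which `use 5.014;` turns on implicitly: every variable has to be declared with `my`, and `@unsorted` above is not. Here is a minimal sketch of one way to read the file and sort it. It assumes test.txt is UTF-8 encoded with one entry per line. `read_lines` is a hypothetical helper name, and the comparison sub is repeated from the parent node so the script runs on its own.

```perl
use utf8;
use 5.014;          # implies "use strict", hence the need for "my"
use warnings;
use List::Util qw/min/;
binmode STDOUT, ':encoding(UTF-8)';

# Ordering table from the parent node (lowercase letters only).
my %order;
{
    my $source = join '', 'aáàảãạăắ',
                 'ằẳẵặâấầẩẫậbcdđeéèẻẽẹêếềểễ',
                 'ệfghiíìỉĩịjklmnoóòỏõọôốồổ',
                 'ỗộơớờởỡợpqrstuúùủũụưứừửữự',
                 'vwxyýỳỷỹỵz';
    my $cnt = 0;
    $order{$_} = ++$cnt for split //, $source;
}

# Character-by-character comparison; unknown characters rank as 0.
sub vcmp {
    my ($x, $y) = @_;
    for my $i (0 .. min(length($x), length($y)) - 1) {
        my $cmp = ($order{ substr $x, $i, 1 } // 0)
                  <=> ($order{ substr $y, $i, 1 } // 0);
        return $cmp if $cmp;
    }
    return length($x) <=> length($y);
}

# Hypothetical helper: read a UTF-8 file into a list of chomped lines.
sub read_lines {
    my ($path) = @_;
    # Lexical filehandle plus an encoding layer, so substr() sees characters.
    open my $in, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    chomp(my @lines = <$in>);
    close $in;
    return @lines;
}

my $file = 'test.txt';   # assumed file name, taken from the question
if (-e $file) {
    my @unsorted = read_lines($file);    # "my" satisfies strict
    say for sort { vcmp($a, $b) } @unsorted;
}
```

Because characters outside the table all rank as 0, lines are ordered by the Vietnamese letters in them only. The colon and the English definitions act only as ties that fall through to the length comparison.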
