PerlMonks  

Re^4: Sorting Vietnamese text

by moritz (Cardinal)
on Dec 22, 2013 at 21:04 UTC (#1068122)


in reply to Re^3: Sorting Vietnamese text
in thread Sorting Vietnamese text

I think that getting Unicode::Collate to work would be the best approach, but here's a hand-rolled one that seems to work the way you want it:

use utf8;
use 5.014;
use warnings;
use List::Util qw/min/;
binmode STDOUT, ':encoding(UTF-8)';

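my %order;
{
    # Vietnamese alphabet in the desired sort order, one character per rank.
    my $source = join '', 'aáàảãạăắ',
                 'ằẳẵặâấầẩẫậbcdđeéèẻẽẹêếềểễ',
                 'ệfghiíìỉĩịjklmnoóòỏõọôốồổ',
                 'ỗộơớờởỡợpqrstuúùủũụưứừửữự',
                 'vwxyýỳỷỹỵz';
    my $cnt = 0;
    $order{$_} = ++$cnt for split //, $source;

    # Compare two strings character by character using %order.
    # Characters not in the table (spaces, punctuation, capitals) rank as 0.
    sub vcmp($$) {
        my ($x, $y) = @_;    # named $x/$y so sort's $a and $b are not shadowed
        for my $i (0 .. min(length($x), length($y)) - 1) {
            my $cmp = ($order{ substr $x, $i, 1 } // 0)
                      <=> ($order{ substr $y, $i, 1 } // 0);
            return $cmp if $cmp != 0;
        }
        # One string is a prefix of the other: the shorter one sorts first.
        return length($x) <=> length($y);
    }
}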

say for sort { vcmp($a, $b) } ('ầm', 'ãm', 'ấm chè', 'ám số');


Re^5: Sorting Vietnamese text
by pdenisowski (Acolyte) on Dec 23, 2013 at 16:05 UTC
    Thanks! That looks like it might be working correctly.

    Now to ask a dumb question - how do I replace the hard-coded list here

    say for sort { vcmp($a, $b) } ('ầm', 'ãm', 'ấm chè', 'ám số');
    
    with an array like this:

    
    ỷ : (1) to be fat (said of a pig); (2) to depend on
    ỳ : inertia, state of inactivity, stay out, inert, sluggish
    ỳ ạch : to toil, labor with difficulty
    ỷ eo : reproach someone with something
    ỷ lại : to depend, rely on others
    ỷ thế : count on ones power, ones position, ones influence
    yêu nhau : to love each other, be in love
    yêu quý : precious, valuable
    

    read in from a text file using something like

    open(IN,"test.txt");
    @unsorted=<IN>;
    

    Sorry, I know that's a dumb question, but I keep getting errors saying "Global symbol XYZ requires explicit package name ..." when I try to do this.

    Thanks again!
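
For what it's worth, the "Global symbol ... requires explicit package name" error usually means a variable was used without being declared: `use 5.014;` turns on strict, so `@unsorted` needs a `my`. Here is a minimal sketch of one way to read the file and sort it with the comparison from the post above. It assumes the file is UTF-8, uses precomposed (NFC) characters, and is all lower case; the helper names `read_entries` and `sorted_entries` are made up for illustration.

```perl
use utf8;
use 5.014;
use warnings;
use List::Util qw/min/;
binmode STDOUT, ':encoding(UTF-8)';

# Ordering table as in the post above: each character's rank is its position.
my %order;
{
    my $source = join '', 'aáàảãạăắ',
                 'ằẳẵặâấầẩẫậbcdđeéèẻẽẹêếềểễ',
                 'ệfghiíìỉĩịjklmnoóòỏõọôốồổ',
                 'ỗộơớờởỡợpqrstuúùủũụưứừửữự',
                 'vwxyýỳỷỹỵz';
    my $cnt = 0;
    $order{$_} = ++$cnt for split //, $source;
}

# Character-by-character comparison; unknown characters rank as 0.
sub vcmp {
    my ($x, $y) = @_;
    for my $i (0 .. min(length($x), length($y)) - 1) {
        my $cmp = ($order{ substr $x, $i, 1 } // 0)
                  <=> ($order{ substr $y, $i, 1 } // 0);
        return $cmp if $cmp != 0;
    }
    return length($x) <=> length($y);
}

# Hypothetical helper: read a UTF-8 text file into a list of lines,
# dropping the trailing newlines and any blank lines.
sub read_entries {
    my ($path) = @_;
    open my $fh, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    chomp(my @lines = <$fh>);
    close $fh;
    return grep { /\S/ } @lines;
}

# Hypothetical helper: the "my" on @unsorted is what strict asks for.
sub sorted_entries {
    my ($path) = @_;
    my @unsorted = read_entries($path);
    my @sorted   = sort { vcmp($a, $b) } @unsorted;
    return @sorted;
}

# Print the sorted file when a path is given on the command line.
my $path = shift @ARGV;
if (defined $path) {
    say for sorted_entries($path);
}
```

Run it as `perl sort_vi.pl test.txt` (the script name is arbitrary). Note that the whole line, definition included, takes part in the comparison, and that capitals, digits and punctuation all rank as 0 because they are not in the table.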
