PerlMonks  

Re: Sorting Vietnamese text

by karlgoethebier (Curate)
on Dec 22, 2013 at 18:20 UTC (#1068105)


in reply to Sorting Vietnamese text

"...Vietnamese text file that I would like to sort..."

Perhaps you could show an excerpt from the text file, along with a simple example of the result you expect?

Regards, Karl

The Crux of the Biscuit is the Apostrophe


Re^2: Sorting Vietnamese text
by pdenisowski (Acolyte) on Dec 22, 2013 at 20:03 UTC
    Here's an example of a short list of words and definitions that I want to sort in the order described above:

    ầm : loud, noisy

    ãm : to carry in the arms

    ấm chè : teapot

    ám số : password, code

    should be

    ám số : password, code

    ấm chè : teapot

    ầm : loud, noisy

    ãm : to carry in the arms

      Correction: should be

      ám số : password, code

      ãm : to carry in the arms

      ấm chè : teapot

      ầm : loud, noisy

        I think that getting Unicode::Collate to work would be the best approach, but here's a hand-rolled one that seems to work the way you want it:

        use utf8;
        use 5.014;
        use warnings;
        use List::Util qw/min/;
        binmode STDOUT, ':encoding(UTF-8)';
        
        # Rank of each lowercase letter in the desired alphabet order.
        my %order;
        {
            my $source = join '', 'aáàảãạăắ',
                         'ằẳẵặâấầẩẫậbcdđeéèẻẽẹêếềểễ',
                         'ệfghiíìỉĩịjklmnoóòỏõọôốồổ',
                         'ỗộơớờởỡợpqrstuúùủũụưứừửữự',
                         'vwxyýỳỷỹỵz';
            my $cnt = 0;
            $order{$_} = ++$cnt for split //, $source;
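            # Compare two strings letter by letter using %order;
            # characters missing from %order rank as 0.
            sub vcmp($$) {
                my ($x, $y) = @_;    # avoid shadowing sort's $a and $b
                for my $i (0 .. min(length($x), length($y)) - 1) {
                    my $cmp = ($order{ substr $x, $i, 1 } // 0)
                              <=> ($order{ substr $y, $i, 1 } // 0);
                    return $cmp if $cmp != 0;
                }
                # Equal up to the shorter length: the shorter string sorts first.
                return length($x) <=> length($y);
            }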
        }
        
        say for sort { vcmp($a, $b) } ('ầm', 'ãm', 'ấm chè', 'ám số');
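
For longer lists, one possible variation on the hand-rolled comparison is to compute a sort key once per string and then sort with plain cmp, instead of walking both strings on every comparison. This is only a sketch under the same assumptions as above (lowercase input, precomposed characters, anything outside the alphabet ranked 0); the names @alphabet, %rank and vkey are made up for this illustration.

```perl
use utf8;
use 5.014;
use warnings;
binmode STDOUT, ':encoding(UTF-8)';

# Same letter order as the hand-rolled comparison above.
my @alphabet = split //, join '',
    'aáàảãạăắằẳẵặâấầẩẫậbcdđeéèẻẽẹêếềểễệ',
    'fghiíìỉĩịjklmnoóòỏõọôốồổỗộơớờởỡợ',
    'pqrstuúùủũụưứừửữựvwxyýỳỷỹỵz';

# Rank of each letter, starting at 1; unknown characters get rank 0 below.
my %rank;
@rank{@alphabet} = 1 .. @alphabet;

# Hypothetical helper: turn a string into a byte string of 16-bit
# big-endian ranks, so that cmp on the keys follows the letter order.
sub vkey {
    my ($text) = @_;
    return pack 'n*', map { $rank{$_} // 0 } split //, $text;
}

my @words = ('ầm', 'ãm', 'ấm chè', 'ám số');

# Schwartzian transform: compute each key once, sort on it, drop it again.
my @sorted = map  { $_->[1] }
             sort { $a->[0] cmp $b->[0] }
             map  { [ vkey($_), $_ ] } @words;

say for @sorted;
```

Because a shorter key that is a prefix of a longer one compares as smaller, a word still sorts before any phrase that starts with it, which is the same tie-break the length comparison in vcmp gives.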
        
        
