PerlMonks  

New Alphabet Sort Order

by Polyglot (Monk)
on Apr 03, 2011 at 17:07 UTC ( #897216=perlquestion )
Polyglot has asked for the wisdom of the Perl Monks concerning the following question:

I've been asked to help with a project involving some Lao script. I need to alphabetize lists of words in Lao. However, Lao characters are only barely defined in Perl, e.g. \p{InLao} to identify a Lao character, and I have been unable to find a predefined localedef or similar for Lao. Searching PerlMonks revealed virtually nothing on localedef; as it turns out, Perl may use it, but it seems to come from a C library.

It appears a new Lao alphabet routine is needed. I may have to generate the rules for alphabetizing...here's the tough part: Lao is not a typical job for an alphabetic sort.

  1. Lao words are first sorted by consonant order.
  2. Vowels follow consonants in terms of alphabetical order, but not necessarily in terms of chronological order. For example, some vowels appear before the consonant even though they are pronounced after the consonant, and the alphabetical order follows pronunciation.
  3. After the typical list of single-character consonants, Lao has some "diphthong" consonants (double-character ones) which have their own alphabetical placements.
All of this adds up to a challenging puzzle for a perl enthusiast. I welcome your thoughts on how this could be done, and/or how it should be done in a way that would follow standard practice and be able to serve the entire Perl community for Lao script.

I have already developed a "Lao.pm" module (not yet submitted to CPAN, and may need to use a different namespace) that will identify Lao characters by consonant, vowel, punctuation, and tone marks, and will further classify the consonants by their Lao classes (high/mid/low). So I have the tools for distinguishing at the character level, e.g. \p{Lao::InLaoCons}\p{Lao::InLaoTone}\p{Lao::InLaoVowel}, but need to map the characters to an alphabetical order, and this part seems beyond my experience.
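For readers unfamiliar with properties like \p{Lao::InLaoCons}, here is a minimal sketch of how such user-defined properties can be declared in Perl. This is not the poster's unpublished Lao.pm; the sub bodies and the code point ranges are assumptions taken from the Unicode Lao block, and classify() is a hypothetical helper added only for illustration.

```perl
use strict;
use warnings;

package Lao;

# Hypothetical stand-ins for the classes described above. Each sub returns
# lines of "START<tab>END" hex code points; the ranges are assumptions.
sub InLaoCons  { return "0E81\t0EAE\n" }                            # LAO LETTER KO .. HO TAM
sub InLaoTone  { return "0EC8\t0ECB\n" }                            # MAI EK .. MAI CATAWA
sub InLaoVowel { return "0EB0\t0EB9\n0EBB\t0EBB\n0EC0\t0EC4\n" }    # vowel signs, incl. prefixed ones

package main;

# Hypothetical helper: label every character of a word by its class.
sub classify {
    my $word = shift;
    return map {
          /\p{Lao::InLaoCons}/  ? 'consonant'
        : /\p{Lao::InLaoVowel}/ ? 'vowel'
        : /\p{Lao::InLaoTone}/  ? 'tone'
        :                         'other'
    } split //, $word;
}

# LAO VOWEL SIGN AI (written first) followed by LAO LETTER PO
my @classes = classify("\x{0EC4}\x{0E9B}");
print "@classes\n";
```

Classification alone does not give an ordering, which is the part the question is about, but it does make it easy to spot the prefixed vowels that have to be moved before comparing.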

Blessings,

~Polyglot~

Re: New Alphabet Sort Order
by Corion (Pope) on Apr 03, 2011 at 17:15 UTC

    I would take a look at the Collate name space, especially Unicode::Collate. That module provides string sorting for some other "weird" alphabets already, and even if you can't use it, its API might be enough for you to start your own string comparison.
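As a rough idea of what that API looks like, here is a minimal sketch using Unicode::Collate with its default settings. It only demonstrates the sort and cmp methods on Latin words; whether the default collation table orders Lao the way a Lao dictionary does is an assumption you would need to check against real word lists.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new;

my @words  = qw(cherry Banana apple);
my @plain  = sort @words;                # code point order: uppercase first
my @sorted = $collator->sort(@words);    # Unicode Collation Algorithm order

print "plain:    @plain\n";
print "collated: @sorted\n";

# The same collator can also serve as a comparison routine for sort.
my @again = sort { $collator->cmp( $a, $b ) } @words;
```

The plain sort puts "Banana" first because uppercase letters have lower code points, while the collator compares the letters first and case only as a tie-breaker.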

Re: New Alphabet Sort Order
by Khen1950fx (Canon) on Apr 03, 2011 at 17:27 UTC
    Alo might prove to be useful for you. I'd give it a ++.
      Khen, thank you for the tip. I took a look at that and asked up the line about it. I guess many people in Laos still use that font system. However, for this project, we intend to use Unicode. Unicode still seems a bit new to Lao, so it will not be without some difficulties. But we feel it may be the best and most future-compatible standard to follow.

      Blessings,

      ~Polyglot~

        Polyglot,

        I think BrowserUk's solution is the direction you'll end up going. The ST and GRT he referred to are the Schwartzian Transform and Guttman Rosler Transform, respectively. You only need the first.

        The link explains it, but basically you transform a list of things you want to sort into a list of two-element arrays. The first element is the key, set up so you can easily sort it. The second is the original element, untouched. You can do it in separate steps, but the transform is more efficient if you have a lot of elements. The key thing you need is a function that creates a key you can sort on.

        Here's an example, sorting movie names, done separately first.
            use Lingua::EN::Numbers qw(num2en);

            sub make_key {
                my $key = shift;    # work on a copy so the caller's $_ is left alone
                # Drop a leading article, or spell out a leading number
                $key =~ s/^(?:The|An|A) // || $key =~ s/^\W*(\d+)/num2en($1)/e;
                return lc $key;     # lowercase so cmp ignores case
            }

            my @movies = (
                '(500) Days of Summer',             # F for Five hundred
                'The Music Man',                    # M for Music
                'The Good, the Bad, and the Ugly'   # G for Good
            );

            my @tmp = ();
            for (@movies) {
                push @tmp, [ make_key($_), $_ ];    # 2-element anonymous array
            }
            @tmp    = sort { $a->[0] cmp $b->[0] } @tmp;    # sort on first elements
            @movies = map  { $_->[1] } @tmp;                # pull off second element from each
                                                            # anonymous array
            print "$_\n" for @movies;

            __END__
            Prints:
            (500) Days of Summer
            The Good, the Bad, and the Ugly
            The Music Man
        Now, MUCH less complicated:
            use Lingua::EN::Numbers qw(num2en);

            sub make_key {
                my $key = shift;    # work on a copy so the caller's $_ is left alone
                # Drop a leading article, or spell out a leading number
                $key =~ s/^(?:The|An|A) // || $key =~ s/^\W*(\d+)/num2en($1)/e;
                return lc $key;     # lowercase so cmp ignores case
            }

            my @movies = (
                '(500) Days of Summer',             # F for Five hundred
                'The Music Man',                    # M for Music
                'The Good, the Bad, and the Ugly'   # G for Good
            );

            # Here's the transform. Read from the bottom up.
            @movies =                           # original elements replaced with same, but sorted
                map  { $_->[1] }                # pull off second element from each
                sort { $a->[0] cmp $b->[0] }    # sort arrays on first elements
                map  { [ make_key($_), $_ ] }   # 2-element anonymous array becomes new $_
                @movies;

            print "$_\n" for @movies;

            __END__
            Prints:
            (500) Days of Summer
            The Good, the Bad, and the Ugly
            The Music Man
        Good luck!

        --marmot

        UPDATE: Corrected a typo bug in the ST that came from copying the first version.
Re: New Alphabet Sort Order
by BrowserUk (Pope) on Apr 03, 2011 at 17:35 UTC

    In essence, all you need is a custom string comparison routine. And the easiest, most efficient way to do that in Perl is to use the tr/// operator to 'encode' the strings and then use the standard cmp operator.

    Say your custom sort rules call for 0-4 to be sorted before alphas, and 5-9 after. And within the alphas, you want upper and lower case of any given character to be sorted together. Then you set up a mapping that maps the original strings to characters that will sort in the required order.

    For the given example:

        tr[0-4AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz5-9][\x00-\xff];
    will do the trick. Now, you just transliterate your string and sort in the normal way:
        #! perl -slw
        use strict;

        sub trans {
            my $in = shift;
            $in =~ tr[0-4AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz5-9][\x00-\xff];
            return $in;
        }

        chomp( my @data = <DATA> );
        my @sorted = sort{ trans( $a ) cmp trans( $b ) } @data;
        print for @sorted;

        __DATA__
        cdef
        0123456
        abcd
        50011
        ABCD
        4999
        Zxyw
        CDEF
        zxyw
        9999

    Produces:

        c:\test>junk78
        0123456
        4999
        ABCD
        abcd
        CDEF
        cdef
        Zxyw
        zxyw
        50011
        9999

    Of course, you can now apply all the usual forms of sort optimisations--ST, GRT etc.--to that, but the mechanism remains the same.
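For reference, a minimal sketch of the GRT shape applied to this kind of key. The idea is to glue the key and the original string into one string so that the built-in sort can run without a comparison block. Note that make_key() here is a hypothetical stand-in (plain lowercasing), not the trans() above, and the sketch assumes that neither the key nor the data contains "\0", which is used as the separator.

```perl
use strict;
use warnings;

# Hypothetical key function; it must never produce "\0" in its output.
sub make_key { return lc shift }

my @data = qw(banana Apple cherry apple);

# Pack: sort key, a NUL separator, then the untouched original.
my @packed = map { make_key($_) . "\0" . $_ } @data;

# Sort the packed strings with the default string comparison,
# then cut everything up to and including the separator.
my @sorted = map { substr $_, 1 + index( $_, "\0" ) } sort @packed;

print "$_\n" for @sorted;
```

Because the separator is the lowest possible character, a key that is a prefix of another key still sorts first, and equal keys fall back to comparing the originals.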

    The ugly head of UTF will probably complicate things a little, but, at least in the eyes of my non-UTF aware brain, it should be possible to apply the same mechanism.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Because some of the alphabetical order is dependent upon character combinations (more than one character together), the tr/// approach is not adequate, though a nice idea. I may need to use something more along the lines of mapping characters and sequences, and then sorting based on that map.

      Blessings,

      ~Polyglot~

        Because some of the alphabetical order is dependent upon character combinations (more than one character together), the tr/// approach is not adequate,

        Okay, but you can still use essentially the same mechanism. Just set up a hash with the ordering and a re to pick out the 'characters'. Then use the re in combination with the hash to perform the mapping. This way, you should be able to cater for any mapping you can describe.

        If you run this, you'll see the numbers sorted before the consonants, the consonants before the vowels, and the vowels before the (artificial) diphthongs (CH, SH, TH, WH):

            #! perl -slw
            use strict;

            my @order = (
                0 .. 9,
                'B'..'D', 'F'..'H', 'J'..'N', 'P'..'T', 'V'..'Z',
                'b'..'d', 'f'..'h', 'j'..'n', 'p'..'t', 'v'..'z',
                'A', 'E', 'I', 'O', 'U', 'a', 'e', 'i', 'o', 'u',
                'CH', 'SH', 'TH', 'WH', 'ch', 'sh', 'th', 'wh',
            );

            my $re = join '|', sort{ length $b <=> length $a } @order;

            my $n = 0;
            my %map = map{ $_ => chr( $n++ ) } @order;

            sub trans {
                my $in = shift;
                $in =~ s[($re)]{ $map{ $1 } }ge;
                return $in;
            }

            chomp( my @data = map{ split ' ' } <DATA> );
            my @sorted = sort{ trans( $a ) cmp trans( $b ) } @data;
            print for @sorted;

            __DATA__
            I've been asked to help with a project involving some Lao script.
            I need to alphabetize lists of words in Lao. However, Lao characters
            are only barely defined in Perl, e.g. \p{InLao} to identify a Lao
            character, and I have been unable to find a predefined localedef or
            similar for Lao. Searching perlmonks revealed virtually nothing on
            localedef, and as it turns out, perl may use it, but it seems to come
            from a C library. It appears a new Lao alphabet routine is needed.
            I may have to generate the rules for alphabetizing...here's the tough
            part: Lao is not a typical job for an alphabetic sort. Lao words are
            first sorted by consonant order. Vowels follow consonants in terms of
            alphabetical order, but not necessarily in terms of chronological
            order. For example, some vowels appear before the consonant even
            though they are pronounced after the consonant, and the alphabetical
            order follows pronunciation. After the typical list of
            single-character consonants, Lao has some "diphthong" consonants
            (double-character ones) which have their own alphabetical placements.
            All of this adds up to a challenging puzzle for a perl enthusiast.
            I welcome your thoughts on how this could be done, and/or how it
            should be done in a way that would follow standard practice and be
            able to serve the entire Perl community for Lao script.
            I have already developed a "Lao.pm" module (not yet submitted to
            CPAN, and may need to use a different namespace) that will identify
            Lao characters by consonant, vowel, punctuation, and tone marks, and
            will further classify the consonants by their Lao classes
            (high/mid/low). So I have the tools for distinguishing at the
            character level, e.g.
            \p{Lao::InLaoCons}\p{Lao::InLaoTone}\p{Lao::InLaoVowel}, but need to
            map the characters to an alphabetical order, and this part seems
            beyond my experience.

Re: New Alphabet Sort Order
by davies (Vicar) on Apr 03, 2011 at 22:32 UTC

    Have a look to see if anyone has done anything about Welsh. I believe that the Welsh alphabet starts A B C CH D DD E F FF, using letter combinations as single letters. I speak very little, but it was my father's first language. His father couldn't speak decent English until the day he died.

    Regards,

    John Davies

      You might also look at Spanish in the same way. As I recall from my Spanish classes at school, CH is treated as a first class letter for the purposes of ordering and gets a separate section in alphabetical lists such as dictionaries.
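To make the digraph idea concrete, here is a minimal sketch that sorts a few Welsh words with CH, DD and friends treated as single letters, using the same hash-plus-regex mapping described earlier in the thread. The alphabet list is the one commonly given for Welsh; welsh_key() is a hypothetical helper, and the sketch ignores real-world complications such as letter pairs that only look like digraphs.

```perl
use strict;
use warnings;

# Welsh alphabet in order; digraphs count as single letters.
my @alphabet = qw(a b c ch d dd e f ff g ng h i j l ll m n o p ph r rh s t th u w y);

# Map each letter to a character whose code point is its alphabet position.
my $n    = 1;
my %rank = map { $_ => chr( $n++ ) } @alphabet;

# Longest alternatives first, so "ch" wins over "c" followed by "h".
my $letter = join '|', sort { length $b <=> length $a } @alphabet;

# Hypothetical helper: turn a word into a string that sorts correctly with cmp.
sub welsh_key {
    my $word = lc shift;
    $word =~ s/($letter)/$rank{$1}/g;
    return $word;
}

my @words  = qw(dydd chwech cyw ddoe cath dafad);
my @sorted =
    map  { $_->[1] }                  # recover the original word
    sort { $a->[0] cmp $b->[0] }      # compare the encoded keys
    map  { [ welsh_key($_), $_ ] }    # pair each word with its key
    @words;

print "$_\n" for @sorted;
```

A plain sort would put "chwech" before "cyw" and "ddoe" before "dydd"; with the mapping, every word starting with C comes before CH, and every D before DD.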

Re: New Alphabet Sort Order
by Anonymous Monk on Apr 04, 2011 at 11:11 UTC
    In the past I have made sorting routines for Sanskrit and other Indian languages, which have the same alphabetical structure as Lao (Lao script is actually a derivative of the Indian script system). I found it much too tedious to try to sort the Indian script itself, represented in Unicode, so instead I first had Perl change the text into a Roman transliteration, sorted it, and then turned it back into Unicode. Of course, you must make provisions to move prefixed vowels to behind any consonants in the transliteration. Then I think BrowserUk's second approach will work just fine (I myself used a much more crude and inelegant solution).
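The "move prefixed vowels behind the consonant" step can be sketched directly on the Unicode text, without a transliteration round trip. This is only an illustration under assumptions: it treats U+0EC0..U+0EC4 as the prefixed vowels and U+0E81..U+0EAE as the consonants, logical_order() is a hypothetical helper, and plain code point order stands in for the real Lao alphabetical order, which would still need a proper mapping.

```perl
use strict;
use warnings;

# Assumed character classes from the Unicode Lao block.
my $PREVOWEL  = qr/[\x{0EC0}-\x{0EC4}]/;    # vowels written before the consonant
my $CONSONANT = qr/[\x{0E81}-\x{0EAE}]/;

# Hypothetical helper: swap each prefixed vowel with the consonant it precedes,
# so the key is in pronunciation order rather than written order.
sub logical_order {
    my $word = shift;
    $word =~ s/($PREVOWEL)($CONSONANT)/$2$1/g;
    return $word;
}

my @words = (
    "\x{0EA1}\x{0EB2}",    # LAO LETTER MO + VOWEL SIGN AA
    "\x{0EC4}\x{0E9B}",    # VOWEL SIGN AI (written first) + LAO LETTER PO
    "\x{0E81}\x{0EB2}",    # LAO LETTER KO + VOWEL SIGN AA
    "\x{0E9B}\x{0EB2}",    # LAO LETTER PO + VOWEL SIGN AA
);

# Sort on the reordered key, but keep the original spelling.
my @sorted =
    map  { $_->[1] }
    sort { $a->[0] cmp $b->[0] }
    map  { [ logical_order($_), $_ ] }
    @words;
```

With the swap, the word written AI + PO sorts among the PO words, ahead of the MO word; without it, the leading vowel sign would push it to the end of the list.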
Re: New Alphabet Sort Order
by thundergnat (Deacon) on Apr 04, 2011 at 17:18 UTC

    A little late to the game perhaps, but you may want to take a look at Sort::ArbBiLex. It is designed to handle exactly this kind of situation.

      I was out of town for a while and am just getting back to this. Your "late in the game" post was a tremendous help to me here. I think this is the sort of solution I was looking for. In studying that package, I see that it does use the Schwartzian transform recommended by others here. Furthermore, there are some who have written nice articles on how to use this package for sorting various language alphabets in Perl. Here is one of them, titled "International Sorting with Perl's sort."

      Again, thank you and ++ to this. There may still be some conundrums with Lao, but I am looking into those, chiefly compound vowels.

      Blessings,

      ~Polyglot~

Node Type: perlquestion [id://897216]
Approved by Corion
Front-paged by planetscape