PerlMonks  

Sorting according to locale collation

by amir_e_a (Hermit)
on Apr 22, 2007 at 10:52 UTC
amir_e_a has asked for the wisdom of the Perl Monks concerning the following question:

(This message uses special characters, which may be displayed incorrectly on your system:

  • Į is supposed to look like a capital I with a tail below.
  • į is supposed to look like a small i with a tail below.
  • ž is supposed to look like a small z with a v above.
  • č is supposed to look like a small c with a v above.

Thanks for understanding.)

Hi,

I'm writing a Perl program that processes a text in the Lithuanian language.

Lithuanian uses the Latin alphabet with some modifications. One of the curious properties of this language is its alphabetic order. There are three variations of the letter I: I, Y and Į (the last one is supposed to look like an I with a tail). They are treated as one and the same letter when words are sorted alphabetically, so the Y comes neither between X and Z nor between I and J, but is mixed in with I. For example, the following list is taken straight from a dictionary:

  • inžinerius
  • ypač
  • įpainiojimas
  • įpakavimas
  • įpareigojimas
  • ypatybe
  • įpedin

As you see, these words are sorted as if they were all written with an I.

As far as i understand, Perl's sort function uses the locale to determine the collation order. I'm trying to set a Lithuanian locale by saying

use locale;
use POSIX;
setlocale(LC_ALL, 'lt');

... And then i run:

my @sorttest = qw(ia ib ic ya yb yc);
for (sort @sorttest) {
    print "$_\n";
}

According to Lithuanian rules this should have printed:
ia
ya
ib
yb
ic
yc

... but instead it prints:
ia
ib
ic
ya
yb
yc

Am i doing something wrong? Or maybe Perl just doesn't know about the different locales? Or maybe Perl is supposed to know, but it is a bug?

I tried the above code with ActivePerl 5.8.8 build 820 on Windows XP SP2 and with Perl 5.8.0 on Red Hat Enterprise Linux kernel 2.4.21 (i'm not sure which version of RHEL exactly it is). The results were the same.

Thanks in advance for any help.

Re: Sorting according to locale collation
by Krambambuli (Deacon) on Apr 22, 2007 at 11:54 UTC
    'locale' is almost entirely dependent on the underlying OS, and then further on the libs and installed modules. Have a look at the perllocale manual to see how to tackle this.

    I think you might do 2 things to go on:

    1) Check what locale settings are installed/available to you in the working environment (perllocale suggests a number of ways to do that)

    2) Once you have made sure the locale you want to use is available, go ahead with your code, but always check that the setting was successful (setlocale returns undef on failure, and on success the returned name need not match the requested string exactly):

    ...
    my $loc = setlocale(LC_ALL, 'lt');
    die 'Could not switch to locale "lt"!' unless defined $loc;
    ....

      Check what locale settings are installed/available to you in the working environment (perllocale suggests a number of ways to do that)

      I tried `locale -a' on Linux, and found out that it doesn't have 'lt', but it does have lt_LT, which should be the same. (It also has 'lithuanian' ... Is it a standard in Unix or some GNU extension?)

      Anyway, i tried running this:

      use strict;
      use warnings;
      use locale;
      use POSIX;

      my $loc = setlocale(LC_ALL, 'lt');
      if (defined $loc) {
          print "loc is defined\n";
          print "loc value: *$loc*\n";
      }
      else {
          print "loc is undefined\n";
      }

      my @sorttest = qw(ia ib ic ya yb yc);
      for (sort @sorttest) {
          print "$_\n";
      }

      When i try 'lt_LT', $loc is 'lt_LT'. When i try 'lt', $loc is undefined, so i must be in the right direction.

      However, the sorting still doesn't work as i would expect. Can it really be a bug in Linux?

      Also, perllocale suggests only Unix'ish ways to list available locales. Is there anything like `locale -a' on Windows? I'm trying to be portable.
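Since the available locale names differ between systems, one portable approach is to try several candidate names and keep the first one that setlocale accepts. The following is only a sketch: first_available_locale is a hypothetical helper, and the candidate names (in particular the Windows-style 'Lithuanian_Lithuania.1257') are assumptions that should be checked against each target system.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_COLLATE);

# Candidate names are assumptions; adjust them to what your systems offer.
my @candidates = (
    'lt_LT.UTF-8', 'lt_LT', 'lt', 'lithuanian',
    'Lithuanian_Lithuania.1257',
);

# Hypothetical helper: returns the name of the first locale that
# setlocale() accepts, or undef/empty list if none of them is available.
sub first_available_locale {
    my @names = @_;
    my $original = setlocale(LC_COLLATE);    # query the current setting
    for my $name (@names) {
        my $got = setlocale(LC_COLLATE, $name);
        return $got if defined $got;
    }
    # Nothing worked: restore whatever was active before.
    setlocale(LC_COLLATE, $original) if defined $original;
    return;
}

my $loc = first_available_locale(@candidates);
print defined $loc
    ? "using locale: $loc\n"
    : "no Lithuanian locale found\n";
```

Only LC_COLLATE is switched here because collation is the only category the sort depends on; LC_ALL would work the same way.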

Re: Sorting according to locale collation
by betterworld (Deacon) on Apr 22, 2007 at 12:00 UTC
    According to Lithuanian rules this should have printed:
    ia
    ya
    ib
    yb
    ic
    yc

    Hmm, I don't know why perl does not sort these correctly. But just out of curiosity: You said that "i" and "y" are treated the same. Would it still be right if you swap "ia" and "ya" in that list?

    The sort function, when not given a code block, uses the "cmp" operator, which does use the locale according to perlop. Does the Unix utility sort(1) behave correctly?
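To see what the C library's collation does without relying on `use locale`, the comparison can be made explicit with POSIX::strcoll. This is a minimal sketch assuming the 'lt_LT' name reported by `locale -a` earlier in the thread; on a system without that locale it falls back to whatever locale is currently active.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_COLLATE strcoll);

# 'lt_LT' may not exist on other systems; setlocale returns undef then.
my $loc = setlocale(LC_COLLATE, 'lt_LT');
warn "lt_LT not available, using the current locale\n" unless defined $loc;

my @words = qw(ia ib ic ya yb yc);

# strcoll() compares two strings according to the active LC_COLLATE.
my @sorted = sort { strcoll($a, $b) } @words;
print "$_\n" for @sorted;
```

If this prints the same order as the plain sort, the ordering comes from the system's locale definition rather than from Perl.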

      just out of curiosity: You said that "i" and "y" are treated the same. Would it still be right if you swap "ia" and "ya" in that list?

      I'm not Lithuanian - i just studied it a little in the University. From what i've seen in dictionaries and grammar books, when the letter following I/Y is the same, I comes before Y.

      Does the Unix utility sort(1) behave correctly?

      I tried running this:

      [root@sugarcube loc]# LC_COLLATE="lt_LT"
      [root@sugarcube loc]# export LC_COLLATE
      [root@sugarcube loc]# locale
      LANG=en_US.UTF-8
      LC_CTYPE="en_US.UTF-8"
      LC_NUMERIC="en_US.UTF-8"
      LC_TIME="en_US.UTF-8"
      LC_COLLATE=lt_LT
      LC_MONETARY="en_US.UTF-8"
      LC_MESSAGES="en_US.UTF-8"
      LC_PAPER="en_US.UTF-8"
      LC_NAME="en_US.UTF-8"
      LC_ADDRESS="en_US.UTF-8"
      LC_TELEPHONE="en_US.UTF-8"
      LC_MEASUREMENT="en_US.UTF-8"
      LC_IDENTIFICATION="en_US.UTF-8"
      LC_ALL=
      [root@sugarcube loc]# cat ia.txt
      ia
      ic
      ib
      ya
      yb
      yc
      [root@sugarcube loc]# sort ia.txt
      ia
      ib
      ic
      ya
      yb
      yc

      Looks like sort(1) did something, but not what i expected. I am not sure that i changed the locale correctly - i am not a Unix expert. Any help will be appreciated.

        Looks like sort(1) prints the lines in the same order as Perl's sort does. So I guess the problem is that the locale itself does not treat i and y the same. (I don't know if that's possible at all.)

        According to perldoc perllocale, the locale answers the question "which of these letters comes first". I don't think that the answer "neither i nor y comes first, but i comes first if it is the only difference in the whole word" is allowed.
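Whatever the locale allows, the rule described in this thread can be expressed in plain Perl as a two-level comparison: first compare with y and į folded into i, then break ties on the original letters. The sketch below is not a full Lithuanian collation. The helper names are hypothetical, the tie-break order i < į < y is an assumption (the thread only establishes that i comes before y), and letters such as č and ž are still compared by code point.

```perl
use strict;
use warnings;

# Primary key: fold y and į (U+012F) into i, so that the three
# letters collate as one.
sub lt_primary {
    my ($word) = @_;
    my $key = lc $word;
    $key =~ tr/y\x{12F}/ii/;
    return $key;
}

# Tie-break key, used only when the primary keys are equal.
# ASSUMED order: i < į < y.
sub lt_secondary {
    my ($word) = @_;
    my $key = lc $word;
    $key =~ tr/i\x{12F}y/123/;
    return $key;
}

# Schwartzian transform: compute both keys once per word.
sub lt_sort {
    return map  { $_->[0] }
           sort { $a->[1] cmp $b->[1] or $a->[2] cmp $b->[2] }
           map  { [ $_, lt_primary($_), lt_secondary($_) ] } @_;
}

print "$_\n" for lt_sort(qw(ia ib ic ya yb yc));
```

For the six test words this prints ia, ya, ib, yb, ic, yc, which is the order the original question expected.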

        What is the output if you add say

        ha
        ja

        to your test data set ?

      Lithuanian dictionary must be wrong as "i" and "y" are not the same. Have they documented the rule, is it self consistent and do the dictionary entries match? If the answer to any is no then randomise the listing for .arts sorts.

      What is critical for collation is that any character position is monotonic.

      LC_ALL=C (or at least LC_COLLATE=C) is the only legal value. Any other value is known to break strcoll(). Better to use safer strcmp().

      It should always compare by character numerical value. I.e. either byte value (US-ASCII, ISO-8859) or possibly UTF code point. The byte at a time is simpler and won't break existing applications.

      With EBCDIC 1047 it will never be alphabetical order, but it will be in order and able to be bsearched. UTF-8 byte at a time will also produce odd, but consistent results.

      Please get rid of i18n and l10n from at least the curses screen and command line. Other charsets are okay so long as they don't break sort, look, etc. As for GUIs with internal UTF-16 host endian buffers I don't care so long as they read and write UTF-8 to the system.

Re: Sorting according to locale collation
by Krambambuli (Deacon) on Apr 22, 2007 at 16:38 UTC
    A side note, just in case working things out with 'locale' would turn out to be impossible/too difficult/unportable.

    I found two modules on CPAN that both seem to offer alternatives worthwhile to be considered:

    No::Sort and Cz::Sort, which provide help for sorting in Norwegian and Czech, look just like an invitation to add Lt::Sort :), whereas Sort::ArbBiLex seems to be a more general solution for getting sort to behave as you want.
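Another module-based route, not mentioned in this thread, is Unicode::Collate::Locale, which ships with recent Perls and does not depend on the operating system's locales. The sketch below assumes that the installed version includes a Lithuanian ('lt') tailoring; getlocale reports which tailoring was actually loaded, so the result should be checked rather than taken for granted.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Ask for the Lithuanian tailoring; the module falls back to the
# default collation if it has no data for the requested locale.
my $collator = Unicode::Collate::Locale->new(locale => 'lt');
print "tailoring in use: ", $collator->getlocale, "\n";

my @sorted = $collator->sort(qw(ia ib ic ya yb yc));
print "$_\n" for @sorted;
```

Whether the output matches the dictionary order depends on the tailoring data in the installed module version.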

    Hope that helps.
