PerlMonks  

Spanish locale and name sorting

by Jorge_de_Burgos (Beadle)
on May 01, 2009 at 15:03 UTC (#761305=perlquestion)

Jorge_de_Burgos has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks. I wonder if any of you have tried to sort a list of names on a system with a Spanish locale. I find the output puzzling, and you can't blame it on my knowledge of Spanish, since that's my native tongue.

I mean, whereas this all-English program:

#!/usr/bin/perl
my @list = ('maceira', 'mac alister', 'mac loughlin', 'san esteban', 'sangregorio', 'san zoilo');
print "$_\n" for sort @list;

outputs this (as expected in the default English perl dialect):

mac alister
mac loughlin
maceira
san esteban
san zoilo
sangregorio

this supposedly Spanified program:

#!/usr/bin/perl
use locale;
my @list = ('maceira', 'mac alister', 'mac loughlin', 'san esteban', 'sangregorio', 'san zoilo');
print "$_\n" for sort @list;

outputs this non-Spanish order:

mac alister
maceira
mac loughlin
san esteban
sangregorio
san zoilo

Now please keep in mind that:

  1. I haven't used any high ASCII character in the list.
  2. My system locale is "es_AR.UTF-8".
  3. We Spanish-speaking people would expect the sorted list to come out with all "mac something" and all "san something" before "maceira" and "sangregorio" respectively, making these longer words come out after surnames whose first word is shorter ("mac" and "san").

Replies are listed 'Best First'.
Re: Spanish locale and name sorting
by spectre9 (Beadle) on May 01, 2009 at 19:08 UTC

    I'm no expert, but after reading perllocale it seems that checking the LANG environment variable might be relevant to your issue. On many Unix systems it is possible to override the language selection of a locale after the fact. Sometimes I've seen an LC_LANG variable as well, or NLS_LANG (I think Oracle uses NLS_ variables, for instance).

    Could you examine %ENV in the running Perl program and report back the LC_ variables, and anything that contains "lang"? These are usually locale related, if my memory serves.

    Here's some quick code that might work -- reply back with the output if you're still 'bugged' by the sort order and I'll be happy to take a closer look. I know Spanish (Madrid, Spain) somewhat well and understand what you're trying to achieve.

    foreach my $env (keys %ENV){
        if($env =~ /^LC_/i || $env =~ /lang/i){
            print "$env => $ENV{$env}\n";
        }
    }

    "Strictly speaking, there are no enlightened people, there is only enlightened activity." -- Shunryu Suzuki
      Right away to your suggestion, spectre9. This is my output from your routine:

      LANG => es_AR.UTF-8
      GDM_LANG => es_AR.UTF-8

      That's the correct locale for my system here in Buenos Aires. (I wouldn't like to change that, by the way.)
Re: Spanish locale and name sorting
by Anonymous Monk on May 02, 2009 at 01:30 UTC
    We Spanish-speaking people would expect the sorted list to come out with all "mac something" and all "san something" before "maceira" and "sangregorio" respectively, making these longer words come out after surnames whose first word is shorter ("mac" and "san").

    Maybe you expect the wrong thing?

    #!/usr/bin/perl --
    use strict;
    use warnings;

    localsort("C");
    localsort("Spanish - Argentina");

    sub localsort {
        use POSIX qw(setlocale LC_CTYPE);
        my( $wantlocale ) = @_;
        my $curlocale = setlocale(LC_CTYPE);
        my $setlocale = setlocale(LC_CTYPE,$wantlocale);
        if( not $setlocale ){
            print "Couldn't switch locale from ($curlocale) to ($wantlocale).\n";
        } else {
            print "Current locale is ($setlocale).\n";
            my @list = ('maceira', 'mac alister', 'mac loughlin', 'san esteban', 'sangregorio', 'san zoilo');
            my @yes = do { use locale; sort @list; };
            my @no = do { no locale; sort @list; };
            printf "    %-20s %-20s %-20s\n", qw[unsorted use-locale no-locale ];
            print '- ' x 33,"\n";
            for my $i( 0 .. $#list ){
                printf "%3d %-20s %-20s %-20s\n", $i, $list[$i], $yes[$i], $no[$i];
            }
        }
        print '- ' x 33,"\n";
        setlocale(LC_CTYPE,$curlocale);#restore
    }
    __END__
    Current locale is (C).
        unsorted             use-locale           no-locale
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      0 maceira              mac alister          mac alister
      1 mac alister          mac loughlin         mac loughlin
      2 mac loughlin         maceira              maceira
      3 san esteban          san esteban          san esteban
      4 sangregorio          san zoilo            san zoilo
      5 san zoilo            sangregorio          sangregorio
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Current locale is (Spanish_Spain.1252).
        unsorted             use-locale           no-locale
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      0 maceira              mac alister          mac alister
      1 mac alister          mac loughlin         mac loughlin
      2 mac loughlin         maceira              maceira
      3 san esteban          san esteban          san esteban
      4 sangregorio          san zoilo            san zoilo
      5 san zoilo            sangregorio          sangregorio
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

      Maybe you expect the wrong thing?

      Why would you say that? The output of your code on your system shows that our expectations are right -- if you use some 1252 (I think that means Windows) locale instead of UTF-8.

      This is the output of your program on my system.

      Current locale is (C).
          unsorted             use-locale           no-locale
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
        0 maceira              mac alister          mac alister
        1 mac alister          maceira              mac loughlin
        2 mac loughlin         mac loughlin         maceira
        3 san esteban          san esteban          san esteban
        4 sangregorio          sangregorio          san zoilo
        5 san zoilo            san zoilo            sangregorio
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Couldn't switch locale from (es_AR.UTF-8) to (Spanish - Argentina).
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

      I am looking for a solution to a problem that arises under Spanish UTF-8 locales, where the sorting order treats the space character as nonexistent.

        Down-and-dirty hacks are available, of course. For everyday use I have come up with this:

        #!/usr/bin/perl
        use locale;

        my @list = ('maceira', 'mac alister', 'mac loughlin', 'san esteban', 'sangregorio', 'san zoilo');

        # Compare copies of the names in which every space has been replaced
        # by "A", so the position of a space is no longer skipped when collating.
        sub keeping_spaces {
            my $aa = $a;
            my $bb = $b;
            for ($aa) { tr/ /A/; }
            for ($bb) { tr/ /A/; }
            return $aa cmp $bb;
        }

        print "$_\n" for sort keeping_spaces @list;

        Which outputs what we would expect:

        mac alister
        mac loughlin
        maceira
        san esteban
        san zoilo
        sangregorio

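An alternative to the tr/ /A/ hack above, offered only as a sketch: the Unicode::Collate module lets the caller decide whether spaces count when comparing strings. Its `variable => 'non-ignorable'` option gives the space its own weight, below the letters. Nobody in this thread proposed it, the helper name `sort_spaces_first` is made up for illustration, and the collator uses the default Unicode ordering rather than a Spanish tailoring, so treat it as a starting point rather than a drop-in answer for es_AR.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Collate;

# Hypothetical helper: sort names so that a space sorts before any letter,
# independently of the system locale.
sub sort_spaces_first {
    my @names = @_;
    # 'non-ignorable' makes spaces and punctuation significant at the
    # primary comparison level instead of being skipped.
    my $collator = Unicode::Collate->new(variable => 'non-ignorable');
    return $collator->sort(@names);
}

my @list = ('maceira', 'mac alister', 'mac loughlin',
            'san esteban', 'sangregorio', 'san zoilo');
print "$_\n" for sort_spaces_first(@list);
```

With this option the "mac ..." and "san ..." names should come out ahead of "maceira" and "sangregorio", which is the order asked for in the original question.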
        Why would you say that?
        Because I don't get the results you expect :) But then I don't have es_AR.UTF-8. Your results column for use-locale seems to ignore setlocale (because it doesn't match mine), but your no-locale column matches mine. I suspect a bug in locale. Can you try again with "es_AR.UTF-8" instead of "Spanish - Argentina"?
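
The retry suggested above could be sketched as follows. One detail may matter here: per perllocale, string comparison and sort under `use locale` follow the LC_COLLATE category, while the script earlier in the thread switches only LC_CTYPE. That might explain why the use-locale column did not change with setlocale, though this is a guess and not something confirmed in the thread. The helper name `sort_under_locale` is hypothetical, and whether "es_AR.UTF-8" is installed depends on the system.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(setlocale LC_COLLATE);

# Hypothetical helper: sort @names under the collation rules of locale $want,
# then restore the previous LC_COLLATE setting.
# Returns an empty list if the requested locale cannot be selected.
sub sort_under_locale {
    my ($want, @names) = @_;
    my $previous = setlocale(LC_COLLATE);         # query the current setting
    return unless setlocale(LC_COLLATE, $want);   # locale not available
    my @sorted = do { use locale; sort @names };
    setlocale(LC_COLLATE, $previous);             # restore
    return @sorted;
}

my @list = ('maceira', 'mac alister', 'mac loughlin',
            'san esteban', 'sangregorio', 'san zoilo');

for my $name ('C', 'es_AR.UTF-8') {
    my @sorted = sort_under_locale($name, @list);
    if (@sorted) {
        print "$name: ", join(', ', @sorted), "\n";
    } else {
        print "$name: locale not available on this system\n";
    }
}
```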
