PerlMonks  

Sorting (GRT) and locale issues

by chrism01 (Friar)
on Nov 15, 2007 at 04:35 UTC (#650914=perlquestion)
chrism01 has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I've been trying to learn a bit more about sorting, but was having problems with 'A brief tutorial on Perl's native sorting facilities' by BrowserUk.
The description and examples looked just fine and I seemed to understand them, but I couldn't reproduce the results...
After some experimentation and inspiration I discovered the following:

From perldocs:

#!/usr/bin/perl -w
use strict;

use locale;
print +(sort grep /\w/, map { chr } 0..255), "\n";

no locale;
print +(sort grep /\w/, map { chr } 0..255), "\n";
produced this:
_0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz

My test prog

#!/usr/bin/perl -w
use strict;

my ( @arr1 );

# GRT
# Sort by num, then letter
use locale;
print "locale\n";
@arr1 = map{
    unpack 'x[NA1]A*', $_
} sort map{
    pack 'NA1 A*', substr( $_, 1 ), substr( $_, 0, 1 ), $_
} qw[ A473 B437 B659 C659 C123 D123 D222 E222 E001 A001 ];
print join("\n", @arr1)."\n\n";

no locale;
print "no locale\n";
@arr1 = map{
    unpack 'x[NA1]A*', $_
} sort map{
    pack 'NA1 A*', substr( $_, 1 ), substr( $_, 0, 1 ), $_
} qw[ A473 B437 B659 C659 C123 D123 D222 E222 E001 A001 ];
print join("\n", @arr1)."\n";
Outputs:

locale
A001
A473
B437
B659
C123
C659
D123
D222
E001
E222

no locale
A001
E001
C123
D123
D222
E222
B437
A473
B659
C659

It seems to me that regardless of locale/no locale, it should sort nums before letters, as per BrowserUK's tutorial.
FYI:
perl v5.8.8, Linux openSUSE 10.2 (i586) 2.6.18.2-34-bigsmp

Also: env|grep -i lc has no output, set|grep -i lc gives:

MAILCHECK=60
MODLIST=($(LC_ALL=C $YAST -l| grep '^[a-z]' | grep -v "Available"));
done <<(LC_ALL=C $YAST $mod $prev help 2>&1);
done <<(LC_ALL=C $YAST $mod help 2>&1);
test_lc ()
for lc in LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT LC_IDENTIFICATION LC_ALL;
eval val="\$$lc";
unset lc val;
unset lc val;

Cheers
Chris

20071129 Janitored by Corion: Localized link

Re: Sorting (GRT) and locale issues
by ikegami (Pope) on Nov 15, 2007 at 04:46 UTC

    It seems to me that regardless of locale/no locale, it should sort nums before letters, as per BrowserUK's tutorial.

    What numbers? You don't provide any numbers to sort. You first convert the numbers into strings of 4 characters. The resulting characters could be anything, including letters. sort is doing a lexical sort on them as requested, and applies locale rules as requested even though it makes no sense to do so.

    In this specific case, locale could very well affect how "µ" and "Ù" sort.

    my $NUL     = chr(0x00);
    my $SOH     = chr(0x01);
    my $STX     = chr(0x02);
    my $undef93 = chr(0x93);
    
    001: "${NUL}${NUL}${NUL}${SOH}"
    123: "${NUL}${NUL}${NUL}{"
    222: "${NUL}${NUL}${NUL}Þ"
    437: "${NUL}${NUL}${SOH}µ"
    473: "${NUL}${NUL}${SOH}Ù"
    659: "${NUL}${NUL}${STX}${undef93}"
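
To make the table above easier to check, here is a minimal sketch that prints the four bytes pack 'N' produces for each numeric field, in hex. The hash name %hex_of is only an illustration and is not part of the original posts.

```perl
use strict;
use warnings;

# Map each number to the hex dump of its 4-byte big-endian ('N') encoding.
# %hex_of is a hypothetical name used only for this illustration.
my %hex_of;
for my $n (1, 123, 222, 437, 473, 659) {
    my @bytes = unpack('C4', pack('N', $n));
    $hex_of{$n} = join(' ', map { sprintf('%02X', $_) } @bytes);
    printf("%03d: %s\n", $n, $hex_of{$n});
}
```

Any byte of 0x80 or above (DE, B5, D9, 93 here) falls outside plain ASCII, which is where a locale's collation rules are most likely to disagree with raw byte order.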
    

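    Since you want part of the string sorted numerically and part of the string sorted lexically (using the locale), you can't rely on a single packed key; you'll have to limit yourself to the ST (Schwartzian Transform), which compares each field separately.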

    @arr =
        map{ $_->[0] }
        sort { $a->[1] <=> $b->[1] || $a->[2] cmp $b->[2] }
        map{ [ $_, substr( $_, 1 ), substr( $_, 0, 1 ) ] }
        qw[ A473 B437 B659 C659 C123 D123 D222 E222 E001 A001 ];
Re: Sorting (GRT) and locale issues
by chrism01 (Friar) on Nov 15, 2007 at 05:11 UTC
    Now I'm confused. According to BrowserUK:

    That way is to use pack to convert the numeric fields into binary values that will sort correctly using a string comparison function. It is convenient that binary encode integers (NOTE:In 'network' format only. That is 'N'&'n' *NOT* 'V'&'v') will sort correctly using a string comparison function.
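
The point about 'N' versus 'V' in the quote can be checked with a small sketch. It assumes no locale pragma is in effect, so sort compares raw bytes; the sample numbers and the names @nums, @via_N and @via_V are made up for illustration.

```perl
use strict;
use warnings;

# Sample values chosen so that byte order matters (illustrative only).
my @nums = (1, 255, 256, 65536, 437, 473);

# 'N' is big-endian: most significant byte first, so a bytewise
# string sort of the packed values matches numeric order.
my @via_N = map { unpack('N', $_) } sort map { pack('N', $_) } @nums;

# 'V' is little-endian: least significant byte first, so a bytewise
# string sort of the packed values does not match numeric order.
my @via_V = map { unpack('V', $_) } sort map { pack('V', $_) } @nums;

print "N: @via_N\n";   # N: 1 255 256 437 473 65536
print "V: @via_V\n";   # V: 65536 256 1 437 473 255
```

This only holds when the comparison really is bytewise, which is exactly what a collating locale can change.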

    Are you saying that although I always use locale; to ensure a prog behaves correctly in the local locale (eg sorting), there's something special about GRT that means I have to turn it off?

    Incidentally, his results are sorted by num, then letter.

    Chris

      According to your locale, a byte with value 97 should be sorted before a byte with value 66.
      That's fine if you're sorting iso-latin-1 text. It means "a" will be sorted earlier than "B".
      That's not fine if you're sorting numbers in their native format.

      @a = map unpack('N', $_), sort map pack('N', $_), 65, 66, 67, 97, 98, 99;

      => 97, 65, 98, 66, 99, 67 using "use locale" on your system.
         a   A   b   B   c   C

      => 65, 66, 67, 97, 98, 99 using "no locale".
         A   B   C   a   b   c

      So yeah, you'd have to turn it off when it's comparing bytes representing numbers. But since you want it on for when it's comparing bytes representing text, you have a problem.
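
One partial way around this, sketched under the assumption that plain byte order is acceptable for the letter field as well, relies on the locale pragma being lexically scoped: the GRT sort can sit in a block with no locale while the rest of the program keeps use locale. The names @data and @sorted are illustrative. If the letters must collate by locale, this does not help and the ST remains the option.

```perl
use strict;
use warnings;
use locale;    # in effect for the rest of the file, except where disabled

my @data = qw[ A473 B437 B659 C659 C123 D123 D222 E222 E001 A001 ];
my @sorted;

{
    no locale;    # bytewise comparison inside this block only
    @sorted =
        map  { unpack('x[NA1]A*', $_) }    # skip the 5-byte key, keep the original
        sort                               # default string sort on the packed keys
        map  { pack('NA1 A*', substr($_, 1), substr($_, 0, 1), $_) }
        @data;
}

print join("\n", @sorted), "\n";
```

With the comparison forced to bytes, the result is ordered by number, then by letter, matching the "no locale" output in the original question.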

Re: Sorting (GRT) and locale issues
by chrism01 (Friar) on Nov 15, 2007 at 06:14 UTC
    I was going to ask if it looks like there's something wrong with my locale: it should be Australia (ideally), but probably defaults to US. In either case

    substr( $_, 1 ), substr( $_, 0, 1 ), $_ } on A437 should give
    437 A A437
    ie the nums are at the front regardless. The top code just shows that lower/uppercase is mixed or upper then lower, but nums are always first...

    Sorry if it seems I'm being obtuse; I REALLY want to understand this.

    Chris

      You don't pass '437 A A437', you pass the result of pack('N A1 A*', '437', 'A', 'A437').
      use Data::Dumper qw( Dumper );

      for (qw[ A066 A067 A098 A099 ]) {
          my $coded = pack('N A1 A*', substr($_, 1), substr($_, 0, 1), $_);
          local $Data::Dumper::Useqq = 1;
          local $Data::Dumper::Terse = 1;
          print("$_: ", Dumper($coded));
      }

      A066: "\0\0\0BAA066"
      A067: "\0\0\0CAA067"
      A098: "\0\0\0bAA098"
      A099: "\0\0\0cAA099"

      PS - Why do you keep replying to your node?

Re: Sorting (GRT) and locale issues
by salva (Monsignor) on Nov 15, 2007 at 09:22 UTC
    Instead of the GRT, you can use the module Sort::Key::Multi that is faster, much easier to use and supports locale-aware sorting:
    use Sort::Key::Multi qw(is_keysort il_keysort);

    # 'is_' and 'il_' indicate the sorting key types,
    # in that case:
    #   is = integer + string
    #   il = integer + locale_string

    my @sorted = is_keysort { (/(.)(.*)/)[1,0] } @data;
    my @locale_sorted = il_keysort { (/(.)(.*)/)[1,0] } @data;

Node Type: perlquestion [id://650914]
Approved by randyk