http://www.perlmonks.org?node_id=483462

zzspectrez has asked for the wisdom of the Perl Monks concerning the following question:

So I need to handle sorting some strings that contain numbers. Instead of rolling my own, I did a super search to see what others have done and found this example by tye. Just a few lines of code and does exactly what I need. Just could not understand what the heck it was doing at first glance. So I decided to deconstruct it and ask you all if I got it right.

Lets analyze the code:

my @list = <DATA>;

Read in the data to the unsorted list. Ok, this part is easy. :)

my @sorted = @list[ map { unpack "N", substr($_,-4) } sort map { my $key = $list[$_]; $key =~ s[(\d+)][ pack "N", $1 ]ge; $key . pack "N", $_ } 0..$#list ];

Ok this is not as simple to follow.. Lets break it down.

my @sorted = @list[ ... ];

Ok, we are assigning an array slice to the @sorted array. So if the code between the brackets returned "1,5,2,3" then we the sorted array would be assigned $list[1], $list[5], $list[2], $list[3] in that order. The code inside the brackets returns the order which is assigned using an array slice.

map { my $key = $list[$_]; $key =~ s[(\d+)][ pack "N", $1 ]ge; $key . pack "N", $_ } 0..$#list

What the??? Ok.. lets see.. First lets look at what the map is doing:

map { ... } 0..$#list

Ok, The map is just itterating over the number 0 .. to the subscript of the last element in the array @list. So if there are 5 elements of in the array map will be getting 0 .. 4. So why are we not just doing for $a = ( 0 .. $#list)? That is because we want map to return a modified list to be sorted.

my $key = $list[$_] $key =~ s[(\d+)][ pack "N", $1 ]ge;

Now what is going on inside the map? Here we are setting $key to a value from the @list. We then use a greedy regex to globaly replace all numbers with the number packed as a 32 bit Big Endian Number. So the string "Happy4" would be converted to "Happy" . chr(00) . chr(00). chr(00) . chr(04). The modifier to the regex "e" evaluates the right side of the substitution as perl code.


$key . pack "N", $_

Now we are taking the converted string and adding that elements position in the unsorted array as a packed 32 bit Big Endian Number and returning this value from the map.

This results in a list of the unsorted array with all number converted to 32 bit Big Endian Format and the position tacked on to the last 4 bytes. This list is returned to the builtin sort function which can then sort the resulting array asciibetically. This sorts how we think it should.

Given the strings "apple1", "apple2", "apple3","apple20". A normal sort would return the results in the order:

  1. apple1
  2. apple2
  3. apple20
  4. apple3

With the conversion from this routine, the strings are converted to:

  1. "apple" . chr(00) . chr(00) . chr(00) . chr(1)
  2. "apple" . chr(00) . chr(00) . chr(00) . chr(2)
  3. "apple" . chr(00) . chr(00) . chr(00) . chr(3)
  4. "apple" . chr(00) . chr(00) . chr(00) . chr(20)

Because of this conversion, the strings sort properly.

Ok now for the last bit of code.

map { unpack "N", substr($_,-4) }

What this does is takes the last 4 chars of each string which is the the position in the unsorted array packed in Big Endian Format and unpacks it and returns it to be used as an array slice.

So given the following

@list = ( 'apple01', 'apple10', 'apple20', 'apple3', 'apple4');

all the following sorts and maps result in a statement that would simplify as

 my @sorted = @list[ 0, 3, 4, 1, 2];

Pretty cool. ++tye!!

So only limitations I can see, would it could only sort a list up to 4,294,967,295 strings.. Not to bad.

Anything you think Im missing.. Or any reason not to use this for sorting these types of strings? Performance issues? Bugs? etc.. You still using it tye?? :)

Update
Also found this module: Sort::Naturally. Which looking at the source is quite a bit of code.

zzSPECTREz

Replies are listed 'Best First'.
Re: [ Natural Sort ]: Sorting a string with numbers (limitations)
by tye (Sage) on Aug 13, 2005 at 00:05 UTC

    It doesn't "properly" sort negative numbers, non-fixed decimal values, nor integers larger than 2^32-1. It is very fast as far as these types of things go. Yes, I still use this.

    Nice deconstruction. (:

    - tye        

Re: [ Natural Sort ]: Sorting a string with numbers
by hv (Prior) on Aug 13, 2005 at 00:42 UTC

    One small additional facet: because the (packed) index is appended to the key, this will gave you a stable sort even if the underlying sort is unstable.

    See sort for more details about what that means, and when it might be relevant.

    Hugo

      Usually... (as bart pointed out in fast, flexible, stable sort)

      If you have fewer than 16.7 million records and none of the records contain "\0" characters, then it always works, for example.

      - tye        

Re: [ Natural Sort ]: Sorting a string with numbers
by 5mi11er (Deacon) on Aug 15, 2005 at 15:37 UTC
    Good write up zzSPECTREz.

    Tooting my own horn, a similar write up of this type of technique can also be found at this tutorial which also includes several links to other good sorting information found within the Monestary.

    -Scott

Re: [ Natural Sort ]: Sorting a string with numbers
by blueflashlight (Pilgrim) on Sep 15, 2005 at 18:31 UTC
    I love this site! I was writing my program, had a need for a routine that would do natural sorting, came right to the Monestary and Super Searched for "natural sort", and here I found this code, which worked almost perfectly!

    Here is the one problem I have with it, and though I've read the write-up and follow-ups, and I think I get how it works, I can't seem to figure out a solution. The problem is this:

    If the unsorted members of the list are something like ...
    @list = ( 'apple1', 'apple', 'apple20' );

    ... then this routine will result in a sorted order of ...
    apple1
    apple20
    apple

    ... but what I would expect (and want) would be ...
    apple
    apple1
    apple20

    I thought about going through my list beforehand, and appending something like "00000" to each member that didn't end in a numeric value, and then stripping the "00000" off after the sort, but that seems like a really stupid thing to do.

    Can someone help me out please? Thanks!
      Did you do something wrong? I got what you expected:
      #!/your/perl/here use strict; use warnings; my @list = qw(apple1 apple apple20); my @sorted = @list[ map { unpack "N", substr($_,-4) } sort map { my $key = $list[$_]; $key =~ s[(\d+)][ pack "N", $1 ]ge; $key . pack "N", $_ } 0..$#list ]; print "@sorted\n";
      gives:
      apple apple1 apple20
      Post your code and let's check it.

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

        The whole program is a bit complex and does a bunch of stuff that isn't relavent to this issue, but I've cut out everything but this section that shows the problem I'm having.
        #!/usr/bin/perl -w use strict; my @h = ( qw( psapp1 psapp psapp2 psepapp1 psepapp psepapp2 psimgprod pspappuat pswebapp2 pswebapp ) ); my @h_sorted = @h[ map { unpack "N", substr($_,-4) } sort map { my $key = $h[$_]; $key =~ s[(\d+)][ pack "N", $1 ]ge; $key . pack "N", $_ } 0..$#h ]; foreach (@h_sorted) { print "$_\n"; };

        When I run this program, with my Perl (This is perl, v5.8.6 built for darwin-thread-multi-2level), I get:
        psapp psapp1 psapp2 psepapp1 psepapp2 psepapp psimgprod pspappuat pswebapp2 pswebapp
        The "psapp" items sort properly, but the "psepapp" and "pswebapp" ones show the problem I'm asking about. Thanks! -s-