PerlMonks  

fast, flexible, stable sort

by tye (Sage)
on Aug 27, 2003 at 19:56 UTC ( [id://287149]=CUFP )

I was building sort keys such that a simple strcmp()/memcmp() could be used as the comparison function in a complex sort even before I found Perl (which was just before Perl 4 was released).

It turns out that this technique is very handy in Perl, where it usually makes for the fastest possible sort because you never specify a comparison function and sort just uses its built-in string comparison:

my @sorted= map { RESTORE($_) } sort map { XFORM($_) } @list;

The main problem with the technique is that you need to build the key (XFORM) in such a way that the original record can be reconstructed from it (RESTORE).
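To make the XFORM/RESTORE pattern concrete, here is a minimal sketch. The record format ("name age") and the bodies of XFORM and RESTORE are assumptions made up for illustration; they are not from the original post. The age is packed as a fixed-width big-endian integer so that plain string comparison orders it numerically, and the name is kept in the key so the record can be rebuilt.

```perl
use strict;
use warnings;

# Hypothetical records of the form "name age".
# Goal: sort by age (numerically), then by name.
my @list = ("carol 9", "alice 30", "bob 9", "dave 100");

# XFORM: 4-byte big-endian age followed by the name.
# The key alone must hold everything needed to rebuild the record.
sub XFORM {
    my ($name, $age) = split ' ', $_[0];
    return pack("N", $age) . $name;
}

# RESTORE: undo XFORM to recover the original record.
sub RESTORE {
    my ($age, $name) = unpack "N a*", $_[0];
    return "$name $age";
}

my @sorted = map { RESTORE($_) } sort map { XFORM($_) } @list;
print "$_\n" for @sorted;
```

Note how much of the work goes into making RESTORE possible; with a record that has many fields, the key would have to carry all of them.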

That often makes the technique awkward to use, so it ends up as a specialized trick rather than something you'd feel you could use all the time.

I'd previously used techniques that avoid this RESTORE() step but they still weren't general enough or elegant enough to be the one way to sort for me.

I finally came up with a technique that I suspect will be "the one way to sort" for me (below).

It also has the feature that it produces a "stable" sort, that is, records with identical keys are kept in their original order relative to each other.

sub XFORM {
    # Extract the sort key from $_[0] and return it.
    # This will often be written in-line
    # rather than as a real subroutine.
}

my @sorted= @list[
    map { unpack "N", substr($_,-4) }
    sort
    map { XFORM($list[$_]) . pack "N", $_ }
    0..$#list
];
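As a worked example of the index-appending sort, here is a sketch that sorts words by length. The word list and the choice of key (the length, packed as a fixed-width 4-byte integer) are assumptions for illustration. Because the key is fixed-length, the appended index only ever breaks ties, which is what makes the sort stable.

```perl
use strict;
use warnings;

my @list = qw(pear fig apple kiwi plum date banana);

# Assumed example key: the word's length as a 4-byte big-endian integer.
sub XFORM { return pack "N", length $_[0] }

my @sorted = @list[
    map  { unpack "N", substr($_, -4) }       # recover the original index
    sort                                      # plain string sort, no comparator
    map  { XFORM($list[$_]) . pack "N", $_ }  # key followed by packed index
    0 .. $#list
];

print "@sorted\n";   # prints "fig pear kiwi plum date apple banana"
```

The four-letter words come out in their original relative order (pear, kiwi, plum, date), and no RESTORE step was needed because the records are fetched by index.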

If you want to sort parallel lists, then you'd keep the sorted list of indices around:

my @index=
    map { unpack "N", substr($_,-4) }
    sort
    map { XFORM($name[$_]) . pack "N", $_ }
    0..$#name;
@name= @name[@index];
@data= @data[@index];
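A small sketch of the parallel-list case, with made-up sample data. The contents of @name and @data and the lowercasing XFORM are assumptions for illustration. The keys here are short names with no "\0" bytes, so the variable key length does not cause trouble.

```perl
use strict;
use warnings;

# Two parallel lists: $data[$i] belongs to $name[$i].
my @name = qw(carol alice bob);
my @data = (33, 41, 27);

# Assumed example key: the lowercased name.
sub XFORM { return lc $_[0] }

# Sort once, keeping the permutation of indices.
my @index =
    map  { unpack "N", substr($_, -4) }
    sort
    map  { XFORM($name[$_]) . pack "N", $_ }
    0 .. $#name;

# Apply the same permutation to every parallel list.
@name = @name[@index];
@data = @data[@index];

print "$name[$_] => $data[$_]\n" for 0 .. $#name;
```

After the two slices, each name is still lined up with its own data value.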

Re: fast, flexible, stable sort
by bart (Canon) on Feb 12, 2004 at 20:46 UTC
    It doesn't work. Well, it does work, but only if the XFORM routine returns a string of the same length for every item. Otherwise, the sorting could turn out wrong.

    I've made a very contrived example containing lots of null bytes, but with a sufficiently large number of array items you can get the same effect with other bytes as well.

    Let me demonstrate the effect by sorting a number of variable length strings as is, and with the packed index appended. As is shown, the sorted results are not in the same order at all.

    use Data::Dumper;
    $Data::Dumper::Useqq = 1;
    my @data = map "\0" x $_, 0 .. 5;
    print Dumper
        [ sort @data ],
        [ sort map pack("a*N", $data[$_], $_), 0 .. $#data ];

    Result:

    $VAR1 = [
        "",
        "\0",
        "\0\0",
        "\0\0\0",
        "\0\0\0\0",
        "\0\0\0\0\0"
    ];
    $VAR2 = [
        "\0\0\0\0",
        "\0\0\0\0\0\0\0\0\5",
        "\0\0\0\0\0\0\0\4",
        "\0\0\0\0\0\0\3",
        "\0\0\0\0\0\2",
        "\0\0\0\0\1"
    ];
      "It doesn't work."

      Really?

      "Well, it does work,"

      Oh...

      "but only if the XFORM routine returns a string of the same length for every item."

      "only"? (:

      Actually, it always works if the fields are fixed-length. It also always works if the fields don't contain "\0" characters and you don't have more than 16 million records (so the first byte of the packed index is always "\0"). Those two cases cover almost every sort I do.

      It often works when these guarantees don't apply.

      It is also fairly easy to fix it so that it always works, even if you have fields with lots of trailing null bytes. For example, escaping each field with s#([\00\01])#\01$1#g and then combining them with join "\0", is enough.
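One way that fix could look, as a sketch: the helper name escape_key and the test data (keys made of nothing but null bytes, in reverse order) are assumptions for illustration. Each "\0" or "\1" byte in the key is prefixed with "\1", so an escaped key never contains a bare "\0", and "\0" can then safely separate the key from the packed index.

```perl
use strict;
use warnings;

# Problem case: keys consisting only of null bytes, longest first.
my @list = map { "\0" x $_ } reverse 0 .. 5;

# Hypothetical helper: prefix every "\0" and "\1" byte with "\1".
sub escape_key {
    my ($key) = @_;
    $key =~ s#([\00\01])#\01$1#g;
    return $key;
}

my @sorted = @list[
    map  { unpack "N", substr($_, -4) }
    sort
    map  { join "\0", escape_key($list[$_]), pack("N", $_) }
    0 .. $#list
];

# Print the key lengths in sorted order.
print join(",", map { length } @sorted), "\n";   # prints "0,1,2,3,4,5"
```

The escaping preserves the relative order of the original keys, because the escaped forms of "\0" and "\1" still compare in that order and both sort below every other byte that can follow a common prefix.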

      I noticed the potential for this problem quite a while ago and hoped to address it in the module based on this idea, but working on such a module hasn't made it to the top of my list yet. Thanks for motivating me to address the problem here. :)

      - tye        
