Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

Understanding transformation sorts (ST, GRT), the details

by 5mi11er (Deacon)
on May 17, 2005 at 20:27 UTC ( #457993=perltutorial: print w/ replies, xml ) Need Help??

Let me start by saying that while I understood in theory what was happening in the Schwartzian and Guttman Rosler Transformations, I had a difficult time understanding the the details of how the magic "map sort map" incantations actually worked.

I had read the requisite articles scattered around the Monastery, especially Advanced Sorting - GRT - Guttman Rosler Transform and Choosing the right sort (Or: WUDsamatter?) and fast, flexible, stable sort. From tye's reputation, and his description in this last node, I'd figured he touched upon something brilliant, but I certainly didn't "get it".

I copied the snippet, filled in the XFORM function, and ran the code manually through the debugger printing what various parts of the sort transformation did, and finally "got it".

Now, I'm pretty intelligent, so I figured that if I hadn't gotten it after reading bunches of the "sort" technique articles around the catacombs, that it would probably help a number of others if I could explain the magic a bit so they might understand what was going on with these transformation techniques.

The basic theory

We take the data we want to sort some particular way, and figure out how to mangle that data such that a standard Perl ascii sort will do the job for us, then unmangle the data such that we get back our original information, except now in sorted order. If you understand this, and have inklings about how to transform the data, but don't understand the details about how it all gets put together, this node is intended specifically for you.

How this Transformation works

As is true for all transformations, it helps a great deal to first understand that everything within this construct: (directly from Tye's node above)
my @sorted= @list[ map { unpack "N", substr($_,-4) } sort map { XFORM($list[$_]) . pack "N", $_ } 0..$#list ];
is layed out in reverse order from which the execution is done. So, start at the end, you'll see the 0..$#list, which is just a list of all the indexes within the array @list. The indexes are what we're iterating over, and $_ is set to each index in turn.

Next we have :map { XFORM($list[$_]) . pack "N", $_ } which simply calls XFORM(), which is your transformation of the data into a sortable string, and then appends to that string an encoding of the index value packed (in an architecturally independant format) into 4 bytes. That's the pack "N", $_ portion. The map statement just says to do that for every index.

The resulting array of transformed, stringified data with the index values packed and glued to the end of each string is then passed to the sort routine. Assuming your transformation works as it is supposed to, the correctly sorted array of transformed, stringified data is then passed to the first map statement. This map just grabs the last 4 bytes of each string, which must be our encoded index value from the original array, and decodes it back into a usable index value; and it does this for all the entries. To be completely clear, the transformed, stringified stuff is thrown away at this point; only the index values are left.

What we have at this point is an array of index values from the original array, but these index values are in the desired sorted order! At this point we've simplified the original transformation into this: my @sorted=@list[@array_of_sorted_indexes]; Execute that and we finally get @sorted which has the original array's information, but in sorted order, and @list is unmodified.

If you're fuzzy about the @list[@array] notation, this is perfectly legal. Try this in a debugging session:
print @{[a..z]}[0,4,8,14,20]; and you'll get the list 'a,e,i,o,u' back.

If all that was still not clear enough and you need a concrete example with the sort magic broken out into individual statements, the following sorts dotted IP address information:

#!/usr/bin/perl use strict; use warnings; sub XFORM { # Extract the sort key from $_[0] and return it. # This will often be written in-line # rather than as a real subroutine. return sprintf("%012.0f", ip2val($_[0])); } my ($addstr,$value,@addstrary,@addvalary,@sorted,$DEBUG); { no warnings 'numeric'; if($ARGV[0] > 0) { $DEBUG = shift @ARGV; } else { $DEBUG = 0; } } while (<>) { s/(^\s+)//; #remove indentation s/\s+[\n\r]+//g; #remove all eol spaces/tabs & ctrl chars if(/^$/) { next; } #ignore blank lines if(/^#/) { next; } #ignore comments ($addstr) = Extract_IP_Address($_); if($addstr ne '') { #Use this to check the XFORM routine #printf "%15.15s %s\n",$addstr,XFORM($addstr); push @addstrary, $addstr; } } if($DEBUG == 0) { @sorted= @addstrary[ map { unpack "N", substr($_,-4) } sort map { XFORM($addstrary[$_]) . pack "N", $_ } 0..$#addstrary ]; } else { #split up Tye's GRT sort above into separate pieces my (@arytosort, @sorted_stuff, @index_info); @arytosort = map { XFORM($addstrary[$_]) . pack "N", $_ } 0..$#add +strary; @sorted_stuff = sort @arytosort; @index_info = map { unpack "N", substr($_,-4) } @sorted_stuff; @sorted = @addstrary[@index_info]; } foreach my $str (@sorted) { print "$str\n"; } #---- Misc. IP address routines ---- ########################################### # convert 4 byte integer value to an IP address ########################################### sub val2ip { my ($a,$b,$c,$d)=unpack("CCCC",pack("N",$_[0])); return "$a.$b.$c.$d"; } ########################################### # convert IP address to a 4 byte integer value ########################################### sub ip2val { if ( $_[0]=~ /([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)/o ){ return (unpack("N",pack("CCCC",$1,$2,$3,$4)) ); } else { return 0; } } ##################################################################### # Find the first dotted ip address in any string, puts it in $1, # and put the individual octets in $2, $3, $4 & $5. Figure out the # long integer representation of that address, and Return everything # in a list. # # NOTE: it doesn't check to make sure each octet is between 0-254, # nor if the numbers between the dots are 1 to 3 characters in lengt +h. # # Example: given "the IP address of www.xxx.com is 206.24.105.142" # return the list ('206.24.105.142',206,24,105,142,3457706382) ##################################################################### sub Extract_IP_Address { if ( $_[0]=~ /(([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+))/o ){ return ($1,$2,$3,$4,$5, unpack("N",pack("CCCC",$2,$3,$4,$5)) ) +; } else { return (); } }
It'll read from an input file or Standard In, and grab the first IP address on each line. If you want to manually run through the split up sort steps within a debugger, the script's first argument should be any positive number. Or just set $DEBUG yourself while in the debugger...

-Scott

Comment on Understanding transformation sorts (ST, GRT), the details
Select or Download Code
Re: Understanding transformation sorts (ST, GRT), the details
by merlyn (Sage) on May 17, 2005 at 22:32 UTC
    As I was reading through the article, I was anxiously awaiting the appearance of the Schwartzian Transform.

    But alas, even though you mention it in the title, and once in the body, you never got to it. Just the GRT, and Tye's particular trick.

    I still think that the ST is always clearer than the GRT, and that jumping to the GRT without careful consideration is a case of premature optimization—except for those rare cases where the work required to wiggle your data into a single sortable string isn't all that unobvious.

    By contrast, here's "sort a list of IP addresses" using the ST:

    my @sorted = map $_->[0], sort { $a->[1] <=> $b->[1] or $a->[2] <=> $b->[2] or $a->[3] <=> $b->[3] or $a->[4] <=> $b->[4] } map [$_, split /\./], @inputs;
    Quite a bit shorter than your code, and far less trickery and magic.

    So, it's too bad your title is a letdown. Good analysis, otherwise.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      Ah, yes, you're right. When I started, I had intended to better show that understanding this one instance, which is a specific form of the GRT, which is in turn a specialization of the Schwartzian Transform, it's fairly easy to generalize the understanding to cover the more general cases.

      But, I never quite made it that far.

      Thanks for the vote of confidence anyway!

      Update: Let's attempt to rectify this situation.

      The analysis on the root node is a very specific, sort transformation technique. I've only just scratched the surface, but I believe this technique is very powerful and should be useful for nearly any sorting job. However, it's not the only technique out there. I think that by understanding this technique, you should be able to easily generalize this to better understand the other transformations.

      Guttman Rosler Transformations

      The next step up are the Guttman Rosler Transformations. The core idea of the GRT is to use the built in perl ascii sorting. In order to use this, we need to create stringified representations of our data first, then sort, then get the original data back. In a lot of cases, the map functions do a lot of work to put the data in the correct form, and then undo that work after the sort to get the data back. An example of this can be found at this node by jdporter.
      my @files = <DATA>; chomp @files; @files = map { substr $_, 10 } sort map { sprintf "%10s%s", /(\d+\w{3}\d{4})/, $_ } @files; print join "\n", @files, ''; __DATA__ fwlog.14Mar2005.gz fwlog.15Mar2005.gz fwlog.16Mar2005.gz fwlog.1Mar2005.gz fwlog.2Mar2005.gz fwlog.20Mar2005.gz
      Slightly modified to replace original @a array with more descriptive @files array

      In this case the OP wanted to sort the files by date. So, the above code took the array of text data in @files, simply grabbed the date string as-is, and appended that onto the front of the original string, sorted, and then extracted the original string. At first glance the map after the sort (the encoding map) doesn't do enough work to sort properly since it doesn't appear to align the days when dealing with single digits. However, it works because the sprintf("%10s"...) right justifies the text which does align the dates.

      Schwarzian Transformations

      In the same thread, there are multiple versions of a Schwartzian Transformation. A Schwartzian Transformation generally creates an array of 'stuff'; part of that array is the original data, other parts are the pieces to sort against. And the call back sorting routine is specified. Probably the best ST answer given is this one by RazorbladeBidet.
      my @months = qw( Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ); my %months; my $i = 1; $months{$_} = sprintf( "%02d", $i++ ) for @months; print $_, "\n" for map { $_->[0] } sort { $a->[1] <=> $b->[1] or $a->[2] <=> $b->[2] or $a->[3] <=> $b->[3] } map { (split(/\./,$_,3))[1] =~ /^(\d+)([A-Za-z]+)(\d+)$/; [$_, $3, $months{$2}, sprintf("%02d", $1) ] } @files;
      This is slightly modified from the original to show the intent of the original author. The original had sprintf("%02d", $3), or against the 4 digit year, which ends up to not do anything. I believe he'd intended to use that on the date itself.

      Here, he goes through the trouble to map the month abreviations to numbers via a generated hash, and makes sure to sort by year then month then day. Let's see how this works.

      Again the map statement after the sort is executed first, and it looks pretty complicated. The split is matching the periods in the filenames, and is only keeping the 2nd part which is the date information. It then matches and returns the individual parts of the date in $1, $2 and $3 corresponding to the day, month, year. Then, an array of 4 items is created, the original string, the year, the month in numeric format, and finally the date with a leading zero if date is a single digit. This is done for every member of the @files array.

      The resulting array of arrays is sent to the sort routine which uses the supplied call back function to compare first the year, then the numeric version of the month, and finally the date.

      The map function before the sort then gets called to pull out only the original version of the data, which is then printed rather than stored.

      Conclusion

      Hopefully between the original node, and this subsequent analysis of the other transformations, this has given you enough understanding to use these powerful sorting techniques for your own needs.

      -Scott

        (split(/\./,$_,3))[1] =~ /^(\d+)([A-Za-z]+)(\d+)$/; [$_, $3, $months{$2}, sprintf("%02d", $1) ]
        I guess this is my day to point out bad uses of $1 in the first stage of a ST. {grin}

        So, to repeat what I said there...

        You are using $1, $2, $3 without testing the success of the match. This means that you might possibly be getting the previous round's data, resulting in duplicated output. What you should be doing instead is skipping over the erroneous entries, or perhaps dieing. Just to be different from the previous node, let's do the die thing:

        (split(/\./,$_,3))[1] =~ /^(\d+)([A-Za-z]+)(\d+)$/ or die "improperly + formed data: $_"; [$_, $3, $months{$2}, sprintf("%02d", $1) ]

        -- Randal L. Schwartz, Perl hacker
        Be sure to read my standard disclaimer if this is a reply.

Re: Understanding transformation sorts (ST, GRT), the details
by TStanley (Canon) on May 18, 2005 at 01:10 UTC
    Data Munging with Perl gives a very good example of both the ST and the GRT and does a good job of breaking them both down

    TStanley
    --------
    The only thing necessary for the triumph of evil is for good men to do nothing -- Edmund Burke
Re: Understanding transformation sorts (ST, GRT), the details
by jdporter (Canon) on May 18, 2005 at 13:30 UTC

    I agree with merlyn that it's better to try to grok the ST before the GRT, which, after all, is a special case of the ST. I'm not sure that the ST is always clearer, however.

    Interesting that the example of sorting IP addresses is used here, because that was the original scenario in which the GRT was first explored (by Michal Rutka, Uri Guttman, Larry Rosler and myself). We could probably say that it is a canonical example: The data itself is very simple, the string representation cannot be (usefully) used for sorting, and the transformation to and from a meaningfully sortable representation is trivial.

    @ips = # note: this converts any "unusual" reps to dotted-quad map { inet_ntoa $_ } sort map { inet_aton $_ } @ips;
    Or, to sort lines which have IPs embedded:
    @ips = map { substr $_, 4 } sort map { /(\d[.\d]+\d)/; inet_aton($1) . $_ } @ips;

    As you can see, this is no less clear than the equivalent ST. The maps are arguably simpler; and of course the sort is more trivial, since that's always the case.

    But sorting IPs is something of a special case. In general, one would probably have to resort to either pack or sprintf to generate the sortable string.

        I'm assuming well-formed input. If there's non-conformant data in the input, it would be better to have already detected and handled it.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perltutorial [id://457993]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (12)
As of 2014-09-30 19:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (381 votes), past polls