Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris

Re: Emulating 'sort' command through a Perl

by merlyn (Sage)
on Oct 29, 2009 at 07:12 UTC ( #803896=note: print w/replies, xml ) Need Help??

in reply to Emulating 'sort' command through a Perl

my $column_number = 3; # 0-origin based my $prev = ""; for ( map { $_->[0] } sort { $a->[1] cmp $b->[1] } map { [$_, (split)[$column_number]] } <> ) { print unless $_ eq $prev; $prev = $_; }

-- Randal L. Schwartz, Perl hacker

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Replies are listed 'Best First'.
Re^2: Emulating 'sort' command through a Perl
by paragkalra (Scribe) on Oct 29, 2009 at 20:06 UTC

    Hi Randal,

    Thanks a bunch..

    Today I have attained the "Nirvana" because of you.

    My primary interests being Electronics, Physics, Movies, Novels & Linux, I generally don't read books on software engineering. Most of the times I do Googly.

    However your's 'Learning Perl' is the only computer related book which I have read in my life. I just finished it few weeks back.

    And I am not exaggerating it but I finished it like a Novel. It was so easy and simple to understand.

    Its because of your book I have started loving & enjoying Perl. Great Job Randal!!! I do owe many things to you...

    I always wanted to meet you and with you replying to my post, my "Karma" has been blessed with gracious bliss....Lucky me...

    Coming to your piece of code....It worked like a breeze....All I can say is that - I am speechless

    But being frank & forthright, I couldn't understand it much :( . This is the most strangest & strongest Perl code I have ever seen with hardly any use of semicolons.

    My head is spinning between sort & map operators. Also I was not knowing 'cmp' operator as I had skip the section 13.4 of your book... :(

    I have never used the split operator in the fashion which you have used. I am not sure what '$_->[0], $->1 are doing'. I think they have something to do with references.

    If somebody can walk me through the above lines - explaining through one liner comments, I would be grateful...

      paragkalra, merlyn has demo'ed some advanced stuff. The technique he used is called a Schwartzian Transform and of course is named after him. This is a performance enhancement (not a functional enhancement) over less complex techniques.

      You asked: If somebody can walk me through the above lines.....

      You seem interested and I'm gonna give you some hints, but first I think you would be well advised to learn more about basic sorting. The following code is a "simple" sort and will be slower than merlyn's code as the number of lines get larger but it is functionally the same. I'll walk you through this and then the more advanced code will make more sense.

      #!/usr/bin/perl -w use strict; my $column_number =3; my $prev = ""; my @input_lines = <DATA>; #slurps DATA in all at once @input_lines = sort {my ($a_col) = (split(/\s+/,$a))[$column_number]; my ($b_col) = (split(/\s+/,$b))[$column_number]; $a_col cmp $b_col }@input_lines; foreach (@input_lines) { print unless $_ eq $prev; $prev = $_; } __DATA__ FT9mWp d4fgMB gvZRJU XRRu0N X 4ewejk pFnjd4 ie0hZF pPipQJ A 4ewejk 4sqprx ie0hZF cqtexi 3 Fo1OKn qhZPvb qWZPrt ruBObs 2
      First we want to get all the data lines. This can be done like above or with a while{} loop and pushing input lines onto the @inputlines array.

      Now we come to the sort. An important thing is to understand that what goes into a sort{} is what comes out! We can sort all kinds of complex structures. But what goes in, is what comes out is true. Here I assigned the output back to the input, which is completely legal and common.

      Sort works on some "magic variables", $a and $b. The sort function compares pairs of "things", whatever they are that are coming in. And $a and $b are assigned to those "thing pairs". Here we have just lines of text input. But this could be something like rows of an Array of rows (an ArrayofArray) or even other "things".

      @inputlines is what comes in (and is what is going to come out of the sort), but we have a problem because we only want to sort the input lines based on a subset of the line, the thing in the specified column. To get the right column, we use split() which generates a list of tokens that were separated by whitespace. Now only one of these columns is relevant. So this (split(/\s+/,$a))[$column_number] stuff is a "slice" and says that I only want the one of the things from the split() operation - the stuff in a particular $column_number. That is done and assigned twice once for the $a input and once for $b input. Then we do a "cmp" which is an alpha comparison yielding a "less than, equal, greater than" value. The sort{} function uses this decision to order the inputlines. Remember what I said before, what goes into a sort comes out of a sort, so this decision affects the inputlines, not just this particular column!!

      Then results are printed. Lines that are the same as previous lines are suppressed.

      Now as you see if you're following me so far, every time I want to compare 2 input lines, I have to go through this split() and then slice stuff. What merlyn's code does is precompute these comparison values once and then use these precomputed results in the sort. The sort doesn't do any fewer compares, but it spends less time figuring out exactly what to compare! The "cost" is building an intermediate data structure with those values (which takes time and memory).

      The Schwartzian Transform creates a new ArrayofArray, with each "row" containing the original thing and then the things that are to be used in the sort comparison function. That array is the "thing" that gets inputted into the sort. After the sort is done, we don't need these precomputed values any more, so the original input is re-constructed, ie the precomputed things are thrown away.

      Note that these map{},sort{}grep,map{} things process lists. Because the "line" could get way long and it is easier to read, they are usually "stacked", read this stack from the bottom up, actually right to left.

      Sorry post got long, but I hope this helps you understand at least the general idea of what is going on. I would recommend practicing more along the lines of my example before launching off into more complex techniques.

        Thanks Marshall for sharing your time & your wonderful notes...

      I think I have found 1 bug in the code.

      Suppose I am sorting a text file containing following rows by column 3 i.e numeric 2:





      The code should give the output:





      But it is giving:





      Difference can be found out through 2nd & 3rd cell values of 2nd column i.e numeric 1 of both the outputs.

        I think you are confused with column numbering. In Perl, the first column is number 0 and the second is number 1, etc. The the output you show is what the above program does (sorts on ",4th3rd" column). If you want a different column number to sort by adjust the number accordingly. Remember that sort order is according to ASCII values, but the Unix utilities, Excel will also sort alphanumeric fields this same way.

        In short, check you "column math" the program works according to my testing.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://803896]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (5)
As of 2018-06-18 05:54 GMT
Find Nodes?
    Voting Booth?
    Should cpanminus be part of the standard Perl release?

    Results (108 votes). Check out past polls.