Sorting records on a single field

by TStanley (Canon)
on Jan 20, 2010 at 17:00 UTC
TStanley has asked for the wisdom of the Perl Monks concerning the following question:

I have the following data file, which has been extracted from a log file in Tuxedo:
    100644:MWTP_CAT:12002: SERVER:pid=14520:Execution time:TPR015-10:(millisec):53
    100644:MWTP_CAT:12002: SERVER:pid=15866:Execution time:TPR015-10:(millisec):10
    100644:MWTP_CAT:12002: SERVER:pid=15866:Execution time:TPR015-10:(millisec):33
    100644:MWTP_CAT:12002: SERVER:pid=16565:Execution time:TPR007-12:(millisec):437
    100644:MWTP_CAT:12002: SERVER:pid=16565:Execution time:TPR007-12:(millisec):470
    100644:MWTP_CAT:12002: SERVER:pid=16048:Execution time:TPR009-30:(millisec):24
    100644:MWTP_CAT:12002: SERVER:pid=15866:Execution time:TPR012-01E:(millisec):63
    100644:MWTP_CAT:12002: SERVER:pid=10427:Execution time:ISCST044:(millisec):0
    100644:MWTP_CAT:12002: SERVER:pid=15866:Execution time:TPR012-01E:(millisec):85
    100644:MWTP_CAT:12002: SERVER:pid=10428:Execution time:01201E:(millisec):3
I need to sort this data by the number of milliseconds (execution time), so the above would look like:
    100644:MWTP_CAT:12002: SERVER:pid=16565:Execution time:TPR007-12:(millisec):470
    100644:MWTP_CAT:12002: SERVER:pid=16565:Execution time:TPR007-12:(millisec):437
    100644:MWTP_CAT:12002: SERVER:pid=15866:Execution time:TPR012-01E:(millisec):85
    100644:MWTP_CAT:12002: SERVER:pid=15866:Execution time:TPR012-01E:(millisec):63
    100644:MWTP_CAT:12002: SERVER:pid=14520:Execution time:TPR015-10:(millisec):53
    100644:MWTP_CAT:12002: SERVER:pid=15866:Execution time:TPR015-10:(millisec):33
    100644:MWTP_CAT:12002: SERVER:pid=16048:Execution time:TPR009-30:(millisec):24
    100644:MWTP_CAT:12002: SERVER:pid=15866:Execution time:TPR015-10:(millisec):10
    100644:MWTP_CAT:12002: SERVER:pid=10428:Execution time:01201E:(millisec):3
    100644:MWTP_CAT:12002: SERVER:pid=10427:Execution time:ISCST044:(millisec):0
I know that the sort feature I need specifically would be sort { $a cmp $b } but I am unsure as to how I would extract and sort. As always, just give pointers in the right direction. Thanks.

TStanley
--------
People sleep peaceably in their beds at night only because rough men stand ready to do violence on their behalf. -- George Orwell

Re: Sorting records on a single field
by jwkrahn (Monsignor) on Jan 20, 2010 at 17:21 UTC
    print for map $_->[1], sort { $b->[0] <=> $a->[0] } map [ /\(millisec\):(\d+)/, $_ ], @data;
      If the number of interest is always at the end of the lines then a simple /(\d+)$/ would do.
      A combination of rindex and substr instead of the regex would probably be even faster.

      See also this reference work for more pointers.
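The rindex/substr idea mentioned above could look roughly like the sketch below. It assumes each line has already been chomped and that the number of interest is everything after the last colon; `last_field` is a hypothetical helper name, not something from the thread.

```perl
use strict;
use warnings;

# Sample records in the format shown in the question.
my @data = (
    '100644:MWTP_CAT:12002: SERVER:pid=14520:Execution time:TPR015-10:(millisec):53',
    '100644:MWTP_CAT:12002: SERVER:pid=16565:Execution time:TPR007-12:(millisec):470',
    '100644:MWTP_CAT:12002: SERVER:pid=10427:Execution time:ISCST044:(millisec):0',
);

# last_field() is a hypothetical helper: it returns everything after the
# final colon. If a line has no colon, rindex gives -1 and the whole line
# is returned.
sub last_field {
    my ($line) = @_;
    return substr $line, rindex($line, ':') + 1;
}

# Same decorate / sort / undecorate shape as the one-liner above,
# with the regex swapped for last_field().
my @sorted = map  { $_->[1] }
             sort { $b->[0] <=> $a->[0] }
             map  { [ last_field($_), $_ ] } @data;

print "$_\n" for @sorted;
```

Whether this is actually faster than the regex would need to be measured, for example with the core Benchmark module.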
Re: Sorting records on a single field
by almut (Canon) on Jan 20, 2010 at 17:25 UTC
    As always, just give pointers in the right direction.

    Ok, here are your pointers :)

    • Use a regex or split to extract the column of interest
    • Use the Schwartzian Transform to do the actual sorting. The ST avoids having to do the (relatively expensive) extraction procedure anew for each pairwise comparison ($a <=> $b).
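The two pointers above can be sketched as three separate steps. This is only an illustration, assuming the time is always the last colon-separated field; the intermediate array names are made up for readability, and in practice the three steps are usually chained into a single expression.

```perl
use strict;
use warnings;

# A few records in the format shown in the question.
my @data = (
    '100644:MWTP_CAT:12002: SERVER:pid=14520:Execution time:TPR015-10:(millisec):53',
    '100644:MWTP_CAT:12002: SERVER:pid=16565:Execution time:TPR007-12:(millisec):470',
    '100644:MWTP_CAT:12002: SERVER:pid=16048:Execution time:TPR009-30:(millisec):24',
    '100644:MWTP_CAT:12002: SERVER:pid=10427:Execution time:ISCST044:(millisec):0',
);

# Step 1: pair each line with its extracted key (done once per line).
my @pairs = map { [ (split /:/, $_)[-1], $_ ] } @data;

# Step 2: sort the pairs on the precomputed key, largest first.
my @ordered = sort { $b->[0] <=> $a->[0] } @pairs;

# Step 3: throw the keys away again, keeping only the original lines.
my @sorted = map { $_->[1] } @ordered;

print "$_\n" for @sorted;
```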
Re: Sorting records on a single field
by ack (Deacon) on Jan 20, 2010 at 17:44 UTC

    There are several good references on sorting in the Tutorials section of the Monastery. I would look in the subsection Getting Deeper Into Perl and the sub-subsection List Processing, Filtering, and Sorting. In particular you should look at transformational sorts; the Schwartzian Transform is, I think, one of the more popular ones and should meet your needs.

    I would suggest, in particular, any one of three of the tutorials:

    A brief tutorial on Perl's native sorting facilities by BrowserUK,

    Resorting to Sorting by japhy, or

    Complex sorting by vroom

    The first is a good place to start, but the other two are really good, too, IMHO.

    Good luck.

    UPDATE: One thing I should've said (I didn't think about this until I got home last night) is that when you write your sort subroutine (which is what gives the transformational sorts, like the Schwartzian Transform, their power), you'll need to parse each line of the input file to isolate the time data that you want to sort on. This also means that you'd need to read (slurp) the entire file into an array, since the types of sorting mentioned in the references I posted work in memory. There are some CPAN modules (e.g., Sort::Array) that I believe can sort files without having to slurp the entire file into memory, but I don't have any experience with them, so I'm not sure whether they can actually do that; maybe other Monks could guide you on that. Again, good luck.

    ack Albuquerque, NM
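As a rough sketch of the slurp-then-sort approach described in the update: the example below reads every line into an array, parses the trailing number once per line, and sorts on it. To stay self-contained it reads from an in-memory filehandle; a real script would open the extracted log file instead (its path is not given in the thread). Treating lines without a trailing number as 0 is an assumption of this sketch.

```perl
use strict;
use warnings;

# Stand-in for the real log extract, so the example runs on its own.
my $text = join '', map { "$_\n" } (
    '100644:MWTP_CAT:12002: SERVER:pid=14520:Execution time:TPR015-10:(millisec):53',
    '100644:MWTP_CAT:12002: SERVER:pid=16565:Execution time:TPR007-12:(millisec):437',
    '100644:MWTP_CAT:12002: SERVER:pid=10428:Execution time:01201E:(millisec):3',
);
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

# Slurp every line into memory, dropping the trailing newlines.
chomp( my @lines = <$fh> );
close $fh;

# Parse the key once per line, then sort on it (largest first).
# Assumption: a line with no trailing digits is given the key 0.
my @sorted = map  { $_->[1] }
             sort { $b->[0] <=> $a->[0] }
             map  { [ /(\d+)$/ ? $1 : 0, $_ ] } @lines;

print "$_\n" for @sorted;
```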

        I just checked out that reference; it is, indeed, an excellent one. I did not know about it, and it is a good addition to my tool bag of "places to look" for info on sorting. Thanks, almut.

        ack Albuquerque, NM
Re: Sorting records on a single field
by Anonymous Monk on Jan 20, 2010 at 18:15 UTC

    If you're just looking for a quick and dirty way to do this, you can do it in your shell.

    sort -rnt: -k9

    Sort reverse, numeric, field separator colon, field 9.

Re: Sorting records on a single field
by planetscape (Canon) on Jan 21, 2010 at 00:11 UTC
Re: Sorting records on a single field
by Lain78 (Initiate) on Jan 21, 2010 at 10:03 UTC

    Hi! I don't have extensive experience with sorting methods, but in this case I think a simple approach would work, like this:

    use strict;
    use warnings;

    my $Line;          # one input line
    my @SortedData;    # resulting sorted data set

    # data set example
    my @Data = (
        '100644:MWTP_CAT:12002: SERVER:pid=14520:Execution time:TPR015-10:(millisec):53',
        '100644:MWTP_CAT:12002: SERVER:pid=15866:Execution time:TPR015-10:(millisec):10',
        '100644:MWTP_CAT:12002: SERVER:pid=15866:Execution time:TPR015-10:(millisec):33',
        '100644:MWTP_CAT:12002: SERVER:pid=16565:Execution time:TPR007-12:(millisec):437',
        '100644:MWTP_CAT:12002: SERVER:pid=16565:Execution time:TPR007-12:(millisec):470',
        '100644:MWTP_CAT:12002: SERVER:pid=16048:Execution time:TPR009-30:(millisec):24',
        '100644:MWTP_CAT:12002: SERVER:pid=15866:Execution time:TPR012-01E:(millisec):63',
        '100644:MWTP_CAT:12002: SERVER:pid=10427:Execution time:ISCST044:(millisec):0',
        '100644:MWTP_CAT:12002: SERVER:pid=15866:Execution time:TPR012-01E:(millisec):85',
        '100644:MWTP_CAT:12002: SERVER:pid=10428:Execution time:01201E:(millisec):3',
    );

    # create sorted data set
    @SortedData = reverse sort { (split (/:/, $a))[-1] <=> (split (/:/, $b))[-1] } @Data;

    ### DEBUG: print input and output sets ###
    print "Data Set is:\n", join ("\n", @Data), "\n";
    print "Sorted Data is:\n", join ("\n", @SortedData), "\n";
      You can get rid of the reverse operation simply by swapping the operands of the comparison. In other words, instead of reverse sort { $a <=> $b } @data use sort { $b <=> $a } @data.

      In the OP's case:

      @SortedData = sort { (split (/:/, $b))[-1] <=> (split (/:/, $a))[-1] } @Data;

      Using reverse also makes the sort operation unstable (entries with equal sorting keys do not keep their relative positions after the sort operation).

        I see that switching the operands is better, but I don't understand what you mean by "equal sorting keys do not keep their relative positions after the sort operation"; I am probably missing something. Could you explain that point in more depth? Thanks.
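A small sketch of what "relative positions" refers to here. The records and labels are made up for illustration, and `use sort 'stable'` is added so that the stability of the sort itself is not in question. How Perl handles `reverse sort { ... }` written as a single expression may depend on the Perl version, so the sketch stores the ascending result first and reverses it as a separate step.

```perl
use strict;
use warnings;
use sort 'stable';    # ask for a stable sort explicitly

# Hypothetical records: [ label, key ]; 'first' and 'second' share key 10.
my @recs = ( [ first => 10 ], [ second => 10 ], [ third => 5 ] );

# Swapped operands: records with equal keys stay in their input order.
my @swapped = map { $_->[0] } sort { $b->[1] <=> $a->[1] } @recs;

# Ascending sort, then reversing the finished list: the two records
# with equal keys come out in the opposite order from the input.
my @ascending = sort { $a->[1] <=> $b->[1] } @recs;
my @reversed  = map { $_->[0] } reverse @ascending;

print "swapped operands:  @swapped\n";     # first second third
print "sort then reverse: @reversed\n";    # second first third
```

Both results are correctly ordered by key; they differ only in how the two records that tie on key 10 are arranged.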
