sorting a file - multilevel

by ini2005 (Novice)
on Jun 14, 2008 at 01:40 UTC (#692039=perlquestion)

ini2005 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a (big) file that I need to sort. Each line looks like this:

"1021135 1021291 + NT_077913.2 118788 118944 + NM_153254.1 LocusID:254173 UTR reference NM_153254.1 -1"

The problem is that I need to sort it in the following order:
1st - by the first column (number)
then the secondary criteria is the second column (number)
and the third criteria is the 10th column (string)

any advice on how to do it would be helpful. Thanks

Replies are listed 'Best First'.
Re: sorting a file - multilevel
by sgifford (Prior) on Jun 14, 2008 at 02:15 UTC
    As runrig mentioned, Unix sort(1) is a great tool for this, although the syntax sometimes requires a little trial and error.

    To do this from Perl, read each line into some kind of data structure, then define your own sorting function that compares two of these data structures by looking at each of the fields, returning 1 if the first is greater, -1 if the second is greater, or going on to the next field if they are the same. The cmp and <=> ("spaceship") operators will help you with this, and they can be cascaded with the || "or" operator.

    Here's a simple example (untested):

    sub mysort {
        return $a->[0] <=> $b->[0]
            || $a->[1] <=> $b->[1]
            || $a->[9] cmp $b->[9];
    }

    my @list;
    while (<>) {
        chomp;
        push @list, [ split ];
    }
    @list = sort mysort @list;
Re: sorting a file - multilevel
by runrig (Abbot) on Jun 14, 2008 at 01:56 UTC
    I would just use the Unix sort command (not Perl's sort function). Except that it looks like the "10th" position in your file is just the string "UTR". How are you counting columns?

      Yes, the 10th col is UTR, but it varies; it can be GENE, CDS, RNA...

      another problem is that I need GENE to always be first (not regular lexicographic sort)

        another problem is that I need GENE to always be first (not regular lexicographic sort)
        Here's a sample (the sed and awk can easily be replaced by perl...left as an exercise):

        #!/bin/ksh

        awk 'BEGIN {
          SORTCD["GENE"] = 1
          SORTCD["CDS"] = 2
          SORTCD["RNA"] = 3
        }
        { print SORTCD[$3], $0 }' <<EOT |
        1 1 RNA
        1 1 GENE
        1 2 CDS
        EOT
        sort -n -k2,3 -k1,1 | sed -e 's/^[0-9]* //'
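For readers who want to stay in Perl, the same rank-prefix idea can be sketched without awk, sort, or sed. This is only a rough illustration, not runrig's code: the %rank table (including the order of everything after GENE), the sort_lines name, and the sample lines are assumptions made up for the example, and the column positions follow the original question (columns 1, 2 and 10) rather than the three-column sample above.

```perl
use strict;
use warnings;

# Hypothetical ranking for the 10th column: GENE first. The order of the
# remaining types is an assumption; adjust it as needed.
my %rank = (GENE => 0, CDS => 1, UTR => 2, RNA => 3);
my $last = keys %rank;    # rank given to any type missing from %rank

# Return the given lines sorted by column 1 (numeric), column 2 (numeric),
# then by the rank of column 10. Unranked types sort last, alphabetically.
sub sort_lines {
    my @decorated = map {
        my @f    = split ' ', $_;
        my $type = defined $f[9] ? $f[9] : '';
        my $r    = exists $rank{$type} ? $rank{$type} : $last;
        [ $_, $f[0], $f[1], $r, $type ];
    } @_;
    return map { $_->[0] }
           sort {
                  $a->[1] <=> $b->[1]
               || $a->[2] <=> $b->[2]
               || $a->[3] <=> $b->[3]
               || $a->[4] cmp $b->[4]
           } @decorated;
}

# Made-up sample lines with ten columns each, for illustration only.
my @sample = (
    "5 9 + c1 0 0 + id loc RNA\n",
    "5 9 + c1 0 0 + id loc GENE\n",
    "10 1 + c1 0 0 + id loc GENE\n",
    "5 2 + c1 0 0 + id loc CDS\n",
);
print sort_lines(@sample);
```

This holds every line in memory, so it only suits files that fit in RAM; for larger files the awk/sort/sed pipeline above is the safer route.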
Re: sorting a file - multilevel
by salva (Canon) on Jun 14, 2008 at 11:18 UTC
    Hi, I have a (big) file that I need to sort

    "big" is a very relative term, could you provide something more specific?

    If you have enough RAM to load all the data in an array, Sort::Key will allow you to sort it easily and probably faster than with any other method:

    use Sort::Key::Multi qw(u3_keysort); # u3 stands for 3 unsigned integer keys

    my $ix = 0;
    my %map_10th = map { $_ => $ix++ } qw(GENE UTR ...);
    my @data = ...;
    my @sorted = u3_keysort {
        my @key = split /\s+/;
        ($key[0], $key[1], $map_10th{$key[9]})
    } @data;

    If you don't have enough RAM, then try with Sort::External or just with the sort command provided by your OS.

Re: sorting a file - multilevel
by jethro (Monsignor) on Jun 14, 2008 at 02:24 UTC
    Has the first number always the same length? If yes, you can use unix sort (like runrig suggested) as a first step.

    Afterwards the file is sorted by your first and second criteria. Only lines with the same first and second columns are still unsorted, but they are on consecutive lines, and each such group is small enough to be sorted in memory.

    So your program should now read lines from the presorted file and collect lines with equal first and second columns. Sort them with perl sort on the 10th column and write them to a new file.

    The new file is now sorted according to your criteria.
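The group-by-group pass described above might look roughly like the following. It is a sketch under assumptions, not jethro's code: the sort_groups name and the sample data are invented for illustration, the input is assumed to be already sorted on columns 1 and 2, and column 10 is compared with plain cmp (a GENE-first order would need a rank lookup in the comparison instead).

```perl
use strict;
use warnings;

# Read lines that are already sorted by columns 1 and 2 from $in, buffer each
# run of lines sharing those two columns, sort the run by column 10, and write
# it to $out. Only one run is held in memory at a time.
sub sort_groups {
    my ($in, $out) = @_;
    my ($key, @group);
    my $flush = sub {
        print {$out} map  { $_->[1] }
                     sort { $a->[0] cmp $b->[0] }
                     @group;
        @group = ();
    };
    while (my $line = <$in>) {
        my @f = split ' ', $line;
        next unless @f >= 2;                      # skip blank or short lines
        my $k = "$f[0] $f[1]";
        $flush->() if defined $key && $k ne $key; # key changed: emit the run
        $key = $k;
        push @group, [ defined $f[9] ? $f[9] : '', $line ];
    }
    $flush->();                                   # emit the final run
}

# Made-up presorted input, read through an in-memory filehandle.
my $presorted = "1 1 + a 0 0 + b c RNA\n"
              . "1 1 + a 0 0 + b c GENE\n"
              . "1 2 + a 0 0 + b c UTR\n"
              . "1 2 + a 0 0 + b c CDS\n";
open my $demo_in, '<', \$presorted or die "cannot open in-memory input: $!";
sort_groups($demo_in, \*STDOUT);
```

With real data, $in would be a handle on the presorted file and $out a handle on the new file.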

    If unix sort doesn't change the ordering of lines that are equal (which I believe it does, but I'm not sure) then you can do the complete sorting with unix sort. Just use sort with parameter -k 10 to first sort the file by the 10th column, then with -k 1,2 to sort by the first and second column.

      Has the first number always the same length?

      Length of a numeric field is not an issue. Using unix (or gnu) sort, the OP's problem would be a simple command line:

      sort -k 1n -k 2n -k 10 big.file > sorted.big.file
      That's equivalent to doing something like this in perl (but the perl version might take a lot longer, esp. if the file, stored in perl as an AoA, is bigger than available RAM):
      perl -lane 'push @f,[@F]; END{ print join(" ",@$_) for (sort{$$a[0]<=>$$b[0] || $$a[1]<=>$$b[1] || $$a[9] cmp $$b[9]} @f)}' big.file > sorted.big.file
Re: sorting a file - multilevel
by CountZero (Bishop) on Jun 14, 2008 at 18:45 UTC
    If it is a really big file, dump it into a database, index the fields you have to sort on and write some simple SQL to do the sort: SELECT * FROM BigTable ORDER BY Field01, Field02, Field10_bis

    You will of course have to add a Field10_bis so it sorts the 10th field in the required order!


    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
