sorting a file - multilevel

ini2005 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: sorting a file - multilevel by sgifford (Prior) on Jun 14, 2008 at 02:15 UTC
As runrig mentioned, Unix sort(1) is a great tool for this, although the syntax sometimes requires a little trial and error. To do this from Perl, read each line into some kind of data structure, then define your own sorting function that compares two of these data structures by looking at each of the fields, returning 1 if the first is greater, -1 if the second is greater, or going on to the next field if they are the same. The `cmp` and `<=>` ("spaceship") operators will help you with this, and they can be cascaded with the `||` "or" operator. Here's a simple example (untested): `sub mysort { return $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] || $a->[9] cmp $b->[9] } my @list; while (<>) { chomp; push @list, [ split ]; } @list = sort mysort @list;`
Re: sorting a file - multilevel by runrig (Abbot) on Jun 14, 2008 at 01:56 UTC
I would just use Unix sort(1) (not Perl's sort). Except that it looks like the "10th" position in your file is just the string "UTR". How are you counting columns?
Re^2: sorting a file - multilevel by ini2005 (Novice) on Jun 14, 2008 at 10:08 UTC
Yes, the 10th col is UTR but it varies; it can be GENE, CDS, RNA... Another problem is that I need GENE to always be first (not regular lexicographic sort)
Re^3: sorting a file - multilevel by runrig (Abbot) on Jun 25, 2008 at 20:34 UTC
"another problem is that I need GENE to always be first (not regular lexicographic sort)"

That's doable. Here's a sample (the sed and awk can easily be replaced by perl; left as an exercise):

`#!/bin/ksh
awk 'BEGIN {
SORTCD["GENE"] = 1
SORTCD["CDS"] = 2
SORTCD["RNA"] = 3
}
{ print SORTCD[$3], $0 }' <<EOT |
1 1 RNA
1 1 GENE
1 2 CDS
EOT
sort -n -k2,3 -k1,1 | sed -e 's/^[0-9]* //'`
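For a 10-column file like the one in the question, the same decorate, sort, undecorate idea might be sketched as follows. This is only an illustration under stated assumptions: the function name `rank_sort` is made up, the rank table after GENE (CDS, RNA, UTR) is a guess since the thread only requires GENE to come first, and any type missing from the table is given rank 99 so it sorts last.

```shell
#!/bin/sh
# Sketch: sort by columns 1 and 2 numerically, then by a custom rank of
# column 10. Only "GENE first" comes from the thread; the rest of the rank
# table and the function name are assumptions made for this example.
rank_sort() {
    # Decorate: prefix each line with the rank of its 10th column.
    awk 'BEGIN { rank["GENE"] = 1; rank["CDS"] = 2; rank["RNA"] = 3; rank["UTR"] = 4 }
         { r = ($10 in rank) ? rank[$10] : 99; print r, $0 }' |
    # Sort: original columns 1 and 2 are now fields 2 and 3; the rank is field 1.
    LC_ALL=C sort -k 2,2n -k 3,3n -k 1,1n |
    # Undecorate: drop the rank prefix again.
    cut -d ' ' -f 2-
}

# Usage with made-up sample rows (10 whitespace-separated columns).
printf '%s\n' \
    '1 10 a b c d e f g RNA' \
    '1 10 a b c d e f g GENE' \
    '1 9 a b c d e f g UTR' \
    '1 10 a b c d e f g CDS' | rank_sort
```

Because the rank is an ordinary numeric key, the whole job stays a single pass through sort(1), which can spill to temporary files when the input does not fit in RAM.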
Re: sorting a file - multilevel by salva (Canon) on Jun 14, 2008 at 11:18 UTC
"Hi, I have a (big) file that I need to sort"

"big" is a very relative term; could you provide something more specific? If you have enough RAM to load all the data in an array, Sort::Key will allow you to sort it easily and probably faster than with any other method: `use Sort::Key::Multi qw(u3_keysort); my $ix = 0; my %map_10th = map { $_ => $ix++ } qw(GENE UTR ...); my @data = ...; my @sorted = u3_keysort { my @key = split /\s+/; ($key[0], $key[1], $map_10th{$key[9]}) } @data; # u3 stands for 3 unsigned integer keys` If you don't have enough RAM, then try with Sort::External or just with the `sort` command provided by your OS.
Re: sorting a file - multilevel by jethro (Monsignor) on Jun 14, 2008 at 02:24 UTC
Has the first number always the same length? If yes, you can use unix sort (like runrig suggested) as a first step. Afterwards the file is sorted by your first and second criteria. Only lines with the same first and second columns are still unsorted, but they are on consecutive lines and small enough to be sorted in memory. So your program should now read lines from the presorted file and collect lines with equal first and second columns. Sort them with perl sort on the 10th column and write them to a new file. The new file is now sorted by your criteria. If unix sort doesn't change the ordering of lines that are equal (which I believe it does, but I'm not sure) then you can do the complete sorting with unix sort. Just use sort with `-k 10,10` to first sort the file by the 10th column, then with `-k 1,2` to sort by the first and second columns.
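A minimal sketch of the two-pass idea, assuming a sort(1) that supports the `-s` (stable) flag, as GNU coreutils and the BSD implementations do. Without `-s`, GNU sort breaks ties by comparing whole lines, so the order from the first pass would not be preserved. The function name `two_pass_sort` and the sample rows are made up for illustration.

```shell
#!/bin/sh
# Two-pass multilevel sort that relies on a stable second pass.
# Assumption: sort(1) supports -s (stable); the function name is hypothetical.
two_pass_sort() {
    # Pass 1: order lines by the 10th column only.
    # Pass 2: order numerically by columns 1 and 2; -s keeps lines that tie
    # on both keys in the order produced by pass 1.
    LC_ALL=C sort -k 10,10 | LC_ALL=C sort -s -k 1,1n -k 2,2n
}

# Usage with made-up sample rows (10 whitespace-separated columns).
printf '%s\n' \
    '2 5 a b c d e f g UTR' \
    '1 10 a b c d e f g RNA' \
    '1 10 a b c d e f g CDS' \
    '1 9 a b c d e f g GENE' \
    '2 5 a b c d e f g CDS' | two_pass_sort
```

Note that this orders the 10th column lexicographically within each group; it does not by itself put GENE first.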
Re^2: sorting a file - multilevel by graff (Chancellor) on Jun 14, 2008 at 14:33 UTC
"Has the first number always the same length?"

Length of a numeric field is not an issue. Using unix (or gnu) sort, the OP problem would be a simple command line: `sort -k 1n -k 2n -k 10 big.file > sorted.big.file` That's equivalent to doing something like this in perl (but the perl version might take a lot longer, esp. if the file, stored in perl as an AoA, is bigger than available RAM): `perl -lane 'push @f,[@F]; END{ print join(" ",@$_) for (sort{$$a[0]<=>$$b[0] || $$a[1]<=>$$b[1] || $$a[9] cmp $$b[9]} @f)}' big.file > sorted.big.file`
Re: sorting a file - multilevel by CountZero (Bishop) on Jun 14, 2008 at 18:45 UTC
If it is a really big file, dump it into a database, index the fields you have to sort on and write some simple SQL to do the sort: `SELECT * FROM BigTable ORDER BY Field01, Field02, Field10_bis` You will of course have to add a Field10_bis so it sorts the 10th field in the required order!

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

