ini2005 has asked for the wisdom of the Perl Monks concerning the following question:
Hi,
I have a (big) file that I need to sort. Each line looks like this:
"1021135 1021291 + NT_077913.2 118788 118944 + NM_153254.1 LocusID:254173 UTR reference NM_153254.1 -1"
The problem is that I need to sort it in the following order:
1st - by the first column (number)
then the secondary criterion is the second column (number)
and the third criterion is the 10th column (string)
Any advice on how to do it would be helpful.
Thanks
Re: sorting a file - multilevel
by sgifford (Prior) on Jun 14, 2008 at 02:15 UTC
As runrig mentioned, Unix sort(1) is a great tool for this, although the syntax sometimes requires a little trial and error.
To do this from Perl, read each line into some kind of data structure, then define your own sorting function that compares two of these data structures by looking at each of the fields, returning 1 if the first is greater, -1 if the second is greater, or going on to the next field if they are the same. The cmp and <=> ("spaceship") operators will help you with this, and they can be cascaded with the || "or" operator.
Here's a simple example (untested):
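# Compare two rows (array refs): columns 1 and 2 numerically,
# then column 10 as a string.
sub mysort
{
    return $a->[0] <=> $b->[0]
        ||
        $a->[1] <=> $b->[1]
        ||
        $a->[9] cmp $b->[9]
}
my @list;
while (<>)
{
    chomp;
    push @list, [ split ];    # split the line on whitespace
}
@list = sort mysort @list;
print join(" ", @$_), "\n" for @list;    # write out the sorted lines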
Re: sorting a file - multilevel
by runrig (Abbot) on Jun 14, 2008 at 01:56 UTC
I would just use the Unix sort command (not Perl's sort function). Except that it looks like the "10th" position in your file is just the string "UTR". How are you counting columns?
|
Yes, the 10th col is UTR but it varies, it can be GENE, CDS, RNA..
another problem is that I need GENE to always be first (not regular lexicographic sort)
|
another problem is that I need GENE to always be first (not regular lexicographic sort)
That's doable...here's a sample (the sed and awk can easily be replaced by perl...left as an exercise):
#!/bin/ksh
awk 'BEGIN {
SORTCD["GENE"] = 1
SORTCD["CDS"] = 2
SORTCD["RNA"] = 3
}
{ print SORTCD[$3], $0 }' <<EOT |
1 1 RNA
1 1 GENE
1 2 CDS
EOT
sort -k2,2n -k3,3n -k1,1n | sed -e 's/^[0-9]* //'
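For the "replace awk and sed with perl" exercise, a minimal sketch in plain Perl could look like the following. The GENE, CDS, RNA order is taken from the awk table above; where UTR goes, the `sort_lines` name and the sample rows are assumptions made up for illustration.

```perl
use strict;
use warnings;

# Rank for the 10th column; GENE sorts first. GENE/CDS/RNA follow the awk
# table above, the position of UTR is an assumption.
my %rank = (GENE => 0, CDS => 1, RNA => 2, UTR => 3);

# Sort whitespace-separated lines by column 1 (numeric), column 2
# (numeric), then by the custom rank of column 10.
sub sort_lines {
    my @rows = map { [ split ] } @_;
    my $last = scalar keys %rank;    # unknown types sort after known ones
    my @sorted = sort {
           $a->[0] <=> $b->[0]
        || $a->[1] <=> $b->[1]
        || ($rank{ $a->[9] } // $last) <=> ($rank{ $b->[9] } // $last)
        || $a->[9] cmp $b->[9]       # tie-break unknown types by name
    } @rows;
    return map { join ' ', @$_ } @sorted;
}

# Hypothetical sample rows with ten columns each.
my @sample = (
    '2 5 + c4 0 0 + c8 c9 RNA',
    '1 7 + c4 0 0 + c8 c9 CDS',
    '1 7 + c4 0 0 + c8 c9 GENE',
    '1 3 + c4 0 0 + c8 c9 UTR',
);
print "$_\n" for sort_lines(@sample);
```

With the sample rows this prints the UTR row (1 3) first, then the two 1 7 rows with GENE ahead of CDS, then the 2 5 row. It holds the whole file in memory, so it only suits files that fit in RAM.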
Re: sorting a file - multilevel
by salva (Canon) on Jun 14, 2008 at 11:18 UTC
Hi, I have a (big) file that I need to sort
"big" is a very relative term, could you provide something more specific?
If you have enough RAM to load all the data in an array, Sort::Key will allow you to sort it easily and probably faster than with any other method:
use Sort::Key::Multi qw(u3_keysort); # u3 stands for 3 unsigned integer keys
my $ix = 0;
my %map_10th = map { $_ => $ix++ } qw(GENE UTR ...);
my @data = ...;
my @sorted = u3_keysort {
my @key = split /\s+/;
($key[0], $key[1], $map_10th{$key[9]})
} @data;
If you don't have enough RAM, then try with Sort::External or just with the sort command provided by your OS.
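If Sort::Key is not installed, the same idea (extract the three keys once per line, then compare only the cached keys) can be sketched in core Perl as a Schwartzian transform. This is not the Sort::Key implementation and will likely be slower; the `keyed_sort` name, the type order after GENE and the sample lines are assumptions for illustration.

```perl
use strict;
use warnings;

# Assumed ordering of the 10th column; the thread only requires GENE first.
my @types = qw(GENE CDS RNA UTR);
my %type_rank;
@type_rank{@types} = (0 .. $#types);

# Schwartzian transform: split each line once, sort on the cached keys,
# then strip the keys again and return the original lines.
sub keyed_sort {
    return map  { $_->[0] }
           sort {    $a->[1] <=> $b->[1]
                  || $a->[2] <=> $b->[2]
                  || $a->[3] <=> $b->[3] }
           map  { my @f = split ' ';
                  # unknown types get a rank past the end of the table
                  [ $_, $f[0], $f[1], $type_rank{ $f[9] } // scalar @types ] }
           @_;
}

# Hypothetical sample lines (newline-terminated, as read from a file).
my @data = (
    "10 20 + x 1 2 + y z UTR tail\n",
    "10 20 + x 1 2 + y z GENE tail\n",
    "9 99 + x 1 2 + y z RNA tail\n",
);
print keyed_sort(@data);
```

Because the original line is carried along unchanged, the output keeps each line's exact spacing and trailing columns.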
Re: sorting a file - multilevel
by jethro (Monsignor) on Jun 14, 2008 at 02:24 UTC
Does the first number always have the same length? If yes, you can use Unix sort (like runrig suggested) as a first step.
Afterwards the file is sorted by your first and second criteria. Only lines with the same first and second columns are still unsorted, but they are on consecutive lines and small enough to be sorted in memory.
So your program should now read lines from the presorted file and collect lines with equal first and second columns. Sort them with Perl's sort on the 10th column and write them to a new file.
The new file is now sorted according to your criteria.
If Unix sort doesn't change the ordering of lines that compare equal, i.e. if it is stable (which I believe it is, but I'm not sure; GNU sort has the -s option to request this), then you can do the complete sorting with Unix sort. Just run sort with -s -k 10,10 to first sort the file by the 10th column, then run it again on the result with -s -k 1,1n -k 2,2n to sort by the first and second columns.
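The group-wise second step described above could be sketched like this. It assumes the input is already sorted on the first two columns, compares the 10th column as a plain string, and uses in-memory filehandles only so the example is self-contained; the `sort_groups` name and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Read lines already sorted on columns 1 and 2 from $in and write them to
# $out, with each run of equal (col1, col2) also sorted on column 10.
sub sort_groups {
    my ($in, $out) = @_;
    my ($group_key, @group);

    # Sort the buffered group on the 10th column and write it out.
    my $flush = sub {
        print {$out} map  { $_->[1] }
                     sort { $a->[0] cmp $b->[0] }
                     map  { [ (split ' ')[9], $_ ] } @group;
        @group = ();
    };

    while (my $line = <$in>) {
        my ($c1, $c2) = split ' ', $line;
        my $key = "$c1 $c2";
        # A new (col1, col2) pair closes the previous group.
        $flush->() if defined $group_key && $key ne $group_key;
        $group_key = $key;
        push @group, $line;
    }
    $flush->();    # write the last group
}

# Hypothetical presorted input; a real run would open files instead.
my $presorted = <<'EOT';
1 1 + a 0 0 + b c RNA
1 1 + a 0 0 + b c GENE
1 2 + a 0 0 + b c UTR
1 2 + a 0 0 + b c CDS
EOT
open my $in,  '<', \$presorted or die $!;
my $result = '';
open my $out, '>', \$result or die $!;
sort_groups($in, $out);
close $out;
print $result;
```

Only one group is held in memory at a time, which is the point of presorting with Unix sort first.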
|
sort -k 1,1n -k 2,2n -k 10,10 big.file > sorted.big.file
That's equivalent to doing something like this in perl (but the perl version might take a lot longer, esp. if the file, stored in perl as an AoA, is bigger than available RAM):
perl -lane 'push @f,[@F]; END{
print join(" ",@$_)
for (sort{$$a[0]<=>$$b[0] ||
$$a[1]<=>$$b[1] ||
$$a[9] cmp $$b[9]} @f)}' big.file > sorted.big.file
Re: sorting a file - multilevel
by CountZero (Bishop) on Jun 14, 2008 at 18:45 UTC