http://www.perlmonks.org?node_id=11135107


in reply to efficient perl code to count, rank

Like others said, you'd need to put serious work into an SSCCE.

From what I could glimpse, I'd say this kind of work is normally done in a database. (Who needs 14m rows sorted except in a database?)

And I agree with the others that in your Perl solution memory is most likely the bottleneck.

So avoid loading the whole file and it will be way faster.

Most of what you describe can be easily done without keeping everything in memory, simply by processing line by line.
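For instance, a per-line count can be computed while streaming the file, keeping only the counts. This is just a sketch: the in-memory sample data, the tab separator and the "count the non-empty fields" rule are my placeholders, not your actual format.

```perl
use strict; use warnings;

# Sample data standing in for the real file; in practice open a filename.
my $data = "a\tb\t\tc\n\t\tx\n";
open my $fh, '<', \$data or die $!;

my @counts;
while (my $line = <$fh>) {
    chomp $line;
    # -1 keeps trailing empty fields; grep counts the non-empty ones
    my $count = grep { length } split /\t/, $line, -1;
    push @counts, $count;          # only the count is kept, not the line
}
close $fh;

print "@counts\n";                 # 3 1
```

Memory stays proportional to the number of lines (one integer each), not to the 1100-1500 columns per line.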

BUT sorting is trickier.

A pragmatic way is to keep only the "count" plus an associated line number (or seek position into the unsorted file) in memory for sorting. This will reduce your memory consumption by a factor of your "1100 to 1500 columns".

In a second phase you can then reorder the lines.
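The two phases can be sketched like this. Again the sample data and the "count = first field" rule are assumptions for illustration; phase 1 records only (count, byte offset) pairs, phase 2 re-reads the file in sorted order via seek():

```perl
use strict; use warnings;

# Sample data standing in for the real (large) unsorted file.
my $data = "5 row-a\n2 row-b\n9 row-c\n";
open my $fh, '<', \$data or die $!;

# Phase 1: one small array entry per line, the line itself is discarded.
my @index;                                   # [ count, seek position ]
my $pos = tell $fh;
while (my $line = <$fh>) {
    my ($count) = split ' ', $line;          # placeholder: count = 1st field
    push @index, [ $count, $pos ];
    $pos = tell $fh;
}

# Phase 2: emit lines in descending count order by seeking back.
my @sorted;
for my $entry (sort { $b->[0] <=> $a->[0] } @index) {
    seek $fh, $entry->[1], 0 or die $!;
    push @sorted, scalar <$fh>;
}
close $fh;

print @sorted;    # 9 row-c / 5 row-a / 2 row-b
```

Note that phase 2 does random-access reads, so it is slower per line than a sequential pass, but only two integers per row ever live in memory.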

E.g. my laptop ran the following code in under 2 minutes, sorting 14m arrays [ random rank, id ].

    use strict; use warnings;
    use Data::Dump qw/pp dd/;

    my @a = sort { $a->[0] <=> $b->[0] }
            map  { [ rand 14e6, $_ ] }
            0 .. 14e6;

    pp [ @a[0..100] ];    # show me the first 100

This included the overhead for swapping; my fan was roaring. But I suppose you have far more RAM at hand.

Otherwise there are certainly CPAN modules like File::Sort (NB: no experience with it, no recommendation!) which can do the heavy lifting for you.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery