http://www.perlmonks.org?node_id=11135145


in reply to Re^4: efficient perl code to count, rank
in thread efficient perl code to count, rank

Just a reminder: you were the first to suggest that the OP needs more RAM, see Re: efficient perl code to count, rank

Anything can be done with Perl, but search and sort operations that require Perl to keep all data in memory are usually more easily solved (read: out of the box) with a DB.

Otherwise they require re-implementing sophisticated algorithms to manually "swap" RAM and Disk structures, which doesn't qualify as out-of-the-box for me.
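To illustrate what that manual "swapping" involves, here is a minimal sketch of an external merge sort: sort chunks that fit in RAM, spill each to a temporary file, then merge the sorted chunks. The helper names (`write_sorted_chunks`, `next_value`, `merge_chunks`), the chunk size and the one-number-per-line input are assumptions made for illustration, not anything posted in this thread.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read numbers (one per line) from $in_fh, sort them in
# chunks of at most $chunk_size lines, and spill each sorted chunk to disk.
# Returns one read handle per chunk, each rewound to its start.
sub write_sorted_chunks {
    my ($in_fh, $chunk_size) = @_;
    my (@chunk_fhs, @buffer);
    my $flush = sub {
        return unless @buffer;
        my $tmp = tempfile();    # scalar context: handle to an anonymous temp file
        print {$tmp} "$_\n" for sort { $a <=> $b } @buffer;
        seek $tmp, 0, 0 or die "seek failed: $!";
        push @chunk_fhs, $tmp;
        @buffer = ();
    };
    while (my $line = <$in_fh>) {
        chomp $line;
        push @buffer, $line;
        $flush->() if @buffer >= $chunk_size;
    }
    $flush->();                  # spill the last, possibly partial, chunk
    return @chunk_fhs;
}

# Hypothetical helper: next chomped line of a chunk, or undef at end of file.
sub next_value {
    my ($fh) = @_;
    my $line = readline $fh;
    chomp $line if defined $line;
    return $line;
}

# Hypothetical helper: merge the sorted chunks by repeatedly taking the
# smallest head value. Only one value per chunk is held in RAM at a time.
sub merge_chunks {
    my @fhs   = @_;
    my @heads = map { next_value($_) } @fhs;
    my @merged;                  # a real run would print instead of collecting
    while (1) {
        my $min;
        for my $i (0 .. $#heads) {
            next unless defined $heads[$i];
            $min = $i if !defined $min || $heads[$i] < $heads[$min];
        }
        last unless defined $min;
        push @merged, $heads[$min];
        $heads[$min] = next_value($fhs[$min]);
    }
    return @merged;
}

# Demo on an in-memory file with a deliberately tiny chunk size.
my $data = "5\n3\n9\n1\n7\n";
open my $in, '<', \$data or die "Cannot open in-memory file: $!";
my @sorted = merge_chunks(write_sorted_chunks($in, 2));
close $in;
print "@sorted\n";
```

Even this toy version needs temp-file handling and a merge loop. A production version would also want a heap for the merge and a limit on open file handles.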

NB: But whether the OP really needs such operations is still unclear!

We are still speculating about what exactly he wanted to have ranked/sorted. (As demonstrated, sorting 14M entries in RAM with Perl is still feasible in under 2 minutes, but how does it scale with larger data?)

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery


Re^6: efficient perl code to count, rank
by haj (Priest) on Jul 19, 2021 at 07:21 UTC

    Yeah, the RAM problem is one which becomes immediately apparent when looking at the code, hence my wording "a first guess". As has already been written in this thread (and demonstrated by tybalt89's code), it can be eliminated by working through the file line by line, so with a small change in code the RAM problem no longer exists. Also, it has nothing to do with sorting; it's just the attempt to slurp a 62GB file into an array. In the followups to the article you quoted, "sorting" isn't even mentioned, because it is irrelevant.
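
    The difference between slurping and working line by line can be sketched as follows. This is not tybalt89's code; the helper name `count_per_column` and the tab-separated input are assumptions made for illustration. Only one line is held in memory at a time, and the per-column tallies grow with the number of distinct values, not with the number of lines.

```perl
use strict;
use warnings;

# Hypothetical helper: tally how often each value occurs in each column,
# reading the file one line at a time instead of slurping it into an array.
sub count_per_column {
    my ($fh, $sep) = @_;
    my @counts;    # $counts[$col]{$value} = number of occurrences
    while (my $line = <$fh>) {
        chomp $line;
        my @fields = split /$sep/, $line;
        $counts[$_]{ $fields[$_] }++ for 0 .. $#fields;
    }
    return \@counts;
}

# Demo on a small in-memory "file"; the tab separator is an assumption.
my $data = "1\t5\n2\t5\n1\t7\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $counts = count_per_column($fh, "\t");
close $fh;

printf "column 0: value 1 seen %d times\n", $counts->[0]{1};
```

    The slurping variant would be `my @lines = <$fh>;`, which is exactly what needs RAM proportional to the file size.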

    We are still speculating what exactly he wanted to be ranked/sorted.

    Looking at the code presented in the original posting should be considered an option. tybalt89 came up with the following explanation, which matches my own interpretation:

    You were doing the ranking sort for each column...
    I'm simply assuming that the OP's code performs the operation they want done, albeit inefficiently. In that code there is not one sort over 14M entries; there are a thousand sorts (one per column). The OP's code does these 1000 sorts 14M times, which is why it won't finish in time, even for small arrays.
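
    The remedy is to sort each column once and then look ranks up per row. Here is a hedged sketch: the helper `rank_column` is hypothetical, and ranking distinct values in descending numeric order is an assumption, since what exactly the OP wants ranked is not settled in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: given the tallies (value => count) of ONE column,
# sort its distinct values once and return a value => rank lookup table.
# Assumption: rank 1 goes to the largest value.
sub rank_column {
    my ($count_of) = @_;
    my %rank_of;
    my $rank = 0;
    $rank_of{$_} = ++$rank for sort { $b <=> $a } keys %$count_of;
    return \%rank_of;
}

# Made-up tallies for two columns, as a counting pass might produce them.
my @counts = (
    { 10 => 3, 30 => 1, 20 => 2 },
    { 5  => 4, 7  => 2 },
);

# One sort per column, done once, outside of any per-row loop.
my @ranks = map { rank_column($_) } @counts;

# Per row, each rank is then a plain hash lookup instead of a fresh sort.
my @row = (20, 7);
my @row_ranks;
push @row_ranks, $ranks[$_]{ $row[$_] } for 0 .. $#row;
print "@row_ranks\n";
```

    With 1000 columns this is 1000 sorts in total rather than 1000 sorts for each of the 14M rows.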

    I hope that the Monks will eventually give tybalt89's article the ranking it deserves.