
Re: efficient perl code to count, rank

by LanX (Sage)
on Jul 17, 2021 at 19:20 UTC ( #11135107 )

in reply to efficient perl code to count, rank

Like others said, you'd need to put serious work into an SSCCE.

From the glimpses I understood, I'd say this kind of work is normally done in a database. (Who needs 14M rows sorted except in a database?)

And I agree with the others that in your Perl solution memory is most likely the bottleneck.

So avoid loading the whole file, and it will be way faster.

Most of what you describe can be easily done without keeping everything in memory, simply by processing line by line.
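A minimal sketch of that streaming approach (the tab-separated layout and toy data are my assumptions, not the OP's format; only the per-column count hashes ever live in memory):

```perl
use strict;
use warnings;

# Toy data standing in for the big file; in real use this would be
# open my $fh, '<', $path or die $!;
my $data = "a\tx\nb\tx\na\ty\n";
open my $fh, '<', \$data or die $!;

# Count occurrences of each value per column, one line at a time.
# Only the count hashes stay in memory, never the file itself.
my @count;                          # $count[$col]{$value} = occurrences
while ( my $line = <$fh> ) {
    chomp $line;
    my @fields = split /\t/, $line;
    $count[$_]{ $fields[$_] }++ for 0 .. $#fields;
}
close $fh;

print "$count[0]{a} $count[1]{x}\n";    # 2 2
```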

BUT sorting is trickier.

A pragmatic way is to keep only the "count" plus an associated line number (resp. a seek position into the unsorted file) in memory for sorting; this will reduce your memory consumption by a factor of your "1100 to 1500 columns".

In a second phase you can then reorder the lines.
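A sketch of that two-phase idea, assuming the sort key is the first tab-separated field (toy in-memory data stands in for the big unsorted file; tell/seek do the revisiting):

```perl
use strict;
use warnings;

# Phase 1: remember only (count, seek position) per line,
# not the line itself.
my $data = "3\tfoo\n1\tbar\n2\tbaz\n";
open my $fh, '<', \$data or die $!;

my @index;                              # [ $count, $byte_offset ]
my $pos = 0;
while ( my $line = <$fh> ) {
    my ($count) = split /\t/, $line, 2;
    push @index, [ $count, $pos ];
    $pos = tell $fh;
}

# Phase 2: sort the small index, then seek back to emit the
# full lines in sorted order.
for my $entry ( sort { $a->[0] <=> $b->[0] } @index ) {
    seek $fh, $entry->[1], 0 or die $!;
    print scalar <$fh>;                 # 1\tbar, then 2\tbaz, then 3\tfoo
}
close $fh;
```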

E.g. my laptop ran the following code in under 2 minutes to sort 14M arrays [ random rank, id ].

    use strict;
    use warnings;
    use Data::Dump qw/pp dd/;

    my @a = sort { $a->[0] <=> $b->[0] }
            map  { [ rand 14e6, $_ ] }
            0 .. 14e6;

    pp [ @a[0..100] ];    # show me the first 100

This included the overhead for swapping; my fan was roaring. But I suppose you have far more RAM at hand.

Otherwise there are surely CPAN modules, like File::Sort (NB: no experience or recommendation!), which can do the heavy lifting for you.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

Replies are listed 'Best First'.
Re^2: efficient perl code to count, rank
by haj (Curate) on Jul 18, 2021 at 18:49 UTC

    It should be noted that at no point does the code sort 14M arrays. The variable @rows actually holds the fields of the current line. The code sorts, for every column, the numbers of occurrences of the different values. While a column could hold 14M different values in 14M lines, that is not the case here: with 14M lines of 1400 fields each, and 62GB in total, the average column has a data width of two bytes. You can only cram so many different values into two bytes (especially if it's text) - that's several orders of magnitude away from 14M and should fit into memory quite easily.

    The sorting problem in the code is, as has been pointed out, that the sorting is done 14M times instead of once.
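    That fix can be sketched like this (the count hash is hypothetical): accumulate counts while reading, then sort each column's counts once at the end, outside the per-line loop.

```perl
use strict;
use warnings;

# Hypothetical count hash for one column, as accumulated while
# reading the file line by line.
my %count = ( x => 5, y => 2, z => 9 );

# Sort ONCE, after reading -- not once per input line.
my @by_freq = sort { $count{$b} <=> $count{$a} } keys %count;
my %rank    = map { $by_freq[$_] => $_ + 1 } 0 .. $#by_freq;

print "$rank{z} $rank{x} $rank{y}\n";    # 1 2 3
```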

      Maybe, maybe not.

      Without example IN and OUT it's hard to guess what an OP really tried to achieve in messy code... 🤷🏾

      That's why I asked for an SSCCE

      And ranking columns looks weird.

      Anyway, the question of how to sort data which doesn't fit into memory is more interesting to me!

      Call it a thread drift, but the monastery is currently not really busy answering interesting questions. ;)

      And I learned a lot about sorting ...

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery


Re^2: efficient perl code to count, rank
by Perl_Noob2021 (Initiate) on Jul 17, 2021 at 22:26 UTC
    Thanks everyone for the comments. Appreciate the discussion. Will look into this. I also saw an article saying I can use Text::CSV_XS, so I will check it out as well, together with File::Sort.
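    For what it's worth, a streaming read with Text::CSV_XS could look like this (toy in-memory data stands in for the real file; only the per-column counts are kept):

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Toy CSV data; in real use, open the actual file instead.
my $data = "a,x\nb,x\na,y\n";
open my $fh, '<', \$data or die $!;

# getline() parses one row at a time, so memory stays flat
# no matter how many rows the file has.
my $csv = Text::CSV_XS->new( { binary => 1, auto_diag => 1 } );
my @count;
while ( my $row = $csv->getline($fh) ) {
    $count[$_]{ $row->[$_] }++ for 0 .. $#$row;
}
close $fh;

print "$count[0]{a} $count[1]{x}\n";    # 2 2
```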
