
efficient perl code to count, rank

by Perl_Noob2021 (Initiate)
on Jul 17, 2021 at 17:34 UTC ( #11135101=perlquestion )

Perl_Noob2021 has asked for the wisdom of the Perl Monks concerning the following question:

Need help with efficient Perl code to count, rank, and get percentages. I have the following code, which has already been running for 6 hours and has not yet completed. The target file is a CSV (all text), about 14 to 14.5 million rows, around 1100 to 1500 columns, and 62 GB in size. A run of 4 hours would be acceptable. What it does:
- do a count (like a COUNTIF in Excel)
- get the percent (based on 14m rows)
- get the rank based on count
Appreciate any help.
$x="Room_reserve.csv";
$in = "D:\\package properties\\${x}.csv";
$in = "D:\\package properties\\${x}.csv";
$out = "D:\\package properties\\output\\${x}_output.csv";

open($fh, '<', $in) or die "Could not open file '$file' $!";
@data = <$fh>;
close($fh);

%counts;
@columns;
$first = 1; #counter

foreach $dat (@data)
{
    chomp($dat);
    @rows = split(',', $dat);
    if ($first == 1)
    {
        $first = 0;
        next;
    }
    else
    {
        $count = 1;
        foreach $i (0..$#rows)
        {
            if ( exists($columns[$i]{$rows[$i]}) )
            {
                $columns[$i]{$rows[$i]}++;
            }
            else
            {
                $columns[$i]{$rows[$i]} = int($count);
            }
        }
    }
}

#output
$first = 1;
open($fh, '>', $out) or die "Could not open file '$file' $!";
foreach $dat (@data)
{
    chomp($dat);
    @rows = split(',', $dat);
    foreach $i (0..$#rows)
    {
        if ($i > 6)   #for modifying name
        {
            if ( $first == 1 )
            {
                $line = join( ",", "Rank_$rows[$i]", "Percent_$rows[$i]", "Count_$rows[$i]", $rows[$i]);
                print $fh "$line,";
                if ( $i == $#rows ) { $first = 0; }
            }
            else
            {
                @dat_val = reverse sort { $a <=> $b } values %{$columns[$i]};
                %ranks = {};
                $rank_cnt = 0;
                foreach $val (@dat_val)
                {
                    if ( ! exists($ranks{$val}) ) { $rank_cnt++; }
                    $ranks{$val} = $rank_cnt;
                }
                $rank = $ranks{$columns[$i]{$rows[$i]}};
                $cnt = $columns[$i]{$rows[$i]};
                $ave = ($cnt / 14000000) * 100;
                $line = join( ",", $rank, $ave, $cnt, $rows[$i]);
                print $fh "$line,";
            }
        }
        else
        {
            print $fh "$rows[$i],";
        }
    }
    print $fh "\n";
}
close($fh);

Replies are listed 'Best First'.
Re: efficient perl code to count, rank
by haj (Curate) on Jul 17, 2021 at 18:05 UTC

    Just a first guess: Do you have enough RAM? You're loading 62gig "raw data" into @data, and if you don't have sufficient RAM then swapping will cause that program to run like a snail.

      I have 16 GiB, would that be enough?

        Well, obviously 16GB is not enough when you're loading 62GB of data into memory. This means lots of swapping. Start your program, and observe the progress with top.

        Let's say your program fills 16GB with data from the disk. There are 46GB more. So it writes the first 16GB to swap. Then, you probably run out of swap space as well (which is usually of the same size as RAM), so the system has to allocate more swap. Phew.

        When the processing loop starts, it starts with the first lines, which it needs to read from disk again (from swap in this case). But since your RAM is already full, it now swaps out the last records. While the loop continues, the first lines will be written to swap again. So you have cyclic re-reading and re-writing, which is the worst thing that can happen to swap (or cache).

        You should read and process record by record, even if that means you have to go through your original file twice.
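        A minimal sketch of what that first, record-by-record counting pass could look like. The sample data is made up here so the sketch is self-contained; in a real run you would open the CSV on disk instead:

        use strict;
        use warnings;

        # First pass: count value occurrences per column, one line at a time,
        # so only the per-column tallies stay in memory -- never the whole file.
        # The in-memory sample below is a placeholder for the real CSV.
        my $csv = <<'END';
        id,room
        1,blue
        2,red
        3,blue
        END

        open my $fh, '<', \$csv or die "Could not open input: $!";
        my @columns;    # $columns[$i]{$value} = count
        my $lines = 0;
        <$fh>;          # skip the header line
        while ( my $row = <$fh> ) {
            chomp $row;
            my @fields = split /,/, $row;
            $columns[$_]{ $fields[$_] }++ for 0 .. $#fields;
            $lines++;
        }
        close $fh;

        print "$lines lines, blue seen $columns[1]{blue} times\n";

        A second pass over the same file can then emit rank/percent/count per field from the tallies in @columns, without ever holding more than one line.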

        Physically reading from disk is slow. SSDs are faster, but still significantly slower than RAM.

Re: efficient perl code to count, rank
by LanX (Cardinal) on Jul 17, 2021 at 19:20 UTC
    Like others said, you'd need to put serious work into an SSCCE.

    From the glimpses I understood, I'd say this kind of work is normally done in a database. (who needs 14m rows sorted except in a database?)

    And I agree with the others that in your Perl solution memory is most likely the bottleneck.

    So avoid loading the whole file and it will be way faster.

    Most of what you describe can be easily done without keeping everything in memory, simply by processing line by line.

    BUT sorting is trickier.

    A pragmatic way is to keep only the "count" plus an associated line number (or seek position into the unsorted file) in memory for sorting; this will reduce your memory consumption by a factor of your "1100 to 1500 columns".

    In a second phase you can reorder the lines then.
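    A sketch of that two-phase idea, with everything (sample data, sort key) made up for illustration: phase one remembers only [ key, byte offset ] pairs, phase two sorts that tiny index and seeks back to emit the lines in order:

    use strict;
    use warnings;
    use File::Temp qw( tempfile );

    # Build a small throwaway file standing in for the big unsorted file.
    my ( $fh, $file ) = tempfile( UNLINK => 1 );
    print $fh "ccc\naa\nb\n";
    close $fh;

    # Phase 1: remember only [ sort key, seek position ] per line --
    # here the key is simply the line length, standing in for a count.
    open my $in, '<', $file or die "Could not open '$file': $!";
    my @index;
    my $pos = tell $in;
    while ( my $line = <$in> ) {
        chomp $line;
        push @index, [ length $line, $pos ];
        $pos = tell $in;
    }

    # Phase 2: sort the tiny index, then seek back to reorder the lines.
    my @sorted;
    for my $entry ( sort { $a->[0] <=> $b->[0] } @index ) {
        seek $in, $entry->[1], 0 or die "seek failed: $!";
        my $line = <$in>;
        chomp $line;
        push @sorted, $line;
    }
    close $in;
    print join( ',', @sorted ), "\n";    # b,aa,ccc

    Only the index lives in RAM; the line data itself is fetched from disk on demand in the second phase.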

    E.g. my laptop ran the following code in under 2 minutes, to sort 14m arrays [ random rank, id ].

    use strict;
    use warnings;
    use Data::Dump qw/pp dd/;

    my @a = sort { $a->[0] <=> $b->[0] }
            map  { [ rand 14e6, $_ ] } 0..14e6;

    pp [ @a[0..100] ];    # show me the first 100

    This included the overhead for swapping, my fan was roaring up. But I suppose you have far more RAM at hand.

    Otherwise there are surely CPAN modules like File::Sort (NB: no experience or recommendation!) which can do the heavy lifting for you.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      It should be noted that at no point does the code sort 14M arrays. The variable @rows actually holds the fields of the current line. The code sorts, for every column, the numbers of occurrences of the different values. While a column could hold 14M different values in 14M lines, that is not the case here: with 14M lines of 1400 fields each, and 62GB in total, the average column has a data width of two bytes. You can only cram so many different values into two bytes (especially if it's text) - that's several orders of magnitude away from 14M and should fit into memory quite easily.

      The sorting problem in the code is, as has been pointed out, that the sorting is done 14M times instead of once.

        Maybe, maybe not.

        Without example IN and OUT it's hard to guess what the OP really tried to achieve in messy code... 🤷🏾

        That's why I asked for an SSCCE.

        And ranking columns looks weird.

        Anyway, the question of how to sort data which doesn't fit into memory is more interesting for me!°

        Call it a thread drift, but the monastery is currently not really busy answering interesting questions. ;)

        And I learned a lot about sorting ...

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

        °) like

      Thanks everyone for the comments. Appreciate the discussion. Will look into this. I also saw an article saying I can use Text::CSV_XS, so I will check that out as well, together with File::Sort.
Re: efficient perl code to count, rank
by haj (Curate) on Jul 17, 2021 at 23:00 UTC

    Let me add another thing: You are sorting over values %{$columns[$i]} for every one of the 14M records, though the value of @columns doesn't change after the first run through @data. You should calculate @dat_val once for the 1400 columns between the two runs through @data.

    From the fact that you output a percentage I guess that you have "about 100" different values in each column. 100-element-sorts are rather fast, but still, 1400 sorts of 100-element-arrays are considerably faster than 14M * 1400 sorts of 100-element arrays.
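    That fix can be sketched like this; the per-column tally below is a made-up sample standing in for what the first pass over @data produces:

    use strict;
    use warnings;
    use List::Util qw( uniq );

    # Assume the first pass produced this per-column tally (sample data).
    my @columns = (
        { foo => 4, foo2 => 2 },
        { bar => 3, bar7 => 2, bax => 1 },
    );

    # Compute each column's rank table ONCE, between the two passes,
    # instead of re-sorting inside the 14M-row output loop.
    my @ranks_by_column;
    for my $tally (@columns) {
        my @counts = sort { $b <=> $a } uniq values %$tally;
        my %rank;
        @rank{@counts} = 1 .. @counts;    # highest count => rank 1
        push @ranks_by_column, \%rank;
    }

    # During the output pass, ranking a field is then just two hash lookups:
    my $cnt  = $columns[0]{foo};             # 4
    my $rank = $ranks_by_column[0]{$cnt};    # 1
    print "foo: count=$cnt rank=$rank\n";

    The 1400 small sorts happen once up front; the 14M-row loop does no sorting at all.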

Re: efficient perl code to count, rank
by tybalt89 (Prior) on Jul 18, 2021 at 18:03 UTC

    Since you didn't provide a small test data set I had to guess about it and then fake it (I may have guessed wrong).

    You were doing the ranking sort for each column for each line when it only had to be done once per column.

    I am reading through the file twice so that the whole file does not have to be stored in memory. Because of that I had to change your file handle names.

    I had to change your I/O because a) I'm not on windows, and b) debug. I think I labeled all those changes with FIXME.

    Of course, comment out the Data::Dump before running on a large file. You'll find Data::Dump is extremely slow on huge data structures.

    As I've said to others here, at least my code does not fail any of your provided test cases :)

    #!/usr/bin/perl
    use strict; # use warnings;
    use List::Util qw( uniq );

    my $x="Room_reserve.csv";
    my $in = "D:\\package properties\\${x}.csv";
    $in = "D:\\package properties\\${x}.csv"; # ????????? duplicate line???
    my $out = "D:\\package properties\\output\\${x}_output.csv";

    print my $testdata = <<END; # FIXME
    x,x,x,x,x,x,x,one,two,three
    x,x,x,x,x,x,x,foo,bar,baz
    x,x,x,x,x,x,x,foo2,bar7,baz3
    x,x,x,x,x,x,x,foo2,bar7,baz3
    x,x,x,x,x,x,x,foo,bax,baz
    x,x,x,x,x,x,x,foo,bar,baz
    x,x,x,x,x,x,x,foo,bar,baz
    END
    $in = \$testdata; #FIXME

    open(my $infh, '<', $in) or die "Could not open file '$in' $!";

    my @columns;
    my $lines = 0;
    <$infh>; # skip first line
    while( <$infh> )
      {
      my @rows = split /[,\n]/;
      $columns[$_]{$rows[$_]}++ for 0..$#rows;
      $lines++;
      }

    my @ranksbycolumn;
    for ( @columns )
      {
      my @dat_val = sort { $b <=> $a } uniq values %$_;
      my %ranks;
      @ranks{ @dat_val } = 1 .. @dat_val;
      push @ranksbycolumn, \%ranks;
      }

    use Data::Dump 'dd'; dd 'columns', \@columns, 'ranksbycolumn', \@ranksbycolumn; # FIXME

    my $first = 1;
    #output
    #open(my $outfh, '>', $out) or die "Could not open file '$out' $!"; #FIXME
    open(my $outfh, ">&STDOUT") or die "Could not open file STDOUT $!"; #FIXME
    seek $infh, 0, 0 or die "Could not seek file '$in' $!";
    while( <$infh> )
      {
      my @rows = split /[,\n]/;
      foreach my $i (0..$#rows)
        {
        if ($i > 6)
          {
          if ( $first == 1 ) # for modifying name
            {
            my $line = join( ",", "Rank_$rows[$i]", "Percent_$rows[$i]",
              "Count_$rows[$i]", $rows[$i]);
            print $outfh "$line,";
            if ( $i == $#rows ) { $first = 0; }
            }
          else
            {
            my $cnt = $columns[$i]{$rows[$i]};
            my $rank = $ranksbycolumn[$i]{$cnt};
    #        my $ave = ($cnt / 14000000) * 100; # FIXME
            my $ave = int +($cnt / $lines) * 100; # FIXME
            my $line = join( ",", $rank, $ave, $cnt, $rows[$i]);
            print $outfh "$line,";
            }
          }
        else
          {
          print $outfh "$rows[$i],";
          }
        }
      print $outfh "\n";
      }
    close $outfh;


    x,x,x,x,x,x,x,one,two,three
    x,x,x,x,x,x,x,foo,bar,baz
    x,x,x,x,x,x,x,foo2,bar7,baz3
    x,x,x,x,x,x,x,foo2,bar7,baz3
    x,x,x,x,x,x,x,foo,bax,baz
    x,x,x,x,x,x,x,foo,bar,baz
    x,x,x,x,x,x,x,foo,bar,baz
    (
      "columns",
      [
        { x => 6 },
        { x => 6 },
        { x => 6 },
        { x => 6 },
        { x => 6 },
        { x => 6 },
        { x => 6 },
        { foo => 4, foo2 => 2 },
        { bar => 3, bar7 => 2, bax => 1 },
        { baz => 4, baz3 => 2 },
      ],
      "ranksbycolumn",
      [
        { 6 => 1 },
        { 6 => 1 },
        { 6 => 1 },
        { 6 => 1 },
        { 6 => 1 },
        { 6 => 1 },
        { 6 => 1 },
        { 2 => 2, 4 => 1 },
        { 1 => 3, 2 => 2, 3 => 1 },
        { 2 => 2, 4 => 1 },
      ],
    )
    x,x,x,x,x,x,x,Rank_one,Percent_one,Count_one,one,Rank_two,Percent_two,Count_two,two,Rank_three,Percent_three,Count_three,three,
    x,x,x,x,x,x,x,1,66,4,foo,1,50,3,bar,1,66,4,baz,
    x,x,x,x,x,x,x,2,33,2,foo2,2,33,2,bar7,2,33,2,baz3,
    x,x,x,x,x,x,x,2,33,2,foo2,2,33,2,bar7,2,33,2,baz3,
    x,x,x,x,x,x,x,1,66,4,foo,3,16,1,bax,1,66,4,baz,
    x,x,x,x,x,x,x,1,66,4,foo,1,50,3,bar,1,66,4,baz,
    x,x,x,x,x,x,x,1,66,4,foo,1,50,3,bar,1,66,4,baz,
      Thanks for showing example in and output!

      So the OP wants to add a rank and percentage to every field of his 1100+ columns in 14m rows and blow up his 62GB file by a factor of 3?

      Interesting ... :)

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        You're Welcome!

        Also count, for a factor of four. Imagine the fun of looking through a 14000000 row by 4400+ column spreadsheet :)

Re: efficient perl code to count, rank
by roboticus (Chancellor) on Jul 17, 2021 at 18:23 UTC


    I took a brief look at your code and can't immediately see why it is so slow. However, your code is difficult to look at.

    I'd suggest first using strict and warnings to ensure that your code is clean. (I identified one minor problem that way.)

    I'd next suggest that you clean up the indentation so that the overall code structure is easy enough to look at so people can identify the loops and nesting level of various statements without having to count braces. You might consider using perltidy to clean up the indentation.

    Since you're running on a largish file (62gig), you might be thrashing (swapping memory in and out of disk) which can cause quite a slowdown in performance. If you're on a windows box, then using the performance view of Task manager would show your memory at 100% full and a high rate of Disk I/O if you're thrashing. If so, you'll want to look at the problem and see if you can handle it in multiple passes, such as processing a subset of the columns each time through.

    Having said all that, I can't make sense of what you're trying to do in your code, so you might try making a tiny dataset as an example, and process it, and verify that it's doing what you're expecting.


    When your only tool is a hammer, all problems look like your thumb.

      The memory is about 80% used, around 6 GB. It does work on a small subset, but when I add "use strict; use warnings", it shows errors and aborts due to compilation errors. Will try to provide cleaner code.
Re: efficient perl code to count, rank
by Tanktalus (Canon) on Jul 17, 2021 at 20:11 UTC

    Simple rule: when you have data, think database. I don't care if it's mysql, postgres, Db2, sqlite, but use a db.

    In this particular case, especially, you'll probably find it runs faster to load the data, ask the db to give you the output, and discard the db than it is to try to do it in memory. Databases are often optimised for this type of data access, and will be a much more efficient (time-wise especially) use of disk than swapping / paging. SQLite probably will suffice, but, if not, mysql and postgres are free, and not difficult to install on most linux distros, and not really much harder on Windows, even without WSL.

    I know, you asked a perl question, and there's not a lick of perl in this answer. But the reality is that while you can do this in any turing-complete language, you still should use the right tool for the job, and, in this case, a database is that right tool.
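    OK, a lick of Perl then: a rough sketch of that workflow with SQLite via DBI (assuming DBD::SQLite is installed; the table, column, and sample values are all made up):

    use strict;
    use warnings;
    use DBI;

    # Load rows into a throwaway SQLite db, then let SQL do the counting.
    my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
        { RaiseError => 1, AutoCommit => 0 } );

    $dbh->do('CREATE TABLE reserve (room TEXT)');
    my $ins = $dbh->prepare('INSERT INTO reserve (room) VALUES (?)');
    $ins->execute($_) for qw( blue red blue blue );
    $dbh->commit;

    # One GROUP BY replaces the hand-rolled counting loop; ORDER BY
    # gives the ranking order for free.
    my $rows = $dbh->selectall_arrayref(
        'SELECT room, COUNT(*) AS cnt FROM reserve
         GROUP BY room ORDER BY cnt DESC'
    );
    printf "%s seen %d times\n", @$_ for @$rows;
    $dbh->disconnect;

    For the real 62GB file you'd bulk-load with SQLite's CSV import rather than row-by-row INSERTs, and use a file-backed db instead of :memory:.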

      I doubt that a database will help here.

      This is a linear process through a large dataset, and through the whole dataset. No query, no selection, no filter, no relations or anything where an index would help, no need for transactions or concurrent access. The hashes which collect the result are good enough as a surrogate for an in-memory database (and they should fit into memory).

      A database comes with significant overhead: You need to INSERT 14 million records before you can even start. This means reading 62GB from disk, piping it through your database socket, and then the database writes it to disk again. And then you SELECT all of them, which means reading from disk and piping them through the database socket again. How would a database compensate for this?

        A database comes with significant overhead: You need to INSERT 14 millions of records before you can even start

        That's not necessarily true: one can (for instance, in postgres) use the SQL-machinery against a so-called foreign table, that uses a text file as underlying data. That means no INSERT is necessary.

        Making a foreign table takes no time (after all, it's really just configuration) but of course any reading or sorting over many GBs will take approximately as long as in any other programming language. The advantage would be access via SQL. (BTW, I'm not saying such access via a database is useful for the OP; he may have the overhead of learning this particular trick).

        (And yes, I did try it out: retrieving a few values from a foreign table that sits on top of a csv-file of 27GB (1250 columns, 2M rows), via

        SELECT column10, column20
        FROM junk.broadcsv
        ORDER BY column100 DESC
        LIMIT 10

        took ~10 minutes on my old 8GB desktop)

        • You don't need to read, pipe, write to disk. Many/most dbs have an import-from-csv option that can read from local disk directly. Yes, it's still a read/write operation, with the added overhead of any required indexing, but we've eliminated the pipe. Not that a local pipe is necessarily slow, but a 62GB reduction in memory copying is still a 62GB reduction in memory copying.
        • If you create the table with the proper indexing at the outset, which means that the load will be a bit slower, then all operations after that will be significantly faster, and be such directly from disk. You don't need to sort anything if the index is already sorted. You don't need to count anything if the db already knows the count.
        • And, finally, the big one, the database will be significantly faster and better optimised for swapping intermediate values to disk and back than the kernel is. Leaving this to the kernel is an option, and it works, as you've found out, but what the kernel swaps to disk may not be exactly what you want swapped to disk. A database will have an entire memory management unit for exactly this purpose, where it reads what it needs from disk (whether the tables or the indexes), discards stuff, and saves intermediate information back to temporary space if necessary. Most likely, though, it won't do this, but will read all of the data possibly multiple times from disk, and discard anything it's not using at the current moment in time. This sounds slow, but the reality is that the original code is also doing this, just more so. The kernel is the code that is writing stuff to disk, whereas a database would likely guess correctly more often. Continuing from here, though, if you have the right indexes on load, the data that the database needs to load may not actually be any of the data from the actual table, it may only be the indexes (which means loading less data per pass, and only the relevant data), and that would allow it to load more at a time, and to possibly not even load any piece of data twice. Maybe. But, with the right query, the database can figure this out.

          If the output data starts to get too big, the database can even persist it to disk to be further manipulated, in a specialised structure (that is, a database table), that is amenable to continued modifications as per all of the above caching and discarding and what have you, until it is done producing the output, and then it can return that data more or less straight from disk.

        Your basic problem is that you've long passed by your system's memory, and you're hitting kernel swapping. Other than that, the rest of your supposition is probably entirely correct. Dealing with this problem is non-trivial, and is one of the things that actual databases are good at.

        Back a couple jobs ago, when I worked for one of the big-three commercial databases, I had access to systems with 256GB RAM. If I were to need to do what you are doing back then on those machines, yeah, perl in-memory likely would have sufficed, and been faster than using the commercial db I had access to (as you rightfully point out, my solution comes with some overhead). But we all have to work within the constraints given to us, and if your system has a "paltry" 16GB RAM, that's your constraint, and you have to find the algorithm that takes that into account. There are such out there, and they've generally been implemented by the database systems, so there's no need to reinvent that wheel, just re-use it.

        Also, when your management likes the output you just produced, they're going to ask for more and more analytics. You just know it's going to happen. Throw-away code is very rarely thrown away. And then re-querying the database for more stuff is going to be entirely trivial.

        > How would a database compensate for this?

        Databases use structures like B-trees for indexing and sorting and are pretty good in balancing these trees between RAM and disk. °

        They optimize the trade-off between memory and time.

        Perl is pretty bad at sorting linear stuff which is part in RAM and part on disk.

        (Well you could tie an AoA to a file representing a table. Sorting that array would go without much RAM but mean constant overhead with each access. This is certainly worth a shot, but as I said DBs have this already optimized.)
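        For the record, a Tie::File sketch of that idea (core module; the data here is a made-up sample). Each array element is fetched from disk on demand, and writes rewrite the file in place:

        use strict;
        use warnings;
        use File::Temp qw( tempfile );
        use Tie::File;

        # Build a small throwaway file standing in for the big CSV.
        my ( $fh, $file ) = tempfile( UNLINK => 1 );
        print $fh "3,c\n1,a\n2,b\n";
        close $fh;

        # Tie the file to an array: elements are read from disk on demand,
        # so only the lines currently touched occupy RAM.
        tie my @lines, 'Tie::File', $file or die "Cannot tie '$file': $!";

        # Reordering via the tied array rewrites the file in place.
        # (Note: sort() still materialises the whole list, so for truly huge
        # files you'd sort an index of keys instead -- this shows the mechanics.)
        @lines = sort { ( split /,/, $a )[0] <=> ( split /,/, $b )[0] } @lines;

        print join( '|', @lines ), "\n";    # 1,a|2,b|3,c
        untie @lines;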

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery


        °) from WP B-tree

          Unlike other self-balancing binary search trees, the B-tree is well suited for storage systems that read and write relatively large blocks of data, such as disks ...

          ... for the purpose of efficiently managing index pages for large random access files. The basic assumption was that indexes would be so voluminous that only small chunks of the tree could fit in main memory.

Re: efficient perl code to count, rank
by eyepopslikeamosquito (Bishop) on Jul 19, 2021 at 03:39 UTC

    The lack of an SSCCE is unfortunate. Without one, we are forced to guess, and I have little confidence in my guesses, based on past experience:

    • Re^4: Fastest way to lookup a point in a set : Surprising use case where BrowserUk found Perl hashes to be way faster than an SQLite memory-based database.
    • Re^2: More Betterer Game of Life : Running time reduced from 1635 seconds to 17 seconds. Tweaking the code, via a long series of micro-optimizations, reduced running time from 1635 secs to 450 secs (3.6 times faster) ... while finding a better algorithm reduced from 450 secs to 17 secs (26.5 times faster).
    • The 10**21 Problem (Part I) : Running time reduced from 50 million years to one year. Many surprises along the way in this long four-part series.

    From these experiences, I learnt the importance of "don't assume, measure" and especially "find a better algorithm".

Re: efficient perl code to count, rank
by karlgoethebier (Abbot) on Jul 18, 2021 at 16:31 UTC

    If you really want to avoid SQLite you could take a look at mce_loop_f from MCE::Loop.

    The name says it all: the _f stands for file.
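    A minimal sketch of that approach (assuming MCE is installed; the tiny input file and worker/chunk settings are placeholders). Each worker tallies its own chunk, and the partial tallies are gathered and merged:

    use strict;
    use warnings;
    use File::Temp qw( tempfile );
    use MCE::Loop;

    # Small throwaway input standing in for the 62GB CSV
    # (header handling omitted for brevity).
    my ( $tmp, $file ) = tempfile( UNLINK => 1 );
    print $tmp "blue\nred\nblue\nblue\n";
    close $tmp;

    MCE::Loop->init( max_workers => 2, chunk_size => 2, use_slurpio => 1 );

    # Each worker counts values in its own chunk; gather returns the
    # per-worker partial tallies to the manager process.
    my @parts = mce_loop_f {
        my ( $mce, $slurp_ref, $chunk_id ) = @_;
        my %local;
        $local{$_}++ for split /\n/, $$slurp_ref;
        MCE->gather( \%local );
    } $file;

    MCE::Loop->finish;

    # Merge the per-worker tallies into one hash.
    my %counts;
    for my $part (@parts) {
        $counts{$_} += $part->{$_} for keys %$part;
    }
    print "blue=$counts{blue} red=$counts{red}\n";

    The same merge step works for the OP's per-column tallies: gather one hash-of-hashes per worker and sum them.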

    «The Crux of the Biscuit is the Apostrophe»


Node Type: perlquestion [id://11135101]
Front-paged by haukex