
Efficient way to sum columns in a file

by sk (Curate)
on Apr 13, 2005 at 04:51 UTC ( #447263=perlquestion )
sk has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

I was watching someone trying to find the sum of a particular column in a file. He was loading the file into a data application (one meant to work with large files) just to calculate this sum.

It takes a while to set up the app, as you have to read the entire file, give the fields variable names, etc. So I wrote this one-liner, which worked great (and saved a lot of time).

cut -d, -f7 in.csv | perl -nle '$sum += $_; print "Sum = $sum" if eof;'

I could avoid the "cut", but I didn't see a huge advantage (if someone can give me good reasons to avoid cut, that would be nice!).

Please note that the file is pretty large (around 5 million rows and a few hundred columns). Since it worked out well, that person asked me how to modify the code to make it work for 5 columns. I immediately used an array (the list returned by split /,/) and looped through it to sum the columns as each new row came in. Little did I realize at the time that this would have horrible performance. After letting it run for a few minutes, I realized that looping that many times (millions!) is not such a good idea. Maybe just declaring 5 variables would have been better.

So my question is how would Monks handle such a problem?

Thanks all for your time!


Replies are listed 'Best First'.
Re: Efficient way to sum columns in a file
by dave0 (Friar) on Apr 13, 2005 at 05:08 UTC
    If you want to get rid of the 'cut', you can do it all in Perl using the -a and -F switches (see perlrun for details) to autosplit the input line into the @F array:
    perl -lanF, -e '$sum += $F[6]; print "Sum is $sum" if eof' in.csv
      Thanks dave! I have never used the -a and -F switches before. I shall remember to use this instead of my split /,/;. However, I am not sure whether Perl's split is more efficient than cut... I guess I can do a quick field test on a large file!


Re: Efficient way to sum columns in a file
by tlm (Prior) on Apr 13, 2005 at 11:55 UTC

    By testing eof() instead of eof in dave0's one-liner, you can extend its functionality to sum across multiple files. Like this:

    # First we make two test files
    % perl -le 'print join(",", map int(rand(1000)), 1..10) for 1..50' > in1.csv
    % perl -le 'print join(",", map int(rand(1000)), 1..10) for 1..50' > in2.csv
    % head -3 in1.csv
    94,434,249,267,649,367,572,579,498,369
    452,735,420,543,832,198,28,86,67,382
    801,339,978,859,85,719,758,89,191,377

    # Now we run dave0's original one-liner on them
    % perl -lanF, -e '$sum += $F[6]; print "Sum is $sum" if eof' in1.csv in2.csv
    Sum is 23789
    Sum is 48874

    # Now let's try a small change: eof() instead of eof
    % perl -lanF, -e '$sum += $F[6]; print "Sum is $sum" if eof()' in1.csv in2.csv
    Sum is 48874
    What happens is that eof (no parens!) returns true at the end of each file in @ARGV, while eof() is true only at the end of the last one (see eof). In the case of a single input file, the behavior is the same as before.

    To get subtotals, you'd do

    % perl -lanF, -e '$total += $F[6]; $sub += $F[6]; print "Subtotal: $sub" and $sub = 0 if eof; print "Total: $total" if eof()' in1.csv in2.csv
    Subtotal: 23789
    Subtotal: 25085
    Total: 48874

    An alternative to testing for eof() is to put everything in an END block, e.g.

    % perl -lanF, -e '$sum += $F[6]; END{ print "Sum is $sum" }' in1.csv in2.csv

    Update: Bug in last one-liner corrected.

    the lowliest monk

      I don't think the '$total += $sub += $F[6];' part is doing what you expect it to... Actually, it doesn't add $F[6] to both $total and $sub; rather, it adds $F[6] to $sub and then adds $sub to $total, which generates a wrong total.

        The totals were obviously wrong; that should have alerted me to the error. Thank you very much!

        the lowliest monk

Re: Efficient way to sum columns in a file
by Random_Walk (Prior) on Apr 13, 2005 at 12:58 UTC

    I generated 500,000 lines of random CSV with this script

    #!/usr/bin/perl
    use strict;
    use warnings;

    # create source numbers if they don't exist
    my $many   = 500000;
    my $source = 'numbers.csv';
    open CSV, '>', $source or die "can't write to $source: $!\n";
    for (1..$many) {
        my @numbers;
        push @numbers, (int rand 1000) for (1..5);
        print CSV join ",", @numbers;
        print CSV $/;
    }

    Then I tried a few one-liners to sum the columns. I ran each twice and post the second timing, to allow for caching.

    nph> time cat numbers.csv | perl -nle '@d=split /,/; $a[$_]+=$d[$_] for (0..4); END{print join "\t", @a}'
    249959157   249671314   249649377   250057435   249420634

    real 0m17.10s
    user 0m15.46s
    sys  0m0.08s

    nph> time perl -nle 'my @d=split /,/; $a[$_]+=$d[$_] for (0..4); END{print join "\t", @a}' numbers.csv
    249959157   249671314   249649377   250057435   249420634

    real 0m13.71s
    user 0m12.77s
    sys  0m0.04s

    nph> time perl -nle 'my ($a,$b,$c,$d,$e)=split /,/; $ta+=$a, $tb+=$b, $tc+=$c, $td+=$d, $te+=$e; END{print join "\t", $ta,$tb,$tc,$td,$te}' numbers.csv
    249959157   249671314   249649377   250057435   249420634

    real 0m6.45s
    user 0m5.91s
    sys  0m0.07s

    The last one was consistently faster after several attempts with it and the second.


    Pereant, qui ante nos nostra dixerunt!
      Thanks all for your comments! As expected, the looping idea is very slow (per Random_Walk's results), and I guess we are better off generating another Perl script with as many variables as there are columns. This might not look pretty, but it seems to be the most efficient way to do it.

      Also, I was curious to see the relative impact of cut versus Perl's split.

      So I tested these two commands on a 500K-row file generated with Random_Walk's code, except that I output 25 columns instead of the original 5, and cut out 15 columns for testing:

      [sk]% time cut -d, -f"1-15" numbers.csv > out.csv
      5.670u 0.340s 0:06.27 95.8%

      [sk]% time perl -lanF, -e 'print join ",", @F[0..14];' numbers.csv > out.csv
      31.950u 0.200s 0:32.26 99.6%

      I am surprised that Perl's split is *very* slow compared to the UNIX built-in cut. Is this because Perl's split does a lot more than Unix's cut? I see a lot of use cases for Perl in handling large files, but if parsing is a bottleneck then I need to be careful about when to use it.

      Thanks again, everyone! I enjoyed reading the replies. Especially, I liked the explanations of eof and eof() (a very good example to demonstrate the difference) and also the END {} idea :)



      Although still slower than having separate variables for each column, this was a bit faster (and shorter) than your middle test:
      time perl -nle 'my $i=0; $a[$i++]+=$_ for (split /,/); END{print join "\t", @a}' numbers.csv
      At only about 50% slower than the fastest solution, its brevity and adaptability might recommend it.

      Caution: Contents may have been coded under pressure.
Re: Efficient way to sum columns in a file
by eibwen (Friar) on Apr 13, 2005 at 06:51 UTC
    With regard to your creation of an array and subsequent iteration, it is possible to index the list returned by split, avoiding the iteration entirely:

    $sum += (split(/,/))[5];

    While indexing in this fashion is useful, I think the @F method would be faster in this case.

Node Type: perlquestion [id://447263]
Approved by davido