
Re: Efficient way to sum columns in a file

by Random_Walk (Prior)
on Apr 13, 2005 at 12:58 UTC

in reply to Efficient way to sum columns in a file

I generated 500,000 lines of random CSV with this script

#!/usr/bin/perl
use strict;
use warnings;

# create source numbers if they don't exist
my $many   = 500000;
my $source = 'numbers.csv';
open CSV, '>', $source or die "can't write to $source: $!\n";
for (1 .. $many) {
    my @numbers;
    push @numbers, (int rand 1000) for (1 .. 5);
    print CSV join ",", @numbers;
    print CSV $/;
}

Then I tried a few one-liners to sum the columns. I ran each twice and post the second timing, to allow for caching.

nph>time cat numbers.csv | perl -nle'@d=split /,/;$a[$_]+=$d[$_] for (0..4);END{print join "\t", @a}'
249959157   249671314   249649377   250057435   249420634

real    0m17.10s
user    0m15.46s
sys     0m0.08s

nph>time perl -nle'my @d=split /,/;$a[$_]+=$d[$_] for (0..4);END{print join "\t", @a}' numbers.csv
249959157   249671314   249649377   250057435   249420634

real    0m13.71s
user    0m12.77s
sys     0m0.04s

nph>time perl -nle'my($a,$b,$c,$d,$e)=split /,/;$ta+=$a, $tb+=$b, $tc+=$c, $td+=$d, $te+=$e;END{print join "\t", $ta,$tb,$tc,$td,$te}' numbers.csv
249959157   249671314   249649377   250057435   249420634

real    0m6.45s
user    0m5.91s
sys     0m0.07s

The last one was consistently faster than the second over several runs.


Pereant, qui ante nos nostra dixerunt!

Re^2: Efficient way to sum columns in a file
by sk (Curate) on Apr 13, 2005 at 18:12 UTC
    Thanks all for your comments! As expected, the looping idea is very slow (per Random_Walk's results), and I guess we are better off "generating" another Perl script with one variable per column, based on the number of columns required. This might not look pretty, but it seems to be the most efficient way to do it.

    Also, I was curious to compare the performance of cut and Perl's split.

    So I tested these two commands on a 500K-line file generated using Random_Walk's code, except that I output 25 columns instead of 5, and cut out 15 columns for testing.

    [sk]% time cut -d, -f"1-15" numbers.csv > out.csv
    5.670u 0.340s 0:06.27 95.8%

    [sk]% time perl -lanF, -e 'print join ",", @F[0..14];' numbers.csv > out.csv
    31.950u 0.200s 0:32.26 99.6%

    I am surprised that Perl's split is *very* slow compared to the Unix cut built-in. Is this because Perl's split does a lot more than cut does? I see a lot of use cases for Perl in handling large files, but if parsing is a bottleneck then I need to be careful about when to use it.

    Thanks again everyone! I enjoyed reading the replies. Especially, I liked the explanations of eof versus eof() (a very good example to demonstrate the difference) and also the END {} idea :)



Re^2: Efficient way to sum columns in a file
by Roy Johnson (Monsignor) on Apr 13, 2005 at 20:40 UTC
    Although still slower than having separate variables for each column, this was a bit faster (and shorter) than your middle test:
    time perl -nle'my $i=0;$a[$i++]+=$_ for (split/,/);END{print join "\t", @a}' numbers.csv
    At only about 50% slower than the fastest solution, its brevity and adaptability might recommend it.

    Caution: Contents may have been coded under pressure.
