Averaging Elements in Array of Array

by neversaint (Deacon)
on Dec 26, 2008 at 09:54 UTC

neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,
Given an array of arrays like this (in practice there are millions of such sets):
my @aoa = ( [0, 3, 2, 1], [1, 11, 1, 2], [5, -2, 0, 1], );
is there a fast way to compute the average of the values at the same index across the sub-arrays? The result would be:
my @res = (2, 4, 1, 1.333);


---
neversaint and everlastingly indebted.......

Replies are listed 'Best First'.
Re: Averaging Elements in Array of Array
by ikegami (Pope) on Dec 26, 2008 at 10:01 UTC
    my @sum;
    for my $row (@aoa) {
        for my $col (0 .. $#$row) {
            $sum[$col] += $row->[$col];
        }
    }
    for (@sum) { $_ /= @aoa; }

    Update: Changed $sum /= @aoa; to $_ /= @aoa;
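
    As a quick sanity check, here is a minimal self-contained version of the same snippet; the data and the print line are just for illustration:

    use strict;
    use warnings;

    my @aoa = ( [0, 3, 2, 1], [1, 11, 1, 2], [5, -2, 0, 1] );

    my @sum;
    for my $row (@aoa) {
        for my $col (0 .. $#$row) {
            $sum[$col] += $row->[$col];
        }
    }
    for (@sum) { $_ /= @aoa; }

    print "@sum\n";   # 2 4 1 1.33333333333333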

Re: Averaging Elements in Array of Array
by hda (Hermit) on Dec 26, 2008 at 13:44 UTC
    Hi!

    Yes, there is a fast way, using PDL (the Perl Data Language):

    http://pdl.perl.org/

    Namely you can import your data into a piddle (a PDL object), slice the column and then perform "stats" over that:
    use PDL;
    $piddle = pdl @your_array;
    $a = slice $piddle, "$column,:";
    print stats $a;
    Here "$column,:" specifies that you want the whole y range (:) of column $column to be dumped into $a. You can, for example, iterate over all the columns you want. Other, more advanced uses are certainly possible, such as taking all the averages at the same time, but those are beyond my current knowledge.
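
    For what it's worth, a minimal sketch of the "all averages at once" idea, assuming (as the benchmark further down also does) that average(), like sumover, reduces over the first dimension:

    use PDL;

    my @aoa = ( [0, 3, 2, 1], [1, 11, 1, 2], [5, -2, 0, 1] );

    # dim 0 runs along each inner array, dim 1 across the rows
    my $piddle = pdl \@aoa;

    # make the rows the first dimension, then let average() reduce over it
    my $col_means = $piddle->transpose->average;

    print $col_means, "\n";    # something like [2 4 1 1.3333333]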

    Hope this helps
      I thought so too, but here's the benchmark for it:
      #!/usr/bin/perl
      use strict;
      use warnings;
      use Benchmark qw/cmpthese/;
      use PDL::LiteF;

      my @data = (
          [ 0, 3,  2, 1 ],
          [ 1, 11, 1, 2 ],
          [ 5, -2, 0, 1 ],
      );

      # It's not fair to make the conversion every time.
      my $pdldata = pdl @data;

      sub using_array {
          my @data = @_;
          my @sums;
          for my $i ( 0 .. $#data ) {
              $sums[0] += $data[$i][0];
              $sums[1] += $data[$i][1];
              $sums[2] += $data[$i][2];
              $sums[3] += $data[$i][3];
          }
          $sums[$_] /= @data for 0 .. 3;
          return @sums;
      }

      sub using_pdl {
          my $pdldata = shift;
          $pdldata /= $pdldata->getdim(1);
          return $pdldata->transpose->sumover;
      }

      cmpthese(
          100000,
          {
              'Array-based' => sub { using_array(@data) },
              'PDL-based'   => sub { using_pdl($pdldata) },
          }
      );
      Result:
                        Rate PDL-based Array-based
      PDL-based    36496/s         --        -67%
      Array-based 111111/s       204%          --
      Apparently, for a dataset of this size (3 by 4) it's not worth it to use PDL. The good thing, though, is that it can be applied to a two-dimensional piddle of arbitrary size without modifying the subroutine.

      I suppose that PDL scales much better, though; I've used it for multidimensional piddles of 1e7 elements with a 50-fold increase in speed over a traditional array-based implementation.

        Bruno, you are completely right: in this case the use of PDL is justified only when working with large arrays. I just assumed that neversaint's array was only an example and that the real problem was a bit more complicated.
        Building on the shoulders of giants, here's a more generalized benchmarker. It shows that while the PDL approach is slower for a 5x5 matrix, it quickly becomes the choice for speed as the matrix size grows. For example, given a 30x30 matrix, one can average its columns with the PDL method seven times faster than with conventional methods. Imagine the gains with dimensions in the hundreds or thousands.
        #!/usr/bin/perl
        use strict;
        use warnings;
        use Benchmark qw/cmpthese/;
        use PDL::LiteF;

        my @number_of_arrays = qw(5 15 30);
        my @size_of_arrays   = qw(5 15 30);
        my $iterations       = 50000;
        my $max_integer      = 100;

        benchmark_it( \@number_of_arrays, \@size_of_arrays, $max_integer );

        #----------------
        sub benchmark_it {
            my $number_of_arrays   = shift;
            my $size_of_arrays     = shift;
            my $max_random_integer = shift;

            for my $number ( @{$number_of_arrays} ) {
                for my $size ( @{$size_of_arrays} ) {
                    my $data = build_random_array( $number, $size, $max_random_integer );
                    my $pdldata = pdl $data;
                    print "Results when number of arrays is $number and size of each array is $size:\n";
                    cmpthese(
                        $iterations,
                        {
                            'Array-based' => sub { using_array($data) },
                            'PDL-based'   => sub { using_pdl($pdldata) },
                            'Map-based'   => sub { using_map($data) },
                        }
                    );
                    print "\n";
                }
            }
        }

        sub using_array {
            my $data = shift;
            my @sums;
            my $last_row_index    = scalar @{$data} - 1;
            my $last_column_index = scalar @{ $data->[0] } - 1;
            for my $i ( 0 .. $last_row_index ) {
                for my $j ( 0 .. $last_column_index ) {
                    $sums[$j] += $data->[$i][$j];
                }

                # Hard-coded indices run faster.
                # $sums[0] += $data[$i][0];
                # $sums[1] += $data[$i][1];
                # $sums[2] += $data[$i][2];
                # $sums[3] += $data[$i][3];
            }
            $sums[$_] /= ( $last_row_index + 1 ) for 0 .. $last_column_index;
            return @sums;
        }

        sub using_map {
            my $data = shift;
            my $range_max = scalar @{ $data->[0] } - 1;
            my @sums;
            map {
                for my $j ( 0 .. $range_max ) {
                    $sums[$j] += $_->[$j];
                }
            } @{$data};
            return \@sums;
        }

        sub using_pdl {
            my $pdldata = shift;
            $pdldata /= $pdldata->getdim(1);
            return $pdldata->transpose->sumover;
        }

        sub build_random_array {
            my $number_of_arrays = shift || 10;
            my $size_of_arrays   = shift || 10;
            my $max_integer      = shift || 100;
            my $data;
            foreach my $i ( 1 .. $number_of_arrays ) {
                my @random_array;
                push @random_array, int rand( $max_integer + 1 )
                    for ( 1 .. $size_of_arrays );
                push @{$data}, \@random_array;
            }
            return $data;
        }

        __END__

        =head1 Synopsis

        Compare PDL to more conventional methods of finding the average of the
        column vectors in a 2D matrix.

        =head1 Results

        My results on December 27, 2008

        Results when number of arrays is 5 and size of each array is 5:
                       Rate PDL-based Array-based Map-based
        PDL-based   42017/s        --        -50%      -57%
        Array-based 84746/s      102%          --      -14%
        Map-based   98039/s      133%         16%        --

        Results when number of arrays is 30 and size of each array is 30:
                       Rate Array-based Map-based PDL-based
        Array-based  3987/s          --      -17%      -89%
        Map-based    4808/s         21%        --      -86%
        PDL-based   35461/s        789%      638%        --

        =head1 Notes

        Note that when the matrix is small, 5x5, PDL is slower, but as the size of
        the matrix grows, PDL becomes smoking hot from its speed.

        It's nice to see the recent development activity with PDL.

        =cut
Re: Averaging Elements in Array of Array
by JavaFan (Canon) on Dec 26, 2008 at 12:16 UTC
    Well, it's trivial to do it in linear time, and obviously you have to consider all elements, so there isn't much to gain here.

    Except perhaps doing it all in C.

      There are some gains to be had. This'll run in about 1/3rd the time of ikegami's version above for 1e6 elements, and proportionally more as the size increases:

      my @sums;
      for my $i ( 0 .. $#data ) {
          $sums[ 0 ] += $data[$i][ 0 ];
          $sums[ 1 ] += $data[$i][ 1 ];
          $sums[ 2 ] += $data[$i][ 2 ];
          $sums[ 3 ] += $data[$i][ 3 ];
      }
      $sums[ $_ ] /= @data for 0 .. 3;

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Dear BrowserUk,
        Your method is fast indeed. Is there a way I can modify your code so that it can handle sub-arrays of any length (i.e. more than 4 elements)?

        ---
        neversaint and everlastingly indebted.......
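
        A sketch of one possible answer (not from the thread): drop the hard-coded indices in favour of the length of the first row, assuming every row has the same number of columns:

        my @sums;
        my $last = $#{ $data[0] };          # assumes all rows are the same length
        for my $i ( 0 .. $#data ) {
            for my $j ( 0 .. $last ) {
                $sums[$j] += $data[$i][$j];
            }
        }
        $sums[$_] /= @data for 0 .. $last;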
Re: Averaging Elements in Array of Array
by Anonymous Monk on Dec 27, 2008 at 00:45 UTC
    you could try something like this
    for(my $i=0; $i < scalar(@aoa); $i++) {
        for(my $x=0; $x < scalar($aoa[$i]); $x++) {
            $hash{$i} = $hash{$i} + $aoa[$i][$x];
        }
    }

    my @res = ();
    foreach my $key (sort keys %hash) {
        push(@res, (($hash{$key}) / 4));
    }
    If you don't know the length of the aoa then you could make the '4' in '/ 4' into a var. Or you could do a hash that has more of a struct feel to it like:
    $hash{$i}{'total'} = <total of the one column of values>
    $hash{$i}{'array_length'} = #

    Not sure how much faster this would be than what you have, but it should be fairly fast since you only go through the array of arrays once and then compute the average once.

    George in NC
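
    A rough sketch of that struct-like hash idea, keyed here by column index (an assumption; the post doesn't spell out the key):

    my %hash;
    for my $row (@aoa) {
        for my $x ( 0 .. $#{$row} ) {
            $hash{$x}{'total'}        += $row->[$x];
            $hash{$x}{'array_length'} += 1;
        }
    }
    my @res = map { $hash{$_}{'total'} / $hash{$_}{'array_length'} }
              sort { $a <=> $b } keys %hash;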

      This is a slower and less readable version of what I posted 15 hours ago.

      • $aoa[$i] (in $x < scalar($aoa[$i])) is buggy. It should be @{ $aoa[$i] }
      • Your code (specifically your use of sort) fails when there are more than 10 columns in each row.
      • Why do you hardcode "4" in one place but use "scalar(@aoa)" in another? That should tell you something's wrong.
      • There's no reason to use a hash here. It's not appropriate, and it just slows things down for nothing.
      • "hash" is an awful name for a variable.
      • $x = $x + $y; can be written more concisely as $x += $y;.
      • scalar(@aoa) can be written more concisely as @aoa if the context is already scalar.
      • for (my $i=0; $i<@aoa; $i++) can be written more concisely as for my $i (0..$#aoa).

      Finally, your post formatting is buggy. Put code and other preformatted data in <c>...</c> tags.

        I apologize... it was more of an idea than an actual piece of code. I didn't see it written anywhere in the other pieces, so I thought I'd throw it out there. If someone is interested in finding the best way to parse something, then it's best to see or try as many options as possible before determining what is best. The code can definitely be made more concise or better, and no, I wouldn't use this specific code in a program, but it would be similar. The '4', along with any other part that is not to your liking, can be modified depending on the needs of the app or the desire of the programmer. That's the great thing about Perl... there are twenty ways to do something and twenty different opinions on what is best...
