neversaint has asked for the wisdom of the Perl Monks concerning the following question:
Dear Masters,
Given an array of array like this (in practice
there are millions of such sets):
my @aoa = (
[0, 3, 2, 1],
[1, 11, 1, 2],
[5, 2, 0, 1],
);
is there a fast way to compute the average
of value from element of same index in the subarray.
Resulting:
my @res = (2, 4, 1, 1.333);

neversaint and everlastingly indebted.......
Re: Averaging Elements in Array of Array
by ikegami (Pope) on Dec 26, 2008 at 10:01 UTC

my @sum;
for my $row (@aoa) {
for (@{$aoa[$row]}) {
$sum[$col] += $_;
}
}
for (@sum) {
$_ /= @aoa;
}
Update: Changed $sum /= @aoa; to $_ /= @aoa;  [reply] [d/l] [select] 
Re: Averaging Elements in Array of Array
by hda (Hermit) on Dec 26, 2008 at 13:44 UTC

Hi!
Yes, there is a fast way, using PDL (Perl Data language):
http://pdl.perl.org/
Namely you can import your data into a piddle (a PDL object), slice the column and then perform "stats" over that:
use PDL;
$piddle = pdl @your_array;
$a = slice $piddle, '$column,:';
print stats $a;
Where '$column,:' specifies that you want the whole y range (:) of column $column to be dumped into $a. You can, for example iterate over all columns that you want. other more advanced uses are for sure possible, like taking all averages at the same time, but these are beyond my current knowledge.
Hope this helps  [reply] [d/l] 

I thought so too, but here's the benchmark for it:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw/cmpthese/;
use PDL::LiteF;
my @data = ( [ 0, 3, 2, 1 ], [ 1, 11, 1, 2 ], [ 5, 2, 0, 1 ], );
# It's not fair to make the conversion every time.
my $pdldata = pdl @data;
sub using_array {
my @data = @_;
my @sums;
for my $i ( 0 .. $#data ) {
$sums[0] += $data[$i][0];
$sums[1] += $data[$i][1];
$sums[2] += $data[$i][2];
$sums[3] += $data[$i][3];
}
$sums[$_] /= @data for 0 .. 3;
return @sums;
}
sub using_pdl {
my $pdldata = shift;
$pdldata /= $pdldata>getdim(1);
return $pdldata>transpose>sumover;
}
cmpthese(
100000,
{ 'Arraybased' => sub { using_array(@data) },
'PDLbased' => sub { using_pdl($pdldata) },
}
);
Result:
Rate PDLbased Arraybased
PDLbased 36496/s  67%
Arraybased 111111/s 204% 
Apparently, for a dataset of this size (3 by 4) it's not worth it to use PDL. The good thing though, is that it can be applied for a bidimensional piddle of an arbitrary size without modifications of the subroutine. I suppose that PDL scales much better though, I've used it for multidimensional piddles of 1e7 elements with a 50fold increase in speed over a traditional arraybased implementation.  [reply] [d/l] [select] 

Bruno, you are completely right: in this case the use PDL is justified when working with large arrays. I just supposed that neversaint's array was just an example and that the real problem was a bit more complicated.
 [reply] 

Building the shoulders of giants here's a more generalized benchmarker. It shows that while the PDL approach is slower for a 5x5 matrix, it quickly becomes the choice for speed of computation as the matrix size grows. For example, given a 30x30 matrix one can average the columns of it using PDL method 7 times faster than with conventional methods. Imagine the gains with dimensions in the hundreds or thousands.
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw/cmpthese/;
use PDL::LiteF;
my @number_of_arrays = qw(5 15 30);
my @size_of_arrays = qw(5 15 30);
my $iterations = 50000;
my $max_integer = 100;
benchmark_it( \@number_of_arrays, \@size_of_arrays, $max_integer );
#
sub benchmark_it {
my $number_of_arrays = shift;
my $size_of_arrays = shift;
my $max_random_integer = shift;
for my $number ( @{$number_of_arrays} ) {
for my $size ( @{size_of_arrays} ) {
my $data =
build_random_array( $number, $size, $max_random_integer
+);
my $pdldata = pdl $data;
print
"Results when number of arrays is $number and size of each array is $s
+ize:\n";
cmpthese(
$iterations,
{
'Arraybased' => sub { using_array($data) },
'PDLbased' => sub { using_pdl($pdldata) },
'Mapbased' => sub { using_map($data) },
}
);
print "\n";
}
}
}
sub using_array {
my $data = shift;
my @sums;
my $last_row_index = scalar @{$data}  1;
my $last_column_index = scalar @{ $data>[0] }  1;
for my $i ( 0 .. $last_row_index  1 ) {
for my $j ( 0 .. $last_column_index ) {
$sums[$j] += $data>[$i][$j];
}
# Hardcoded indices run faster.
# $sums[0] += $data[$i][0];
# $sums[1] += $data[$i][1];
# $sums[2] += $data[$i][2];
# $sums[3] += $data[$i][3];
}
$sums[$_] /= ( $last_row_index + 1 ) for 0 .. $last_column_index;
return @sums;
}
sub using_map {
my $data = shift;
my $range_max = scalar @{ $data>[0] }  1;
my @sums;
map {
for my $j ( 0 .. $range_max )
{
$sums[$j] += $_>[$j];
}
} @{$data};
return \@sums;
}
sub using_pdl {
my $pdldata = shift;
$pdldata /= $pdldata>getdim(1);
return $pdldata>transpose>sumover;
}
sub build_random_array {
my $number_of_arrays = shift  10;
my $size_of_arrays = shift  10;
my $max_integer = shift  100;
my $data;
foreach my $i ( 1 .. $number_of_arrays ) {
my @random_array;
push @random_array, int rand( $max_integer + 1 )
for ( 1 .. $size_of_arrays );
push @{$data}, \@random_array;
}
return $data;
}
__END__
=head1 Synopsis
Compare PDL to more conventional methods of finding the average of the
+ column
vectors in a 2D matrix.
=head1 Results
My results on December 27, 2008
Results when number of arrays is 5 and size of each array is 5:
Rate PDLbased Arraybased Mapbased
PDLbased 42017/s  50% 57%
Arraybased 84746/s 102%  14%
Mapbased 98039/s 133% 16% 
</pre>
Results when number of arrays is 30 and size of each array is 30:
Rate Arraybased Mapbased PDLbased
Arraybased 3987/s  17% 89%
Mapbased 4808/s 21%  86%
PDLbased 35461/s 789% 638% 
=head1 Notes
Note when the matrix is small, 5x5, PDL is slower but as the size of t
+he matrix grows, PDL becomes smokin hot from it's speed.
It's nice to see the recent development activity with PDL.
=cut
 [reply] [d/l] 

Re: Averaging Elements in Array of Array
by JavaFan (Canon) on Dec 26, 2008 at 12:16 UTC

 [reply] 

my @sums;
for my $i ( 0 .. $#data ) {
$sums[ 0 ] += $data[$i][ 0 ];
$sums[ 1 ] += $data[$i][ 1 ];
$sums[ 2 ] += $data[$i][ 2 ];
$sums[ 3 ] += $data[$i][ 3 ];
}
$sums[ $_ ] /= @data for 0 .. 3;
Examine what is said, not who speaks  Silence betokens consent  Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
 [reply] [d/l] 

 [reply] 




Re: Averaging Elements in Array of Array
by Anonymous Monk on Dec 27, 2008 at 00:45 UTC

you could try something like this
for(my $i=0;$i< scalar(@aoa);$i++) {
for(my $x=0;$x < scalar($aoa$i);$x++) {
$hash{$i} = $hash{$i} + $aoa$i$x;
}
}
my @res = ();
foreach my $key (sort keys %hash) {
push(@res,(($hash{$key}) / 4));
}
If you don't know the length of the aoa then you could make the '4' in '/ 4' into a var. Or you could do a hash that has more of a struct feel to it like:
$hash{$i}{'total'} = <total of the one column of values>
$hash{$i}{'array_length'} = #
Not sure how much faster this would be than what you have, but it should be fairly fast since you only go through the array of arrays once and then compute the average once.
George in NC  [reply] 

This is a slower and less readable version of what I posted 15 hours ago.
 $aoa[$i] (in $x < scalar($aoa[$i])) is buggy. It should be @{ $aoa[$i] }
 Your code (specifically your use of sort) fails when there are more than 10 columns in each row.
 Why do you hardcode "4" in one place but use "scalar(@aoa)" in another? That should tell you something's wrong.
 There's no reason to use a hash here. It's not appropriate, and it just slows things down for nothing.
 "hash" is an awful name for a variable.
 $x = $x + $y; can be written more concisely as $x += $y;.
 scalar(@aoa) can be written more concisely as @aoa if the context is already scalar.
 for (my $i=0; $i<@aoa; $i++) can be written more concisely as for my $i (0..$#aoa).
Finally, your post formatting is buggy. Put code and other preformatted data in <c>...</c> tags.
 [reply] [d/l] [select] 

I apologize.. it was more of an idea than an actual peice
of code. I didn't see it written anywhere in the other
peices, so i thought i'd throw it out there. If someone is
interested in finding the best way to parse something
then it's best to see or try as many options as possible before determining what is best.
The code can definately be more concise or made better and no, I wouldn't use the specific code in a program, but it would be similar. The '4' along with any other part that is not to your liking can be modified depending on the needs of the app or desire of the programmer. That's the great thing about perl... there are twenty ways to do something and twenty diff opinions on what is best...
 [reply] 

