PerlMonks  

Re: Averaging Elements in Array of Array

by hda (Chaplain)
on Dec 26, 2008 at 13:44 UTC ( [id://732675] )


in reply to Averaging Elements in Array of Array

Hi!

Yes, there is a fast way, using PDL (the Perl Data Language):

http://pdl.perl.org/

Namely, you can import your data into a piddle (a PDL object), slice out the column, and then run "stats" over it:
    use PDL;
    my $piddle = pdl @your_array;
    my $col    = $piddle->slice("$column,:");   # double quotes so $column interpolates
    print stats $col;

Where "$column,:" selects the whole y range (:) of column $column and dumps it into $col. You can then, for example, iterate over every column you are interested in. Other, more advanced uses are certainly possible, such as taking all the averages at the same time, but those are beyond my current knowledge.
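For comparison, here is what the same column averages look like in plain Perl without PDL, as a minimal sketch (the names @aoa and column_means are made up for illustration; the sample data is the small matrix used in the benchmarks below):

```perl
use strict;
use warnings;

# Compute the mean of every column of an array of arrays,
# the "conventional" counterpart to the PDL slice/stats approach.
sub column_means {
    my @aoa = @_;
    my @sums;
    for my $row (@aoa) {
        $sums[$_] += $row->[$_] for 0 .. $#$row;
    }
    return map { $_ / @aoa } @sums;
}

my @aoa = ( [ 0, 3, 2, 1 ], [ 1, 11, 1, 2 ], [ 5, -2, 0, 1 ] );
my @means = column_means(@aoa);
print "@means\n";    # 2 4 1 1.33333333333333
```

This keeps no dependency on PDL, but as the benchmarks below show, the trade-off flips once the matrix gets large.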

Hope this helps

Replies are listed 'Best First'.
Re^2: Averaging Elements in Array of Array
by bruno (Friar) on Dec 26, 2008 at 14:41 UTC
    I thought so too, but here's the benchmark for it:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw/cmpthese/;
    use PDL::LiteF;

    my @data = (
        [ 0, 3,  2, 1 ],
        [ 1, 11, 1, 2 ],
        [ 5, -2, 0, 1 ],
    );

    # It's not fair to make the conversion every time.
    my $pdldata = pdl @data;

    sub using_array {
        my @data = @_;
        my @sums;
        for my $i ( 0 .. $#data ) {
            $sums[0] += $data[$i][0];
            $sums[1] += $data[$i][1];
            $sums[2] += $data[$i][2];
            $sums[3] += $data[$i][3];
        }
        $sums[$_] /= @data for 0 .. 3;
        return @sums;
    }

    sub using_pdl {
        my $pdldata = shift;
        $pdldata /= $pdldata->getdim(1);
        return $pdldata->transpose->sumover;
    }

    cmpthese(
        100000,
        {
            'Array-based' => sub { using_array(@data) },
            'PDL-based'   => sub { using_pdl($pdldata) },
        }
    );
    Result:
                     Rate   PDL-based Array-based
    PDL-based     36496/s          --        -67%
    Array-based  111111/s        204%          --
    Apparently, for a dataset of this size (3 by 4) it's not worth it to use PDL. The good thing, though, is that the PDL subroutine can be applied to a bidimensional piddle of arbitrary size without modification.

    I suppose that PDL scales much better, though: I've used it for multidimensional piddles of 1e7 elements with a 50-fold increase in speed over a traditional array-based implementation.

      Bruno, you are completely right: in this case the use of PDL is justified only when working with large arrays. I simply assumed that neversaint's array was just an example and that the real problem was a bit more complicated.
      Building on the shoulders of giants, here's a more generalized benchmarker. It shows that while the PDL approach is slower for a 5x5 matrix, it quickly becomes the fastest choice as the matrix size grows. For example, given a 30x30 matrix, one can average its columns with the PDL method about 7 times faster than with conventional methods. Imagine the gains with dimensions in the hundreds or thousands.
      #!/usr/bin/perl
      use strict;
      use warnings;
      use Benchmark qw/cmpthese/;
      use PDL::LiteF;

      my @number_of_arrays = qw(5 15 30);
      my @size_of_arrays   = qw(5 15 30);
      my $iterations       = 50000;
      my $max_integer      = 100;

      benchmark_it( \@number_of_arrays, \@size_of_arrays, $max_integer );

      #----------------
      sub benchmark_it {
          my $number_of_arrays   = shift;
          my $size_of_arrays     = shift;
          my $max_random_integer = shift;
          for my $number ( @{$number_of_arrays} ) {
              for my $size ( @{$size_of_arrays} ) {
                  my $data    = build_random_array( $number, $size, $max_random_integer );
                  my $pdldata = pdl $data;
                  print "Results when number of arrays is $number and size of each array is $size:\n";
                  cmpthese(
                      $iterations,
                      {
                          'Array-based' => sub { using_array($data) },
                          'PDL-based'   => sub { using_pdl($pdldata) },
                          'Map-based'   => sub { using_map($data) },
                      }
                  );
                  print "\n";
              }
          }
      }

      sub using_array {
          my $data = shift;
          my @sums;
          my $last_row_index    = scalar @{$data} - 1;
          my $last_column_index = scalar @{ $data->[0] } - 1;
          for my $i ( 0 .. $last_row_index ) {
              for my $j ( 0 .. $last_column_index ) {
                  $sums[$j] += $data->[$i][$j];
              }

              # Hard-coded indices run faster.
              # $sums[0] += $data[$i][0];
              # $sums[1] += $data[$i][1];
              # $sums[2] += $data[$i][2];
              # $sums[3] += $data[$i][3];
          }
          $sums[$_] /= ( $last_row_index + 1 ) for 0 .. $last_column_index;
          return @sums;
      }

      sub using_map {
          my $data      = shift;
          my $range_max = scalar @{ $data->[0] } - 1;
          my @sums;
          map {
              for my $j ( 0 .. $range_max ) {
                  $sums[$j] += $_->[$j];
              }
          } @{$data};
          return \@sums;
      }

      sub using_pdl {
          my $pdldata = shift;
          $pdldata /= $pdldata->getdim(1);
          return $pdldata->transpose->sumover;
      }

      sub build_random_array {
          my $number_of_arrays = shift || 10;
          my $size_of_arrays   = shift || 10;
          my $max_integer      = shift || 100;
          my $data;
          foreach my $i ( 1 .. $number_of_arrays ) {
              my @random_array;
              push @random_array, int rand( $max_integer + 1 ) for 1 .. $size_of_arrays;
              push @{$data}, \@random_array;
          }
          return $data;
      }

      __END__

      =head1 Synopsis

      Compare PDL to more conventional methods of finding the average of the
      column vectors in a 2D matrix.

      =head1 Results

      My results on December 27, 2008

      Results when number of arrays is 5 and size of each array is 5:
                       Rate   PDL-based Array-based   Map-based
      PDL-based     42017/s          --        -50%        -57%
      Array-based   84746/s        102%          --        -14%
      Map-based     98039/s        133%         16%          --

      Results when number of arrays is 30 and size of each array is 30:
                     Rate Array-based   Map-based   PDL-based
      Array-based  3987/s          --        -17%        -89%
      Map-based    4808/s         21%          --        -86%
      PDL-based   35461/s        789%        638%          --

      =head1 Notes

      Note that when the matrix is small, 5x5, PDL is slower, but as the size of the
      matrix grows, PDL becomes smokin' hot thanks to its speed.

      It's nice to see the recent development activity with PDL.

      =cut
