Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
Hi Monks,
I need your wisdom to improve my "nested for loop". I am trying to compute correlations for about 50,000 categories, each of which has 100 values. Here is a snippet of my code:
#!/tools/bin/perl
use strict;
use warnings;
use List::Util qw(sum);
## Variables and Data Structures
my $count = 1;
my @probesArray;
my %probes;
my $size = 100;
## Reading the file
open my $FILE, "data.txt" or die "ERROR: cannot read file $!\n";
while (my $line = <$FILE>){
    chomp $line;
    my @line = split('\t', $line);
    $probes{$line[0]} = [@line[1 .. $#line]]; ## value of the hash as an array
    $probesArray[$count] = $line[0]; ## correlation between 1-2 or 2-1 will be the same, so calculating only once can be done thru this array
    $count++;
}
close($FILE); ## Frankly speaking, reading of the file takes less than 3 sec with 50,000 categories each having 100 values
## Correlation Calculation
my %calProbes;
for my $i (1 .. $count - 2){           ## probesArray is filled from index 1
    for my $j ($i + 1 .. $count - 1){
        my @probe1 = @{$probes{$probesArray[$i]}}; ## Array of data
        my @probe2 = @{$probes{$probesArray[$j]}};
        my $cor = correlation(\@probe1, \@probe2, \$size); ## correlation is the subroutine
        $calProbes{$probesArray[$i] . "-" . $probesArray[$j]} = $cor;
        # print $probesArray[$i] . "-" . $probesArray[$j], "\t", $cor, "\n";
    }
}
## Subroutines
sub mean {
    my ($arr1, $arr2, $size) = @_;
    my @arr1 = @$arr1;
    my @arr2 = @$arr2;
    my $mu_x = sum(@arr1) / $$size;
    my $mu_y = sum(@arr2) / $$size;
    return ($mu_x, $mu_y);
}
## Sum of Squared Deviations to the mean
sub ss {
    my ($arr1, $arr2, $mean_x, $mean_y) = @_;
    my @arr1 = @$arr1;
    my @arr2 = @$arr2;
    my ($ssxx, $ssxy, $ssyy) = (0) x 3;
    ## looping over all the samples
    for (my $i = 0; $i <= $#arr1; $i++){
        $ssxx += ($arr1[$i] - $mean_x)**2;
        $ssxy += ($arr1[$i] - $mean_x) * ($arr2[$i] - $mean_y);
        $ssyy += ($arr2[$i] - $mean_y)**2;
    }
    return ($ssxx, $ssxy, $ssyy);
}
## Pearson Correlation Coefficient
sub correlation {
    my ($arr1, $arr2, $size) = @_;
    my ($mean_x, $mean_y) = mean($arr1, $arr2, $size);
    my ($ssxx, $ssxy, $ssyy) = ss($arr1, $arr2, $mean_x, $mean_y);
    my $cor = $ssxy / sqrt($ssxx * $ssyy);
    return $cor;
}
The correlation calculation is taking a lot of time: computing category1 against all the others (i.e. 50,000) takes around 40 seconds. At that speed it will take more than 12 days to calculate the correlation coefficients for all 50k categories. The nested for loop is the bottleneck. Is there a way to decrease the run time, or am I doing something wrong?
Any hints or advice would be highly appreciated. Thanks!
Re: Improving the Nested For Loop
by BrowserUk (Patriarch) on Sep 02, 2014 at 06:12 UTC
You are doing an awful lot of completely unnecessary copying of arrays.
Try this clean-compiling (unlike your posted code) but untested version. It should be substantially faster:
#! perl -slw
use strict;
use List::Util qw( sum );
## Variables and Data Structures
my $count = 1;
my @probesArray;
my %probes;
my $size = 100;
## Reading the file
open my $FILE, "data.txt" or die "ERROR: cannot read file $!\n";
while (my $line = <$FILE>){
    chomp $line;
    my @line = split '\t', $line;
    $probes{ $line[0] } = [ @line[ 1 .. $#line ] ]; ## value of the hash as an array
    ## correlation between 1-2 or 2-1 will be the same, so calculating only once can be done thru this array
    $probesArray[ $count ] = $line[ 0 ];
    $count++;
}
## Frankly speaking, reading of the file takes less than 3 sec with 50,000 categories each having 100 values
close $FILE;
## Correlation Calculation
for( my $i = 1; $i <= $count-2; $i++ ){      ## probesArray is filled from index 1
    for( my $j = $i+1; $j <= $count-1; $j++ ){
        my $cor = correlation( $probes{ $probesArray[$i] }, $probes{ $probesArray[$j] }, $size ); ## correlation is the subroutine
        $probes{ $probesArray[$i] . "-" . $probesArray[$j] } = $cor;
        # print $probesArray[$i] . "-" . $probesArray[$j], "\t", $cor, "\n";
    }
}
## Subroutines
## Pearson Correlation Coefficient
sub correlation {
    my( $arr1, $arr2, $size ) = @_;
    my( $mean_x, $mean_y ) = mean( $arr1, $arr2, $size );
    my( $ssxx, $ssxy, $ssyy ) = ss( $arr1, $arr2, $mean_x, $mean_y );
    my $cor = $ssxy / sqrt( $ssxx * $ssyy );
    return $cor;
}
sub mean {
    my( $arr1, $arr2, $size ) = @_;
    my $mu_x = sum( @$arr1 ) / $size;
    my $mu_y = sum( @$arr2 ) / $size;
    return( $mu_x, $mu_y );
}
## Sum of Squared Deviations to the mean
sub ss {
    my( $arr1, $arr2, $mean_x, $mean_y ) = @_;
    my( $ssxx, $ssxy, $ssyy ) = (0) x 3;
    ## looping over all the samples
    for( my $i = 0; $i < @$arr1; $i++ ){
        $ssxx += ( $arr1->[$i] - $mean_x )**2;
        $ssxy += ( $arr1->[$i] - $mean_x ) * ( $arr2->[$i] - $mean_y );
        $ssyy += ( $arr2->[$i] - $mean_y )**2;
    }
    return ($ssxx, $ssxy, $ssyy);
}
Ask questions about anything you do not understand.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Improving the Nested For Loop
by moritz (Cardinal) on Sep 02, 2014 at 07:33 UTC
PDL::Stats::Basic can calculate correlations for you. And it probably does it faster than you do, so give it a try!
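For illustration, a minimal untested sketch of what that could look like. The `corr_table` method and the data layout (samples along the first dimension) are my reading of the PDL::Stats::Basic documentation, so verify against the docs before relying on it:

```perl
use strict;
use warnings;
use PDL;
use PDL::Stats::Basic;    # from the PDL::Stats distribution on CPAN

## Toy data: three categories of four samples each; the real problem
## would be a 100 x 50_000 piddle.
my @data = ( [1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1] );
my $m = pdl(\@data);      # dims: 4 (samples) x 3 (categories)

## corr_table computes the full category-by-category Pearson matrix in
## compiled code, so the pairwise loop never runs at the Perl level.
my $cor = $m->corr_table;
print $cor;
```

Even if the memory for a 50k x 50k result matrix is an issue, correlating one category against all the others at once with `corr` avoids the inner Perl loop.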
Re: Improving the Nested For Loop
by CountZero (Bishop) on Sep 02, 2014 at 06:37 UTC
You are calling the correlation sub roughly 50_000 times 50_000 divided by 2: about 1_250_000_000 calls in total (1_249_975_000 to be exact, since a category is not correlated with itself). That is quite a lot, so any change to your correlation sub will go a long way.

I see you start with some array references, which you de-reference (turn back into arrays), which you then reference again in the call to correlation, passing these references on in the calls to mean and ss. And in those subs you de-reference these references again and put everything into arrays once more. I suggest that you keep working with references all the way through and never copy them into an array. This is less readable, and you have to take care that you do not change any of the values of those arrays, but I think you are OK on this point.

Another speed-up is looking for a faster calculation of the correlation itself. Perhaps there is an XS module somewhere on CPAN that does this faster. Did you check?
CountZero "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James My blog: Imperial Deltronics
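The call count is easy to check with a couple of lines of Perl (for 50_000 categories, just under 1.25 billion unordered pairs):

```perl
use strict;
use warnings;

## Unordered pairs among n categories: n * (n - 1) / 2
my $n     = 50_000;
my $pairs = $n * ($n - 1) / 2;
print "$pairs\n";    # 1249975000
```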
Re: Improving the Nested For Loop
by hdb (Monsignor) on Sep 02, 2014 at 06:52 UTC
In addition to what the fellow monks advised above, you calculate the mean and the standard deviation of each category many times over. You can save some time by doing it only once and storing the results (e.g. while loading the data). You could also standardize your data (by subtracting the mean and dividing by the standard deviation), which takes the repeated calculation of mean and standard deviation out of your nested loop entirely.
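To make the standardization point concrete, here is a small self-contained sketch (my own illustration with toy data, not the OP's code): once each vector has been turned into z-scores, Pearson's r collapses to the mean of the element-wise products, so the nested loop needs no per-pair mean or standard deviation at all:

```perl
use strict;
use warnings;
use List::Util qw(sum);

## Standardize a vector once: subtract the mean, divide by the
## (population) standard deviation. Assumes the data is not constant.
sub standardize {
    my ($v) = @_;
    my $n    = @$v;
    my $mean = sum(@$v) / $n;
    my $sd   = sqrt( sum( map { ($_ - $mean)**2 } @$v ) / $n );
    return [ map { ($_ - $mean) / $sd } @$v ];
}

## With standardized vectors, Pearson r is just the mean of the products.
sub corr_z {
    my ($zx, $zy) = @_;
    my $n = @$zx;
    return sum( map { $zx->[$_] * $zy->[$_] } 0 .. $n - 1 ) / $n;
}

my $zx = standardize([1, 2, 3, 4]);
my $zy = standardize([2, 4, 6, 8]);
printf "%.6f\n", corr_z($zx, $zy);    # perfectly linear data: 1.000000
```

Standardizing all 50,000 vectors is a one-off O(n) pass; the per-pair cost drops to a single dot product.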
The Memoize module would be a good and easy way to avoid repeated calculations.
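A minimal sketch of the Memoize idea, using only core modules; the `stats_for` sub and the toy data are made up for illustration:

```perl
use strict;
use warnings;
use Memoize;              # core module
use List::Util qw(sum);

my %probes = (
    a => [1, 2, 3, 4],
    b => [2, 4, 6, 8],
);

my $computed = 0;

## Per-category statistics; cheap here, but in the real script this is
## the work that would otherwise be redone once per *pair* of keys.
sub stats_for {
    my ($key) = @_;
    $computed++;
    my $v    = $probes{$key};
    my $mean = sum(@$v) / @$v;
    my $ss   = sum( map { ($_ - $mean)**2 } @$v );
    return [ $mean, $ss ];
}

memoize('stats_for');     # repeat calls with the same key now hit a cache

for my $pair ( ['a', 'b'], ['b', 'a'], ['a', 'b'] ) {
    my $s1 = stats_for( $pair->[0] );
    my $s2 = stats_for( $pair->[1] );
}
print "stats computed $computed time(s) for ", scalar keys %probes, " keys\n";
```

Each key's statistics are computed once, no matter how many pairs mention it; explicitly caching into a hash, as Cristoforo does below, achieves the same effect without the module.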
Re: Improving the Nested For Loop
by Laurent_R (Canon) on Sep 02, 2014 at 06:12 UTC
Hmm, you're doing a lot of computing; this is bound to take quite a bit of time. I would suggest that you start by profiling your program to figure out exactly which part of your code is taking the most time. You might want to use a tool such as Devel::NYTProf.
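A typical session looks roughly like this (`correlate.pl` is a stand-in name for the OP's script; Devel::NYTProf must be installed from CPAN):

```
# Run the script under the profiler, then render an HTML report
perl -d:NYTProf correlate.pl
nytprofhtml            # writes the report into ./nytprof/
```

The report breaks time down per subroutine and per line, which makes it obvious whether the array copying, the arithmetic, or sheer sub-call overhead dominates.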
Re: Improving the Nested For Loop
by Cristoforo (Curate) on Sep 02, 2014 at 22:57 UTC
This (untested) approach uses a cache to compute the Sum of Squared Deviations to the mean only once per category. Your approach redoes that calculation on every call to the correlation sub, so this saves billions of redundant operations.
#!/tools/bin/perl
use strict;
use warnings;
use List::Util 'sum';
my @keys;
my $probes;
open my $FILE, "data.txt" or die "ERROR: cannot read file $!\n";
while (<$FILE>){
    chomp;
    my ($key, @line) = split /\t/;
    $probes->{$key}{vals} = \@line;
    $probes->{$key}{mean} = sum(@line) / @line;
    push @keys, $key;
}
close($FILE) or die $!;
my $cache;
## Sum of Squared Deviations to the mean
for my $key (@keys) {
    my $mean = $probes->{$key}{mean};
    my $ss;
    for my $dat (@{ $probes->{$key}{vals} }) {
        $ss += ($dat - $mean) ** 2;
    }
    $cache->{$key} = $ss;
}
## Correlation Calculation
my $count = 1;
for my $i (0 .. $#keys){
    for my $j ($i+1 .. $#keys){
        my $cor = correlation($probes, $cache, @keys[$i, $j]);
        # $calProbes{"$keys[$i]-$keys[$j]"} = $cor;
        print $count++, "\t", "$keys[$i]-$keys[$j]\t$cor\n";
    }
}
## Pearson Correlation Coefficient
sub correlation {
    my ($probes, $cache, $key1, $key2) = @_;
    my $arr1 = $probes->{$key1}{vals};
    my $arr2 = $probes->{$key2}{vals};
    my $mean_x = $probes->{$key1}{mean};
    my $mean_y = $probes->{$key2}{mean};
    my ($ssxx, $ssxy, $ssyy) = ($cache->{$key1}, 0, $cache->{$key2});
    for (my $i = 0; $i < @$arr1; $i++){
        $ssxy += ($arr1->[$i] - $mean_x) * ($arr2->[$i] - $mean_y);
    }
    return $ssxy / sqrt( $ssxx * $ssyy );
}
Update: as moritz suggested, PDL::Stats::Basic might provide the optimal solution. My response uses plain Perl.
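Taking the caching one step further (my own untested sketch, with toy data): if you also center each vector once up front, the per-pair work shrinks to a single dot product:

```perl
use strict;
use warnings;
use List::Util qw(sum);

## Toy stand-in for the 50_000-category data set.
my %probes = (
    p1 => [1, 2, 3, 4],
    p2 => [2, 4, 6, 8],
    p3 => [4, 3, 2, 1],
);

## Pre-center each vector once and cache its sum of squares; the
## per-pair work then reduces to one dot product.
my (%centered, %ss);
for my $key (keys %probes) {
    my $v    = $probes{$key};
    my $mean = sum(@$v) / @$v;
    $centered{$key} = [ map { $_ - $mean } @$v ];
    $ss{$key}       = sum( map { $_**2 } @{ $centered{$key} } );
}

sub correlation {
    my ($k1, $k2) = @_;
    my ($x, $y) = ($centered{$k1}, $centered{$k2});
    my $ssxy = sum( map { $x->[$_] * $y->[$_] } 0 .. $#$x );
    return $ssxy / sqrt( $ss{$k1} * $ss{$k2} );
}

printf "%.4f\n", correlation('p1', 'p2');    # 1.0000
printf "%.4f\n", correlation('p1', 'p3');    # -1.0000
```

This trades one extra array per category for removing the two subtractions per element from the inner loop of every pair.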