Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
Hi Monks,
I need your wisdom to improve my "nested for loop". I am trying to compute correlations for about 50,000 categories, each of which has 100 values. Here is a snippet of my code:
#!/tools/bin/perl
use strict;
use warnings;
use List::Util qw(sum);
## Variables and Data Structures
my $count = 1;
my @probesArray;
my %probes;
my $size = 100;
## Reading the file
open my $FILE, "data.txt" or die "ERROR: cannot read file $!\n";
while (my $line = <$FILE>){
    chomp $line;
    my @line = split('\t', $line);
    $probes{$line[0]} = [@line[1 .. $#line]]; ## value of the hash as an array
    $probesArray[$count] = $line[0]; ## correlation between 1-2 or 2-1 will be the same, so calculating only once can be done thru this array
    $count++;
}
close($FILE); ## Frankly speaking, reading of the file takes less than 3 sec with 50,000 categories each having 100 values
## Correlation Calculation
my %calProbes;
for my $i (1 .. $count - 2){           ## probesArray is filled from index 1
    for my $j ($i + 1 .. $count - 1){
        my @probe1 = @{$probes{$probesArray[$i]}}; ## Array of data
        my @probe2 = @{$probes{$probesArray[$j]}};
        my $cor = correlation(\@probe1, \@probe2, \$size); ## correlation is the subroutine
        $calProbes{$probesArray[$i] . "-" . $probesArray[$j]} = $cor;
        # print $probesArray[$i] . "-" . $probesArray[$j], "\t", $cor, "\n";
    }
}
## Subroutines
sub mean {
    my ($arr1, $arr2, $size) = @_;
    my @arr1 = @$arr1;
    my @arr2 = @$arr2;
    my $mu_x = sum(@arr1) / $$size;
    my $mu_y = sum(@arr2) / $$size;
    return ($mu_x, $mu_y);
}
## Sum of Squared Deviations to the mean
sub ss {
    my ($arr1, $arr2, $mean_x, $mean_y) = @_;
    my @arr1 = @$arr1;
    my @arr2 = @$arr2;
    my ($ssxx, $ssxy, $ssyy) = (0) x 3;
    ## looping over all the samples
    for (my $i = 0; $i <= $#arr1; $i++){
        $ssxx += ($arr1[$i] - $mean_x)**2;
        $ssxy += ($arr1[$i] - $mean_x) * ($arr2[$i] - $mean_y);
        $ssyy += ($arr2[$i] - $mean_y)**2;
    }
    return ($ssxx, $ssxy, $ssyy);
}
## Pearson Correlation Coefficient
sub correlation {
    my ($arr1, $arr2, $size) = @_;
    my ($mean_x, $mean_y) = mean($arr1, $arr2, $size);
    my ($ssxx, $ssxy, $ssyy) = ss($arr1, $arr2, $mean_x, $mean_y);
    my $cor = $ssxy / sqrt($ssxx * $ssyy);
    return $cor;
}
The correlation calculation is taking a lot of time: computing category1 against all the others (i.e. 50,000) takes around 40 seconds. At that speed it will take more than 12 days to calculate the correlation coefficients for all 50k categories. The nested for loop is the bottleneck. Is there a way to decrease the run time, or am I doing something wrong?
Any hints or advice would be highly appreciated. Thanks!
Re: Improving the Nested For Loop
by BrowserUk (Patriarch) on Sep 02, 2014 at 06:12 UTC
You are doing an awful lot of completely unnecessary copying of arrays.
Try this clean-compiling (unlike your posted code) but untested version. It should be substantially faster:
#! perl -slw
use strict;
use List::Util qw( sum );
## Variables and Data Structures
my $count = 1;
my @probesArray;
my %probes;
my $size = 100;
## Reading the file
open my $FILE, "data.txt" or die "ERROR: cannot read file $!\n";
while (my $line = <$FILE>){
    chomp $line;
    my @line = split '\t', $line;
    $probes{ $line[0] } = [ @line[ 1 .. $#line ] ]; ## value of the hash as an array
    ## correlation between 1-2 or 2-1 will be the same, so calculating only once can be done thru this array
    $probesArray[ $count ] = $line[ 0 ];
    $count++;
}
## Frankly speaking, reading of the file takes less than 3 sec with 50,000 categories each having 100 values
close $FILE;
## Correlation Calculation
for( my $i = 1; $i <= $count-2; $i++ ){      ## probesArray is filled from index 1
    for( my $j = $i+1; $j <= $count-1; $j++ ){
        my $cor = correlation( $probes{ $probesArray[$i] }, $probes{ $probesArray[$j] }, $size ); ## correlation is the subroutine
        $probes{ $probesArray[$i] . "-" . $probesArray[$j] } = $cor;
        # print $probesArray[$i] . "-" . $probesArray[$j], "\t", $cor, "\n";
    }
}
## Subroutines
## Pearson Correlation Coefficient
sub correlation {
    my( $arr1, $arr2, $size ) = @_;
    my( $mean_x, $mean_y ) = mean( $arr1, $arr2, $size );
    my( $ssxx, $ssxy, $ssyy ) = ss( $arr1, $arr2, $mean_x, $mean_y );
    my $cor = $ssxy / sqrt( $ssxx * $ssyy );
    return $cor;
}
sub mean {
    my( $arr1, $arr2, $size ) = @_;
    my $mu_x = sum( @$arr1 ) / $size;
    my $mu_y = sum( @$arr2 ) / $size;
    return( $mu_x, $mu_y );
}
## Sum of Squared Deviations to the mean
sub ss {
    my( $arr1, $arr2, $mean_x, $mean_y ) = @_;
    my( $ssxx, $ssxy, $ssyy ) = (0) x 3;
    ## looping over all the samples
    for( my $i = 0; $i < @$arr1; $i++ ){
        $ssxx += ( $arr1->[$i] - $mean_x )**2;
        $ssxy += ( $arr1->[$i] - $mean_x ) * ( $arr2->[$i] - $mean_y );
        $ssyy += ( $arr2->[$i] - $mean_y )**2;
    }
    return ($ssxx, $ssxy, $ssyy);
}
Ask questions about anything you do not understand.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Improving the Nested For Loop
by moritz (Cardinal) on Sep 02, 2014 at 07:33 UTC
PDL::Stats::Basic can calculate correlations for you. And it probably does it faster than you do, so give it a try!
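For illustration, a minimal untested sketch of what that could look like. The `corr_table` method and the data layout (samples along the first dimension) are my reading of the PDL::Stats::Basic documentation, so verify against the docs before relying on it:

```perl
use strict;
use warnings;
use PDL;
use PDL::Stats::Basic;    # from the PDL::Stats distribution on CPAN

## Toy data: three categories of four samples each; the real problem
## would be a 100 x 50_000 piddle.
my @data = ( [1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1] );
my $m = pdl(\@data);      # dims: 4 (samples) x 3 (categories)

## corr_table computes the full category-by-category Pearson matrix in
## compiled code, so the pairwise loop never runs at the Perl level.
my $cor = $m->corr_table;
print $cor;
```

Even if the memory for a 50k x 50k result matrix is an issue, correlating one category against all the others at once with `corr` avoids the inner Perl loop.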
Re: Improving the Nested For Loop
by CountZero (Bishop) on Sep 02, 2014 at 06:37 UTC
You are calling the correlation sub roughly 50_000 times 50_000 divided by 2: about 1_250_000_000 calls in total (1_249_975_000 to be exact, since a category is not correlated with itself). That is quite a lot, so any change to your correlation sub will go a long way.

I see you start with some array references, which you de-reference (turn back into arrays), which you then reference again in the call to correlation, passing these references on in the calls to mean and ss. And in those subs you de-reference these references again and put everything into arrays once more. I suggest that you keep working with references all the way through and never copy them into an array. This is less readable, and you have to take care that you do not change any of the values of those arrays, but I think you are OK on this point.

Another speed-up is looking for a faster calculation of the correlation itself. Perhaps there is an XS module somewhere on CPAN that does this faster. Did you check?
CountZero "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James My blog: Imperial Deltronics
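The call count is easy to check with a couple of lines of Perl (for 50_000 categories, just under 1.25 billion unordered pairs):

```perl
use strict;
use warnings;

## Unordered pairs among n categories: n * (n - 1) / 2
my $n     = 50_000;
my $pairs = $n * ($n - 1) / 2;
print "$pairs\n";    # 1249975000
```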
Re: Improving the Nested For Loop
by hdb (Monsignor) on Sep 02, 2014 at 06:52 UTC
In addition to what the fellow monks advised above, you calculate the mean and the standard deviation of each category many times over. You can save some time by doing it only once and storing the results (e.g. while loading the data). You could also standardize your data (by subtracting the mean and dividing by the standard deviation), which takes the repeated calculation of mean and standard deviation out of your nested loop entirely.
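To make the standardization point concrete, here is a small self-contained sketch (my own illustration with toy data, not the OP's code): once each vector has been turned into z-scores, Pearson's r collapses to the mean of the element-wise products, so the nested loop needs no per-pair mean or standard deviation at all:

```perl
use strict;
use warnings;
use List::Util qw(sum);

## Standardize a vector once: subtract the mean, divide by the
## (population) standard deviation. Assumes the data is not constant.
sub standardize {
    my ($v) = @_;
    my $n    = @$v;
    my $mean = sum(@$v) / $n;
    my $sd   = sqrt( sum( map { ($_ - $mean)**2 } @$v ) / $n );
    return [ map { ($_ - $mean) / $sd } @$v ];
}

## With standardized vectors, Pearson r is just the mean of the products.
sub corr_z {
    my ($zx, $zy) = @_;
    my $n = @$zx;
    return sum( map { $zx->[$_] * $zy->[$_] } 0 .. $n - 1 ) / $n;
}

my $zx = standardize([1, 2, 3, 4]);
my $zy = standardize([2, 4, 6, 8]);
printf "%.6f\n", corr_z($zx, $zy);    # perfectly linear data: 1.000000
```

Standardizing all 50,000 vectors is a one-off O(n) pass; the per-pair cost drops to a single dot product.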
The Memoize module would be a good and easy way to avoid repeated calculations.
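A minimal sketch of the Memoize idea, using only core modules; the `stats_for` sub and the toy data are made up for illustration:

```perl
use strict;
use warnings;
use Memoize;              # core module
use List::Util qw(sum);

my %probes = (
    a => [1, 2, 3, 4],
    b => [2, 4, 6, 8],
);

my $computed = 0;

## Per-category statistics; cheap here, but in the real script this is
## the work that would otherwise be redone once per *pair* of keys.
sub stats_for {
    my ($key) = @_;
    $computed++;
    my $v    = $probes{$key};
    my $mean = sum(@$v) / @$v;
    my $ss   = sum( map { ($_ - $mean)**2 } @$v );
    return [ $mean, $ss ];
}

memoize('stats_for');     # repeat calls with the same key now hit a cache

for my $pair ( ['a', 'b'], ['b', 'a'], ['a', 'b'] ) {
    my $s1 = stats_for( $pair->[0] );
    my $s2 = stats_for( $pair->[1] );
}
print "stats computed $computed time(s) for ", scalar keys %probes, " keys\n";
```

Each key's statistics are computed once, no matter how many pairs mention it; explicitly caching into a hash, as Cristoforo does below, achieves the same effect without the module.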
Re: Improving the Nested For Loop
by Laurent_R (Canon) on Sep 02, 2014 at 06:12 UTC
Hmm, you're doing a lot of computing; this is bound to take quite a bit of time. I would suggest that you start by profiling your program to figure out exactly which part of your code is taking the most time. You might want to use a tool such as Devel::NYTProf.
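A typical session looks roughly like this (`correlate.pl` is a stand-in name for the OP's script; Devel::NYTProf must be installed from CPAN):

```
# Run the script under the profiler, then render an HTML report
perl -d:NYTProf correlate.pl
nytprofhtml            # writes the report into ./nytprof/
```

The report breaks time down per subroutine and per line, which makes it obvious whether the array copying, the arithmetic, or sheer sub-call overhead dominates.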
Re: Improving the Nested For Loop
by Cristoforo (Curate) on Sep 02, 2014 at 22:57 UTC
This (untested) approach uses a cache to compute the Sum of Squared Deviations to the mean only once per category. Your approach redoes that calculation on every call to the correlation sub, so this saves billions of redundant operations.
#!/tools/bin/perl
use strict;
use warnings;
use List::Util 'sum';
my @keys;
my $probes;
open my $FILE, "data.txt" or die "ERROR: cannot read file $!\n";
while (<$FILE>){
    chomp;
    my ($key, @line) = split /\t/;
    $probes->{$key}{vals} = \@line;
    $probes->{$key}{mean} = sum(@line) / @line;
    push @keys, $key;
}
close($FILE) or die $!;
my $cache;
## Sum of Squared Deviations to the mean
for my $key (@keys) {
    my $mean = $probes->{$key}{mean};
    my $ss;
    for my $dat (@{ $probes->{$key}{vals} }) {
        $ss += ($dat - $mean) ** 2;
    }
    $cache->{$key} = $ss;
}
## Correlation Calculation
my $count = 1;
for my $i (0 .. $#keys){
    for my $j ($i+1 .. $#keys){
        my $cor = correlation($probes, $cache, @keys[$i, $j]);
        # $calProbes{"$keys[$i]-$keys[$j]"} = $cor;
        print $count++, "\t", "$keys[$i]-$keys[$j]\t$cor\n";
    }
}
## Pearson Correlation Coefficient
sub correlation {
    my ($probes, $cache, $key1, $key2) = @_;
    my $arr1 = $probes->{$key1}{vals};
    my $arr2 = $probes->{$key2}{vals};
    my $mean_x = $probes->{$key1}{mean};
    my $mean_y = $probes->{$key2}{mean};
    my ($ssxx, $ssxy, $ssyy) = ($cache->{$key1}, 0, $cache->{$key2});
    for (my $i = 0; $i < @$arr1; $i++){
        $ssxy += ($arr1->[$i] - $mean_x) * ($arr2->[$i] - $mean_y);
    }
    return $ssxy / sqrt( $ssxx * $ssyy );
}
Update: as moritz suggested, PDL::Stats::Basic might provide the optimal solution. My response uses plain Perl.
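Taking the caching one step further (my own untested sketch, with toy data): if you also center each vector once up front, the per-pair work shrinks to a single dot product:

```perl
use strict;
use warnings;
use List::Util qw(sum);

## Toy stand-in for the 50_000-category data set.
my %probes = (
    p1 => [1, 2, 3, 4],
    p2 => [2, 4, 6, 8],
    p3 => [4, 3, 2, 1],
);

## Pre-center each vector once and cache its sum of squares; the
## per-pair work then reduces to one dot product.
my (%centered, %ss);
for my $key (keys %probes) {
    my $v    = $probes{$key};
    my $mean = sum(@$v) / @$v;
    $centered{$key} = [ map { $_ - $mean } @$v ];
    $ss{$key}       = sum( map { $_**2 } @{ $centered{$key} } );
}

sub correlation {
    my ($k1, $k2) = @_;
    my ($x, $y) = ($centered{$k1}, $centered{$k2});
    my $ssxy = sum( map { $x->[$_] * $y->[$_] } 0 .. $#$x );
    return $ssxy / sqrt( $ss{$k1} * $ss{$k2} );
}

printf "%.4f\n", correlation('p1', 'p2');    # 1.0000
printf "%.4f\n", correlation('p1', 'p3');    # -1.0000
```

This trades one extra array per category for removing the two subtractions per element from the inner loop of every pair.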