Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Improving the Nested For Loop

by Anonymous Monk
on Sep 02, 2014 at 04:28 UTC ( [id://1099219]=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I need your wisdom to improve my "nested for loop". I am trying to do correlation calculation for about 50,000 categories with each category has 100 values. Here is the snippet of my code:

#!/tools/bin/perl use strict; use warnings; ## Variables and Data Structures my $count = 1; my @probesArray my %probes; my $size = 100; ## Reading the file open my $FILE, "data.txt" or die "ERROR: cannot read file $!\n"; while (my $line = <$FILE>){ chomp $line; my @line = split('\t',$line); $probes{$line[0]} = [@line[1 .. $#line]]; ## value of the hash as an a +rray $probesArray[$count] = $line[0]; ## correlation between 1-2 or 2-1 wil +l be same so using calculating only once can be done thru this array $count++; } close($FILE); ## Frankly speaking reading of the file takes less than +3 sec with 50,000 categories each having 100 values ## Correlation Calculation for(my $i = 0; $i <= $count-1; $i++){ for(my $j = $i+1; $j <= $count; $j++){ my @probe1 = @{$probes{$probesArray[$i]}}; ## Array of data my @probe2 = @{$probes{$probesArray[$j]}}; my $cor = correlation(\@probe1, \@probe2, \$size); ## correlat +ion is the subroutine $calProbes{$probesArray[$i]."-".$probesArray[$j]} = $cor; # print $count,"\t",$probesArray[$i]."-".$probesArray[$j],"\t", +$cor,"\n"; $count++; } } ## Subroutines sub mean { my ($arr1, $arr2, $size) = @_; my @arr1 = @$arr1; my @arr2 = @$arr2; my $mu_x = sum(@arr1) / $$size; my $mu_y = sum(@arr2) / $$size; return($mu_x,$mu_y); } ## Sum of Squared Deviations to the mean sub ss { my ($arr1, $arr2, $mean_x,$mean_y) = @_; my @arr1 = @$arr1; my @arr2 = @$arr2; my ($ssxx, $ssxy, $ssyy) = (0) x 3; ## looping over all the samples for(my $i = 0; $i <= scalar(@arr1)-1; $i++){ $ssxx = $ssxx + ($arr1[$i] - $mean_x)**2; $ssxy = $ssxy + ($arr1[$i] - $mean_x)*($arr2[$i] - $mean_y) ; $ssyy = $ssyy + ($arr2[$i] - $mean_y)**2; } return ($ssxx, $ssxy, $ssyy); } ## Pearson Correlation Coefficient sub correlation { my ($arr1, $arr2, $size) = @_; my ($mean_x,$mean_y) = mean($arr1, $arr2, $size); my ($ssxx, $ssxy, $ssyy) = ss($arr1, $arr2, $mean_x, $mean_y); my $cor = $ssxy/sqrt($ssxx*$ssyy); return($cor); }

Correlation Calculation is taking a lot of time. I mean calculating correlation between category1 vs all (i.e. 50000) is taking around 40 sec. Now, if that is the speed then, it will take more than 12 days to calculate the correlation coefficient for all the 50k categories. Nested for loop is taking a lot of time. Is there a way to decrease the run time or am I doing something wrong.

Any hints or advice would be highly appreciated. Thanks!

Replies are listed 'Best First'.
Re: Improving the Nested For Loop
by BrowserUk (Patriarch) on Sep 02, 2014 at 06:12 UTC

    You are doing an awful lot of completely unnecessary copying of arrays.

    Try this clean-compiling (unlike your posted code), but untested version, It should be substantially faster:

    #! perl -slw use strict; ## Variables and Data Structures my $count = 1; my @probesArray; my %probes; my $size = 100; ## Reading the file open my $FILE, "data.txt" or die "ERROR: cannot read file $!\n"; while (my $line = <$FILE>){ chomp $line; my @line = split '\t', $line; $probes{ $line[0] } = [ @line[ 1 .. $#line ] ]; ## value of the ha +sh as an array ## correlation between 1-2 or 2-1 will be same so using calculatin +g only once can be done thru this array $probesArray[ $count ] = $line[ 0 ]; $count++; } ## Frankly speaking reading of the file takes less than 3 sec with 50, +000 categories each having 100 values close $FILE; ## Correlation Calculation for( my $i = 0; $i <= $count-1; $i++ ){ for( my $j = $i+1; $j <= $count; $j++ ){ my $cor = correlation( $probes{ $probesArray[$i] }, $probes{ $ +probesArray[$j] }, $size ); ## correlation is the subroutine $probes{ $probesArray[$i] . "-" . $probesArray[$j] } = $cor; # print $count,"\t",$probesArray[$i]."-".$probesArray[$j],"\t",$ +cor,"\n"; $count++; } } ## Subroutines ## Pearson Correlation Coefficient sub correlation { my( $arr1, $arr2, $size ) = @_; my( $mean_x, $mean_y ) = mean( $arr1, $arr2, $size ); my( $ssxx, $ssxy, $ssyy ) = ss( $arr1, $arr2, $mean_x, $mean_y ); my $cor = $ssxy / sqrt( $ssxx * $ssyy ); return $cor ; } sub mean { my( $arr1, $arr2, $size ) = @_; my $mu_x = sum( @$arr1 ) / $size; my $mu_y = sum( @$arr2 ) / $size; return($mu_x,$mu_y); } ## Sum of Squared Deviations to the mean sub ss { my( $arr1, $arr2, $mean_x, $mean_y ) = @_; my( $ssxx, $ssxy, $ssyy ) = (0) x 3; ## looping over all the samples for( my $i = 0; $i < @$arr1; $i++ ){ $ssxx = $ssxx + ( $arr1->[$i] - $mean_x )**2; $ssxy = $ssxy + ( $arr1->[$i] - $mean_x ) * ( $arr2->[$i] - $m +ean_y ); $ssyy = $ssyy + ( $arr2->[$i] - $mean_y )**2; } return ($ssxx, $ssxy, $ssyy); }

    Ask questions about anything you do not understand.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Improving the Nested For Loop
by moritz (Cardinal) on Sep 02, 2014 at 07:33 UTC
Re: Improving the Nested For Loop
by CountZero (Bishop) on Sep 02, 2014 at 06:37 UTC
    You are calling the correlation sub 50_000 times 50_000 divided by 2 (I think): 11_250_000_000 calls in total. That is quiet a lot so any change to your correlation sub will go a long way.

    I see you start with some array-references which you de-reference (turn back into an array), which you then reference again in the call to correlation and passing these references in the calls to mean and ss. And in those subs, you de-reference these references again and put everyting in an array again. I suggest that you keep working with references all the way through and never copy them into an array. This is less readable and you have to take care that you do not change any of the values of those arrays. I think you are OK on this point.

    Another speed-up is looking for faster calculation of the correlation. Perhaps there is an XS module somewhere on CPAN that does this faster. Did you check?

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: Improving the Nested For Loop
by hdb (Monsignor) on Sep 02, 2014 at 06:52 UTC

    In addition what the fellow monks advised above already, you do calculate the mean and the standard deviation of each category many times. You can save some time by doing it only once and store the results (e.g. while loading the data). You could also standardize your data (by subtracting the mean and dividing by the standard deviation) which also takes out the repeated calculation of mean and standard deviation in your nested loop.

      The Memoize module would be a good and easy way to avoid repeated calculations.

      CountZero

      A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      My blog: Imperial Deltronics
Re: Improving the Nested For Loop
by Laurent_R (Canon) on Sep 02, 2014 at 06:12 UTC
    Hmm, you're doing a lot of computing, this is bound to take quite a bit of time. I would suggest that you start by profiling your program to figure out exactly which part of your code is taking long time. You might want to use a tool such a Devel::NYTProf.
Re: Improving the Nested For Loop
by Cristoforo (Curate) on Sep 02, 2014 at 22:57 UTC
    This (untested) approach uses a cache to get the Sum of Squared Deviations to the mean one time. Your approach does this calculation for each call to the correlation sub. So this saves about 11 billion or more (more I think :-) )calculations.

    #!/tools/bin/perl use strict; use warnings; use List::Util 'sum'; my @keys; my $probes; open my $FILE, "data.txt" or die "ERROR: cannot read file $!\n"; while (<$FILE>){ chomp; my ($key, @line) = split /\t/; $probes->{$key}{vals} = \@line; $probes->{$key}{mean} = sum(@line) / @line; push @keys, $key; } close($FILE) or die $!; my $cache; ## Sum of Squared Deviations to the mean for my $key (@keys) { my $mean = $probes->{$key}{mean}; my $ss; for my $dat (@{ $probes->{$key}{vals} }) { $ss += ($dat - $mean) ** 2; } $cache->{$key} = $ss; } ## Correlation Calculation my $count = 1; for my $i (0 .. $#keys){ for my $j ($i+1 .. $#keys){ my $cor = correlation($probes, $cache, @keys[$i, $j]); # $calProbes{$probesArray[$i]."-".$probesArray[$j]} = $cor; print $count++,"\t", "$keys[$i]-$keys[$j]\t$cor\n"; } } ## Sum of Squared Deviations to the mean sub correlation { my ($probes, $cache, $key1, $key2) = @_; my $arr1 = $probes->{$key1}{vals}; my $arr2 = $probes->{$key2}{vals}; my $mean_x = $probes->{$key1}{mean}; my $mean_y = $probes->{$key2}{mean}; my ($ssxx, $ssxy, $ssyy) = ($cache->{$key1}, 0, $cache->{$key2}); for(my $i = 0; $i < @$arr1; $i++){ $ssxy += ($arr1->[$i] - $mean_x)*($arr2->[$i] - $mean_y) ; } return $ssxy/sqrt( $ssxx * $ssyy ); }
    Update: as moritz suggested, PDL::Stats::Basic might provide the optimal solution. My response is using plain perl.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1099219]
Approved by farang
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others meditating upon the Monastery: (5)
As of 2024-04-24 07:46 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found