### Issue on covariance calculation

by Mandrake (Chaplain)
 on Apr 13, 2007 at 08:03 UTC Need Help??

Mandrake has asked for the wisdom of the Perl Monks concerning the following question:

Hi there..
I am in the process of building a huge covariance matrix( around 10000 x 10000 elements). Since covariance matrix is a symmetric matrix, I attempt to create only one half of the matrix.
```0.22929292
0.32928322, 0.528289202
0.19383838, 0.32892929992, 0.78299283839
0.929283893, 0.4829299299, 0.628299292, 0.929389393
.
.
.
The input the script is an array of comma separated values
```0.2939383839, -0.0929288282,0.1293893939, 0.833883929 . . . 250 elemen
+ts
.
.
.
1000 elements
The relevant part of my script :
```
open TMP, "datafile.txt" || die ("Could not open data file\n") ;
my @input_data = <TMP> ;
close TMP;

my \$record_count = 0;

# Loop through the array
for my \$element (@input_data) {

chomp;
@first = split(",", \$element ) ;

my \$jcount = 0;
my @result;
# Loop through the array again

for my \$inside_elem (@input_data) {
chomp;

@second = split(",", \$inside_elem) ;

# Build only elements below the diagonal
last if ( \$jcount++ > \$record_count  ) ;

# Covariance logic
# No issues with the logic
# sum of products of corresponding elements in the arrays
\$sum = 0;
\$count = 0;
for (@first) {
\$sum += \$_ * \$second[\$count++] ;
}
\$sum /= scalar(@first) ;

push @result, \$sum ;
}
my \$str_to_write =  join(", ", @result)."\n"  ;
undef @result ;

# Open the output file handler in append mode.
open TMP, ">>Outfile.txt" || die ("Could not open output file ");

print TMP \$str_to_write;

close TMP ;
}
The issues with my script.
1. Performance.
It took almost an hour for the first 2500 (out of 10000) records to get generated / written to output file. I badly want to optimize the performance. Can you wise ones give some suggestions?

2. Storage space.
The file size of the output file would be almost a GB (or even more). Sorry if this does not make sense but I was given a suggestion that instead of writing the data as text, if we write it as binary it would save space. I mean, 0.09992020202 would take 14 bytes of space. Instead of character if we write it as float, it should take less space. Is this idea possible to be implemented in Perl ?

Replies are listed 'Best First'.
Re: Issue on covariance calculation
by rodion (Chaplain) on Apr 13, 2007 at 08:53 UTC
Opening and closing a file is quite time consuming on most operating systems. Your code opens and closes "Outfile.txt" 10,000 times, once for each input line. You also have to find the end of the file each time you open it, which will take quite a while as the file gets large.

Try taking the open and close for that file outside the "for my \$element" loop, so it's opened and closed only once. That should speed things up a lot.

Re: Issue on covariance calculation
by GrandFather (Saint) on Apr 13, 2007 at 08:59 UTC

Some non-issue stuff for a start. First, always use strictures (use strict; use warnings;). In this case you might discover that the chomps are chomping on \$_ and not the loop variables I suspect you think they are chomping on. However you would be much better to chomp the whole slurped array rather than performing @input_data + @input_data2 chomps:

```chomp @input_data;

There seems to be a missing ++\$record_count; (I assume following the outer loop print).

Some advantage can be gained by pre-splitting the lines. Consider:

```use strict;
use warnings;
use Benchmark qw(cmpthese);

my @data = <DATA>;

cmpthese (-1,
{
original => sub {original (@data)},
presplit => sub {presplit (@data)},
});

sub original {
my @input_data = @_;

my \$record_count = 0;

# Loop through the array
for my \$element (@input_data) {

chomp \$element;
my @first = split(",", \$element ) ;

my \$jcount = 0;
my @result;
# Loop through the array again

for my \$inside_elem (@input_data) {
chomp \$inside_elem;

my @second = split(",", \$inside_elem) ;

# Build only elements below the diagonal
last if ( \$jcount++ > \$record_count  ) ;

# Covariance logic
# No issues with the logic
# sum of products of corresponding elements in the arrays
my \$sum = 0;
my \$count = 0;
for (@first) {
\$sum += \$_ * \$second[\$count++] ;
}
\$sum /= scalar(@first) ;

push @result, \$sum ;
}

my \$str_to_write =  join(", ", @result)."\n"  ;
undef @result ;

# Open the output file handler in append mode.
#print \$str_to_write;
++\$record_count;
}
}

sub presplit {
my @input_data = @_;
my \$record_count = 0;

chomp @input_data;
@input_data = map {[split ',']} @input_data;

# Loop through the array
for my \$first (@input_data) {
my \$jcount = 0;
my @result;

# Loop through the array again
for my \$second (@input_data) {
# Build only elements below the diagonal
last if \$jcount++ > \$record_count;

# Covariance logic
# No issues with the logic
# sum of products of corresponding elements in the arrays
my \$sum = 0;
my \$count = 0;

\$sum += \$_ * \$second->[\$count++] for @\$first;
\$sum /= scalar @\$first;
push @result, \$sum ;
}

my \$str_to_write =  join(", ", @result)."\n"  ;

# Open the output file handler in append mode.
#print \$str_to_write;
++\$record_count;
}
}

__DATA__
0.2939383839, -0.0929288282, 0.2939383839, -0.0929288282, -0.092928828
+2, 0.2939383839, -0.0929288282
0.0139383839, -0.1929288282, 0.0139383839, -0.1929288282, -0.192928828
+2, 0.0139383839, -0.1929288282
0.2009383839,  0.0929288282, 0.2009383839,  0.0929288282,  0.092928828
+2, 0.2009383839,  0.0929288282
0.0939383839,  0.5929288282, 0.0939383839,  0.5929288282,  0.592928828
+2, 0.0939383839,  0.5929288282
0.2939383839, -0.0929288282, 0.2939383839, -0.0929288282, -0.092928828
+2, 0.2939383839, -0.0929288282
0.0139383839, -0.1929288282, 0.0139383839, -0.1929288282, -0.192928828
+2, 0.0139383839, -0.1929288282
0.2009383839,  0.0929288282, 0.2009383839,  0.0929288282,  0.092928828
+2, 0.2009383839,  0.0929288282
0.0939383839,  0.5929288282, 0.0939383839,  0.5929288282,  0.592928828
+2, 0.0939383839,  0.5929288282
0.2939383839, -0.0929288282, 0.2939383839, -0.0929288282, -0.092928828
+2, 0.2939383839, -0.0929288282
0.0139383839, -0.1929288282, 0.0139383839, -0.1929288282, -0.192928828
+2, 0.0139383839, -0.1929288282
0.2009383839,  0.0929288282, 0.2009383839,  0.0929288282,  0.092928828
+2, 0.2009383839,  0.0929288282
0.0939383839,  0.5929288282, 0.0939383839,  0.5929288282,  0.592928828
+2, 0.0939383839,  0.5929288282
0.2939383839, -0.0929288282, 0.2939383839, -0.0929288282, -0.092928828
+2, 0.2939383839, -0.0929288282
0.0139383839, -0.1929288282, 0.0139383839, -0.1929288282, -0.192928828
+2, 0.0139383839, -0.1929288282
0.2009383839,  0.0929288282, 0.2009383839,  0.0929288282,  0.092928828
+2, 0.2009383839,  0.0929288282
0.0939383839,  0.5929288282, 0.0939383839,  0.5929288282,  0.592928828
+2, 0.0939383839,  0.5929288282
0.2939383839, -0.0929288282, 0.2939383839, -0.0929288282, -0.092928828
+2, 0.2939383839, -0.0929288282

Prints:

```          Rate original presplit
original 205/s       --     -59%
presplit 498/s     143%       --

DWIM is Perl's answer to Gödel
Re: Issue on covariance calculation
by xdg (Monsignor) on Apr 13, 2007 at 11:29 UTC

Perl is fairly memory inefficient for this kind of mathematical calculation. Each scalar is represented internally by a data structure rather than just the number itself.

I suggest you look at PDL. C.f The Perl Data Language (PDL): A Quick Reference Guide

-xdg

Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

Re: Issue on covariance calculation
by ambrus (Abbot) on Apr 13, 2007 at 08:30 UTC

The big advantage of binary format is not that it saves space, but that you can find any element without having to read the whole file. In perl, you can do that with the pack builtin.

I disagree. You can do random access with ASCII so long as you write fixed length records, but it does take about 50% longer to do the conversion. It's the conversion that chews up most of the time, so pack() can be a substantial time saver.

In the test below, writing floats is 10 times faster than writhing ascii. The file is half the size, so that smaller size only accounts for a factor of 2 speed-up. A factor of 5 comes from not doing the ascii conversion.

```use warnings;
use strict;

use Time::HiRes qw( gettimeofday tv_interval );

my \$size = 1000;
my \$last_idx = \$size-1;
my \$float_value;

my \$t0;
\$t0 = [gettimeofday];

open ASC, '>', "asciinum.txt"  or die "can't open ascii file";
for my \$row (0..\$last_idx) {
my @result = ();
my \$row_len = \$row+1;
\$float_value = \$row/10_000 + 1/100_000_000;
for my \$col (0..\$last_idx) {
#\$float_value = \$row/10_000 + \$col/100_000_000;
push @result, \$float_value;
}
print ASC join(',', @result);
}
print STDERR "Ascii write took -- ",tv_interval ( \$t0 ),"\n";
close ASC;

\$t0 = [gettimeofday];

open FLOAT, '>', "floatnum.txt"  or die "can't open float file";
for my \$row (0..\$last_idx) {
my @result = ();
my \$row_len = \$row+1;
\$float_value = \$row/10_000 + 1/100_000_000;
for my \$col (0..\$last_idx) {
#\$float_value = \$row/10_000 + \$col/100_000_000;
push @result, \$float_value;
}
print FLOAT pack("F\$row_len", @result);
}
print STDERR "Float write took -- ",tv_interval ( \$t0 ),"\n";
close FLOAT;

# Gave the following results on my fairly slow machine, with \$size = 1
+000
#  Ascii write took -- 15.078125
#  Float write took -- 2.778736
Re: Issue on covariance calculation
by johngg (Canon) on Apr 13, 2007 at 08:53 UTC
Not relevant to your performance issues but your chomp; statements inside the two for loops are not doing anything useful. Without an argument, chomp acts on \$_ whereas you want it to act on your loop variables. So do chomp \$element; and chomp \$inside_elem in the outer and inner loops respectively.

Cheers,

JohnGG

Thanks for the reply. In fact, the previous version of the code was in that way. Forgot to change when modified it.
```for (@data) {
.
.
.
Thanks for pointing the mistake out.

Create A New User
Node Status?
node history
Node Type: perlquestion [id://609844]
Approved by GrandFather
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (5)
As of 2021-06-24 21:07 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
What does the "s" stand for in "perls"? (Whence perls)

Results (132 votes). Check out past polls.

Notices?