Category: 
Utility Scripts 
Author/Contact Info 
Petras Kudaras, aka moxliukas (moxliukas AT delfi DOT lt) 
Description: 
This short program outputs some statistical analysis data given the input data in two tab separated columns, first one being X column, and the second one Y column. It calculates means, quartiles, median, variance and standard deviation for both sets of data. It also outputs various sumations (X, X^2, Y, Y^2 and X*Y). It then calculates covariance, linear correlaton coeficient and determinance and finally comes up with linear regression equation.
Most of this is simple and straightforward maths and I do hope it will prove useful to someone (well, I have used this script for my statistics lectures). 
use strict;
use warnings;
# Copyleft moxliukas 2002
# This program outputs some statistical information
# about the data that is given in two tab separated
# columns, first one being X column, and second one
# Y column.
unless (@ARGV==1) {
print "Usage: $0 data_file\n";
exit (1);
}
my (@x, @y, $f);
open($f, "<$ARGV[0]");
while(<$f>) {
/^(.*)\t(.*)$/;
push @x, $1;
push @y, $2;
}
close $f;
my $Xn = $#x+1;
my $Yn = $#y+1;
my @X2 = map { $_ * $_ } @x;
my @Y2 = map { $_ * $_ } @y;
my (@XY, $Sx, $Sy, $SX2, $SY2, $SXY, $SX_Xv__Y_Yv);
for (my $i = 0; $i <= $#x; $i++) { push @XY, $x[$i] * $y[$i]; }
foreach my $o (@x) { $Sx += $o; }
foreach my $o (@y) { $Sy += $o; }
foreach my $o (@X2) { $SX2 += $o; }
foreach my $o (@Y2) { $SY2 += $o; }
foreach my $o (@XY) { $SXY += $o; }
my $MX = $Sx / $Xn;
my $MY = $Sy / $Yn;
my $VX = ($SX2 * $Xn  $Sx * $Sx) / ($Xn * $Xn);
my $VY = ($SY2 * $Yn  $Sy * $Sy) / ($Yn * $Yn);
my $SdevX = sqrt($VX);
my $SdevY = sqrt($VY);
my @Xsort = sort { $a <=> $b } @x;
my @Ysort = sort { $a <=> $b } @y;
my $q1 = int(.25 * ($Xn + 1))  1;
my $q2 = int(.5 * ($Xn + 1))  1;
my $q3 = int(.75 * ($Xn + 1))  1;
for(my $i = 0; $i <= $#x; $i++) {
$SX_Xv__Y_Yv += ($x[$i]  $MX) * ($y[$i]  $MY);
}
my $cov = $SX_Xv__Y_Yv / $Xn;
my $correl = $cov / ($SdevX * $SdevY);
my $determ = $correl * $correl;
my $slope = ($Xn * $SXY  $Sx * $Sy) / ($Xn * $SX2  $Sx * $Sx);
my $intercept = $MY  $slope * $MX;
# output #
print <<EOI;
Sums:

X: $Sx
Y: $Sy
X^2: $SX2
Y^2: $SY2
XY: $SXY
n: $Xn

Mean X: $MX
Mean Y: $MY
Variance X: $VX
Variance Y: $VY
St. dev. X: $SdevX
St. dev. Y: $SdevY

Quartiles

Q1 of X: $Xsort[$q1]
Q2 of X: $Xsort[$q2]
Q3 of X: $Xsort[$q3]
Q1 of Y: $Ysort[$q1]
Q2 of Y: $Ysort[$q2]
Q3 of Y: $Ysort[$q3]

Covariance: $cov
Linear correlation: $correl
Determinance: $determ

Linear regression

Slope: $slope
Intercept: $intercept
Y = $intercept + X * ($slope)
EOI

Re: Statistical data analysis by thraxil (Prior) on Jun 05, 2002 at 15:35 UTC 
if this were a code review i'd point out the following:
otherwise, the code is pretty straightforward, the variables are generally wellnamed, and the math looks right to me. so a ++ despite my nitpicks.
anders pearson
 [reply] 

Thanks for the nitpicking. I have changed everything the way you suggested and this hopefully resulted in a more readable code.
Thank you again.
 [reply] 
Re: Statistical data analysis by Anonymous Monk on Jun 05, 2002 at 17:31 UTC 
Your method of calculating variance can result in loss of significant
digits. For example, the following data produces a variance of 0 for
X on my machine (causing divbyzero errors later). The variance
should be the same for both X and Y (as the data differs only by a
constant).
999999997 9999997
999999998 9999998
999999999 9999999
Better results may be achieved by "centering" your variance
calculation about the mean.
my $VX = 0;
$VX += ($_  $MX)**2 for @x;
$VX /= $Xn;
You are also making far more passes over the data than necessary.
 [reply] [d/l] 
Back to Code Catacombs

