Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Statistical data analysis

by moxliukas (Curate)
on Jun 05, 2002 at 12:56 UTC ( #171816=sourcecode: print w/ replies, xml ) Need Help??

Category: Utility Scripts
Author/Contact Info Petras Kudaras, aka moxliukas (moxliukas AT delfi DOT lt)
Description: This short program outputs some statistical analysis data given the input data in two tab separated columns, first one being X column, and the second one Y column. It calculates means, quartiles, median, variance and standard deviation for both sets of data. It also outputs various sumations (X, X^2, Y, Y^2 and X*Y). It then calculates covariance, linear correlaton coeficient and determinance and finally comes up with linear regression equation.
Most of this is simple and straightforward maths and I do hope it will prove useful to someone (well, I have used this script for my statistics lectures).
use strict;
use warnings;
# Copyleft moxliukas 2002
# This program outputs some statistical information
# about the data that is given in two tab separated 
# columns, first one being X column, and second one
# Y column.
unless (@ARGV==1) {
    print "Usage: $0 data_file\n";
    exit (-1);
}
my (@x, @y, $f);
open($f, "<$ARGV[0]");
while(<$f>) {
    /^(.*)\t(.*)$/;
    push @x, $1;
    push @y, $2;
}
close $f;
my $Xn = $#x+1;
my $Yn = $#y+1;
my @X2 = map { $_ * $_ } @x;
my @Y2 = map { $_ * $_ } @y;
my (@XY, $Sx, $Sy, $SX2, $SY2, $SXY, $SX_Xv__Y_Yv);
for (my $i = 0; $i <= $#x; $i++) { push @XY, $x[$i] * $y[$i]; }
foreach my $o (@x) { $Sx += $o; }
foreach my $o (@y) { $Sy += $o; }
foreach my $o (@X2) { $SX2 += $o; }
foreach my $o (@Y2) { $SY2 += $o; }
foreach my $o (@XY) { $SXY += $o; }
my $MX = $Sx / $Xn;
my $MY = $Sy / $Yn;
my $VX = ($SX2 * $Xn - $Sx * $Sx) / ($Xn * $Xn);
my $VY = ($SY2 * $Yn - $Sy * $Sy) / ($Yn * $Yn);
my $SdevX = sqrt($VX);
my $SdevY = sqrt($VY);
my @Xsort = sort { $a <=> $b } @x;
my @Ysort = sort { $a <=> $b } @y;
my $q1 = int(.25 * ($Xn + 1)) - 1;
my $q2 = int(.5 * ($Xn + 1)) - 1;
my $q3 = int(.75 * ($Xn + 1)) - 1;
for(my $i = 0; $i <= $#x; $i++) {
  $SX_Xv__Y_Yv += ($x[$i] - $MX) * ($y[$i] - $MY);
}
my $cov = $SX_Xv__Y_Yv / $Xn;
my $correl = $cov / ($SdevX * $SdevY);
my $determ = $correl * $correl;
my $slope = ($Xn * $SXY - $Sx * $Sy) / ($Xn * $SX2 - $Sx * $Sx);
my $intercept = $MY - $slope * $MX;
#--- output ---#
print <<EOI;
Sums:
--------
X:    $Sx
Y:    $Sy
X^2:    $SX2
Y^2:    $SY2
XY:    $SXY
n:    $Xn
--------
Mean X:        $MX
Mean Y:        $MY
Variance X:    $VX
Variance Y:    $VY
St. dev. X:    $SdevX
St. dev. Y:    $SdevY
--------
Quartiles
--------
Q1 of X:    $Xsort[$q1]
Q2 of X:    $Xsort[$q2]
Q3 of X:    $Xsort[$q3]
Q1 of Y:    $Ysort[$q1]
Q2 of Y:    $Ysort[$q2]
Q3 of Y:    $Ysort[$q3]
--------
Covariance:        $cov
Linear correlation:    $correl
Determinance:        $determ
--------
Linear regression
--------
Slope:        $slope
Intercept:    $intercept
Y = $intercept + X * ($slope)
EOI

Comment on Statistical data analysis
Download Code
Re: Statistical data analysis
by thraxil (Prior) on Jun 05, 2002 at 15:35 UTC

    if this were a code review i'd point out the following:

    • use strict; and -w
    • your formatting is inconsistent. eg, the line:
      $VX = ($SX2*$Xn - $Sx * $Sx)/($Xn*$Xn);
    • i'd suggest using a HEREDOC to print the output instead of 27 or so print statements.

    otherwise, the code is pretty straightforward, the variables are generally well-named, and the math looks right to me. so a ++ despite my nitpicks.

    anders pearson

      Thanks for the nitpicking. I have changed everything the way you suggested and this hopefully resulted in a more readable code. Thank you again.
Re: Statistical data analysis
by Anonymous Monk on Jun 05, 2002 at 17:31 UTC

    Your method of calculating variance can result in loss of significant digits. For example, the following data produces a variance of 0 for X on my machine (causing div-by-zero errors later). The variance should be the same for both X and Y (as the data differs only by a constant).

    999999997	9999997
    999999998	9999998
    999999999	9999999
    

    Better results may be achieved by "centering" your variance calculation about the mean.

    my $VX = 0; $VX += ($_ - $MX)**2 for @x; $VX /= $Xn;

    You are also making far more passes over the data than necessary.

Back to Code Catacombs

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: sourcecode [id://171816]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (6)
As of 2014-07-30 23:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (241 votes), past polls