Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot

Statistical data analysis

by moxliukas (Curate)
on Jun 05, 2002 at 12:56 UTC ( #171816=sourcecode: print w/replies, xml ) Need Help??
Category: Utility Scripts
Author/Contact Info Petras Kudaras, aka moxliukas (moxliukas AT delfi DOT lt)
Description: This short program outputs some statistical analysis data given the input data in two tab separated columns, first one being X column, and the second one Y column. It calculates means, quartiles, median, variance and standard deviation for both sets of data. It also outputs various sumations (X, X^2, Y, Y^2 and X*Y). It then calculates covariance, linear correlaton coeficient and determinance and finally comes up with linear regression equation.
Most of this is simple and straightforward maths and I do hope it will prove useful to someone (well, I have used this script for my statistics lectures).
use strict;
use warnings;
# Copyleft moxliukas 2002
# This program outputs some statistical information
# about the data that is given in two tab separated 
# columns, first one being X column, and second one
# Y column.
unless (@ARGV==1) {
    print "Usage: $0 data_file\n";
    exit (-1);
my (@x, @y, $f);
open($f, "<$ARGV[0]");
while(<$f>) {
    push @x, $1;
    push @y, $2;
close $f;
my $Xn = $#x+1;
my $Yn = $#y+1;
my @X2 = map { $_ * $_ } @x;
my @Y2 = map { $_ * $_ } @y;
my (@XY, $Sx, $Sy, $SX2, $SY2, $SXY, $SX_Xv__Y_Yv);
for (my $i = 0; $i <= $#x; $i++) { push @XY, $x[$i] * $y[$i]; }
foreach my $o (@x) { $Sx += $o; }
foreach my $o (@y) { $Sy += $o; }
foreach my $o (@X2) { $SX2 += $o; }
foreach my $o (@Y2) { $SY2 += $o; }
foreach my $o (@XY) { $SXY += $o; }
my $MX = $Sx / $Xn;
my $MY = $Sy / $Yn;
my $VX = ($SX2 * $Xn - $Sx * $Sx) / ($Xn * $Xn);
my $VY = ($SY2 * $Yn - $Sy * $Sy) / ($Yn * $Yn);
my $SdevX = sqrt($VX);
my $SdevY = sqrt($VY);
my @Xsort = sort { $a <=> $b } @x;
my @Ysort = sort { $a <=> $b } @y;
my $q1 = int(.25 * ($Xn + 1)) - 1;
my $q2 = int(.5 * ($Xn + 1)) - 1;
my $q3 = int(.75 * ($Xn + 1)) - 1;
for(my $i = 0; $i <= $#x; $i++) {
  $SX_Xv__Y_Yv += ($x[$i] - $MX) * ($y[$i] - $MY);
my $cov = $SX_Xv__Y_Yv / $Xn;
my $correl = $cov / ($SdevX * $SdevY);
my $determ = $correl * $correl;
my $slope = ($Xn * $SXY - $Sx * $Sy) / ($Xn * $SX2 - $Sx * $Sx);
my $intercept = $MY - $slope * $MX;
#--- output ---#
print <<EOI;
X:    $Sx
Y:    $Sy
X^2:    $SX2
Y^2:    $SY2
XY:    $SXY
n:    $Xn
Mean X:        $MX
Mean Y:        $MY
Variance X:    $VX
Variance Y:    $VY
St. dev. X:    $SdevX
St. dev. Y:    $SdevY
Q1 of X:    $Xsort[$q1]
Q2 of X:    $Xsort[$q2]
Q3 of X:    $Xsort[$q3]
Q1 of Y:    $Ysort[$q1]
Q2 of Y:    $Ysort[$q2]
Q3 of Y:    $Ysort[$q3]
Covariance:        $cov
Linear correlation:    $correl
Determinance:        $determ
Linear regression
Slope:        $slope
Intercept:    $intercept
Y = $intercept + X * ($slope)
Replies are listed 'Best First'.
Re: Statistical data analysis
by Anonymous Monk on Jun 05, 2002 at 17:31 UTC

    Your method of calculating variance can result in loss of significant digits. For example, the following data produces a variance of 0 for X on my machine (causing div-by-zero errors later). The variance should be the same for both X and Y (as the data differs only by a constant).

    999999997	9999997
    999999998	9999998
    999999999	9999999

    Better results may be achieved by "centering" your variance calculation about the mean.

    my $VX = 0; $VX += ($_ - $MX)**2 for @x; $VX /= $Xn;

    You are also making far more passes over the data than necessary.

Re: Statistical data analysis
by thraxil (Prior) on Jun 05, 2002 at 15:35 UTC

    if this were a code review i'd point out the following:

    • use strict; and -w
    • your formatting is inconsistent. eg, the line:
      $VX = ($SX2*$Xn - $Sx * $Sx)/($Xn*$Xn);
    • i'd suggest using a HEREDOC to print the output instead of 27 or so print statements.

    otherwise, the code is pretty straightforward, the variables are generally well-named, and the math looks right to me. so a ++ despite my nitpicks.

    anders pearson

      Thanks for the nitpicking. I have changed everything the way you suggested and this hopefully resulted in a more readable code. Thank you again.
Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: sourcecode [id://171816]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (5)
As of 2018-01-16 20:26 GMT
Find Nodes?
    Voting Booth?
    How did you see in the new year?

    Results (191 votes). Check out past polls.