Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Statistical data analysis

by moxliukas (Curate)
on Jun 05, 2002 at 12:56 UTC ( [id://171816]=sourcecode: print w/replies, xml ) Need Help??
Category: Utility Scripts
Author/Contact Info Petras Kudaras, aka moxliukas (moxliukas AT delfi DOT lt)
Description: This short program outputs some statistical analysis data given the input data in two tab separated columns, first one being X column, and the second one Y column. It calculates means, quartiles, median, variance and standard deviation for both sets of data. It also outputs various sumations (X, X^2, Y, Y^2 and X*Y). It then calculates covariance, linear correlaton coeficient and determinance and finally comes up with linear regression equation.
Most of this is simple and straightforward maths and I do hope it will prove useful to someone (well, I have used this script for my statistics lectures).
use strict;
use warnings;
# Copyleft moxliukas 2002
# This program outputs some statistical information
# about the data that is given in two tab separated 
# columns, first one being X column, and second one
# Y column.
unless (@ARGV==1) {
    print "Usage: $0 data_file\n";
    exit (-1);
}
my (@x, @y, $f);
open($f, "<$ARGV[0]");
while(<$f>) {
    /^(.*)\t(.*)$/;
    push @x, $1;
    push @y, $2;
}
close $f;
my $Xn = $#x+1;
my $Yn = $#y+1;
my @X2 = map { $_ * $_ } @x;
my @Y2 = map { $_ * $_ } @y;
my (@XY, $Sx, $Sy, $SX2, $SY2, $SXY, $SX_Xv__Y_Yv);
for (my $i = 0; $i <= $#x; $i++) { push @XY, $x[$i] * $y[$i]; }
foreach my $o (@x) { $Sx += $o; }
foreach my $o (@y) { $Sy += $o; }
foreach my $o (@X2) { $SX2 += $o; }
foreach my $o (@Y2) { $SY2 += $o; }
foreach my $o (@XY) { $SXY += $o; }
my $MX = $Sx / $Xn;
my $MY = $Sy / $Yn;
my $VX = ($SX2 * $Xn - $Sx * $Sx) / ($Xn * $Xn);
my $VY = ($SY2 * $Yn - $Sy * $Sy) / ($Yn * $Yn);
my $SdevX = sqrt($VX);
my $SdevY = sqrt($VY);
my @Xsort = sort { $a <=> $b } @x;
my @Ysort = sort { $a <=> $b } @y;
my $q1 = int(.25 * ($Xn + 1)) - 1;
my $q2 = int(.5 * ($Xn + 1)) - 1;
my $q3 = int(.75 * ($Xn + 1)) - 1;
for(my $i = 0; $i <= $#x; $i++) {
  $SX_Xv__Y_Yv += ($x[$i] - $MX) * ($y[$i] - $MY);
}
my $cov = $SX_Xv__Y_Yv / $Xn;
my $correl = $cov / ($SdevX * $SdevY);
my $determ = $correl * $correl;
my $slope = ($Xn * $SXY - $Sx * $Sy) / ($Xn * $SX2 - $Sx * $Sx);
my $intercept = $MY - $slope * $MX;
#--- output ---#
print <<EOI;
Sums:
--------
X:    $Sx
Y:    $Sy
X^2:    $SX2
Y^2:    $SY2
XY:    $SXY
n:    $Xn
--------
Mean X:        $MX
Mean Y:        $MY
Variance X:    $VX
Variance Y:    $VY
St. dev. X:    $SdevX
St. dev. Y:    $SdevY
--------
Quartiles
--------
Q1 of X:    $Xsort[$q1]
Q2 of X:    $Xsort[$q2]
Q3 of X:    $Xsort[$q3]
Q1 of Y:    $Ysort[$q1]
Q2 of Y:    $Ysort[$q2]
Q3 of Y:    $Ysort[$q3]
--------
Covariance:        $cov
Linear correlation:    $correl
Determinance:        $determ
--------
Linear regression
--------
Slope:        $slope
Intercept:    $intercept
Y = $intercept + X * ($slope)
EOI
Replies are listed 'Best First'.
Re: Statistical data analysis
by Anonymous Monk on Jun 05, 2002 at 17:31 UTC

    Your method of calculating variance can result in loss of significant digits. For example, the following data produces a variance of 0 for X on my machine (causing div-by-zero errors later). The variance should be the same for both X and Y (as the data differs only by a constant).

    999999997	9999997
    999999998	9999998
    999999999	9999999
    

    Better results may be achieved by "centering" your variance calculation about the mean.

    my $VX = 0; $VX += ($_ - $MX)**2 for @x; $VX /= $Xn;

    You are also making far more passes over the data than necessary.

Re: Statistical data analysis
by thraxil (Prior) on Jun 05, 2002 at 15:35 UTC

    if this were a code review i'd point out the following:

    • use strict; and -w
    • your formatting is inconsistent. eg, the line:
      $VX = ($SX2*$Xn - $Sx * $Sx)/($Xn*$Xn);
    • i'd suggest using a HEREDOC to print the output instead of 27 or so print statements.

    otherwise, the code is pretty straightforward, the variables are generally well-named, and the math looks right to me. so a ++ despite my nitpicks.

    anders pearson

      Thanks for the nitpicking. I have changed everything the way you suggested and this hopefully resulted in a more readable code. Thank you again.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: sourcecode [id://171816]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others admiring the Monastery: (5)
As of 2024-04-18 00:18 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found