Beefy Boxes and Bandwidth Generously Provided by pair Networks RobOMonk
The stupid question is the question not asked
 
PerlMonks  

Statistical data analysis

by moxliukas (Curate)
on Jun 05, 2002 at 12:56 UTC ( #171816=sourcecode: print w/ replies, xml ) Need Help??

Category: Utility Scripts
Author/Contact Info Petras Kudaras, aka moxliukas (moxliukas AT delfi DOT lt)
Description: This short program outputs some statistical analysis data given the input data in two tab separated columns, first one being X column, and the second one Y column. It calculates means, quartiles, median, variance and standard deviation for both sets of data. It also outputs various sumations (X, X^2, Y, Y^2 and X*Y). It then calculates covariance, linear correlaton coeficient and determinance and finally comes up with linear regression equation.
Most of this is simple and straightforward maths and I do hope it will prove useful to someone (well, I have used this script for my statistics lectures).
use strict;
use warnings;
# Copyleft moxliukas 2002
# This program outputs some statistical information
# about the data that is given in two tab separated 
# columns, first one being X column, and second one
# Y column.
unless (@ARGV==1) {
    print "Usage: $0 data_file\n";
    exit (-1);
}
my (@x, @y, $f);
open($f, "<$ARGV[0]");
while(<$f>) {
    /^(.*)\t(.*)$/;
    push @x, $1;
    push @y, $2;
}
close $f;
my $Xn = $#x+1;
my $Yn = $#y+1;
my @X2 = map { $_ * $_ } @x;
my @Y2 = map { $_ * $_ } @y;
my (@XY, $Sx, $Sy, $SX2, $SY2, $SXY, $SX_Xv__Y_Yv);
for (my $i = 0; $i <= $#x; $i++) { push @XY, $x[$i] * $y[$i]; }
foreach my $o (@x) { $Sx += $o; }
foreach my $o (@y) { $Sy += $o; }
foreach my $o (@X2) { $SX2 += $o; }
foreach my $o (@Y2) { $SY2 += $o; }
foreach my $o (@XY) { $SXY += $o; }
my $MX = $Sx / $Xn;
my $MY = $Sy / $Yn;
my $VX = ($SX2 * $Xn - $Sx * $Sx) / ($Xn * $Xn);
my $VY = ($SY2 * $Yn - $Sy * $Sy) / ($Yn * $Yn);
my $SdevX = sqrt($VX);
my $SdevY = sqrt($VY);
my @Xsort = sort { $a <=> $b } @x;
my @Ysort = sort { $a <=> $b } @y;
my $q1 = int(.25 * ($Xn + 1)) - 1;
my $q2 = int(.5 * ($Xn + 1)) - 1;
my $q3 = int(.75 * ($Xn + 1)) - 1;
for(my $i = 0; $i <= $#x; $i++) {
  $SX_Xv__Y_Yv += ($x[$i] - $MX) * ($y[$i] - $MY);
}
my $cov = $SX_Xv__Y_Yv / $Xn;
my $correl = $cov / ($SdevX * $SdevY);
my $determ = $correl * $correl;
my $slope = ($Xn * $SXY - $Sx * $Sy) / ($Xn * $SX2 - $Sx * $Sx);
my $intercept = $MY - $slope * $MX;
#--- output ---#
print <<EOI;
Sums:
--------
X:    $Sx
Y:    $Sy
X^2:    $SX2
Y^2:    $SY2
XY:    $SXY
n:    $Xn
--------
Mean X:        $MX
Mean Y:        $MY
Variance X:    $VX
Variance Y:    $VY
St. dev. X:    $SdevX
St. dev. Y:    $SdevY
--------
Quartiles
--------
Q1 of X:    $Xsort[$q1]
Q2 of X:    $Xsort[$q2]
Q3 of X:    $Xsort[$q3]
Q1 of Y:    $Ysort[$q1]
Q2 of Y:    $Ysort[$q2]
Q3 of Y:    $Ysort[$q3]
--------
Covariance:        $cov
Linear correlation:    $correl
Determinance:        $determ
--------
Linear regression
--------
Slope:        $slope
Intercept:    $intercept
Y = $intercept + X * ($slope)
EOI

Comment on Statistical data analysis
Download Code
Re: Statistical data analysis
by thraxil (Prior) on Jun 05, 2002 at 15:35 UTC

    if this were a code review i'd point out the following:

    • use strict; and -w
    • your formatting is inconsistent. eg, the line:
      $VX = ($SX2*$Xn - $Sx * $Sx)/($Xn*$Xn);
    • i'd suggest using a HEREDOC to print the output instead of 27 or so print statements.

    otherwise, the code is pretty straightforward, the variables are generally well-named, and the math looks right to me. so a ++ despite my nitpicks.

    anders pearson

      Thanks for the nitpicking. I have changed everything the way you suggested and this hopefully resulted in a more readable code. Thank you again.
Re: Statistical data analysis
by Anonymous Monk on Jun 05, 2002 at 17:31 UTC

    Your method of calculating variance can result in loss of significant digits. For example, the following data produces a variance of 0 for X on my machine (causing div-by-zero errors later). The variance should be the same for both X and Y (as the data differs only by a constant).

    999999997	9999997
    999999998	9999998
    999999999	9999999
    

    Better results may be achieved by "centering" your variance calculation about the mean.

    my $VX = 0; $VX += ($_ - $MX)**2 for @x; $VX /= $Xn;

    You are also making far more passes over the data than necessary.

Back to Code Catacombs

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: sourcecode [id://171816]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (8)
As of 2014-04-24 00:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (557 votes), past polls