
Stats: Testing whether data is normally (Gaussian) distributed

by andye (Curate)
on Apr 27, 2007 at 19:12 UTC (#612443)
andye has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I have some data. I'd like to check whether each of my datasets is normally distributed.

Because of virtues 1 and 2, I'd like to find a module that implements one of the normality test algorithms, rather than doing it myself.

I've looked on CPAN but can't find anything suitable. The R library has some suitable routines in nortest (PDF) but I'm not looking forward to trying to use those from Perl (Statistics::R) or writing a C wrapper to use them from PDL (PDL::PP).

Or of course I could just code the routine myself... my stats are pretty rusty, though, and there's a fair chance I'd screw it up somehow.

Does anyone have any easier options for me?

Best wishes, and happy weekend to one and all,

Re: Stats: Testing whether data is normally (Gaussian) distributed
by lin0 (Curate) on Apr 27, 2007 at 20:11 UTC
      Currently the ChiSquare module assumes that the data you're testing is meant to be evenly distributed. I've wanted to make that configurable for a long time, so if anyone can come up with a nice way of doing it, I would be delighted to apply your patch.

      I also need to hack on it to support more degrees of freedom.
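To illustrate what a configurable expected distribution could look like, here is a minimal sketch of the chi-square statistic with caller-supplied bin probabilities. `chi_square_stat` is a hypothetical helper written for this example and is not part of the ChiSquare module's interface. For a normality check, the probabilities would be the normal-curve area in each bin; the example uses bins split at z = -1, 0 and 1.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper, NOT the ChiSquare module's API.
# $observed: array ref of counts per bin.
# $probs:    array ref of expected probabilities per bin (should sum to 1).
sub chi_square_stat {
    my ($observed, $probs) = @_;
    die "bin count mismatch\n" unless @$observed == @$probs;

    my $n    = sum(@$observed);
    my $chi2 = 0;
    for my $i (0 .. $#$observed) {
        my $expected = $n * $probs->[$i];
        die "expected count must be positive\n" if $expected <= 0;
        $chi2 += ($observed->[$i] - $expected) ** 2 / $expected;
    }
    return $chi2;
}

# Counts of standardised values falling in (-inf,-1], (-1,0], (0,1], (1,inf),
# compared with the standard normal area of each bin.
my @observed = (14, 37, 33, 16);
my @probs    = (0.1587, 0.3413, 0.3413, 0.1587);
printf "chi-square = %.3f\n", chi_square_stat(\@observed, \@probs);
```

The statistic still has to be compared against a chi-square critical value. If the mean and standard deviation were estimated from the same data, the degrees of freedom are the number of bins minus 3 rather than minus 1.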

Re: Stats: Testing whether data is normally (Gaussian) distributed
by moklevat (Priest) on Apr 27, 2007 at 21:25 UTC
    Hi andye,

    It looks like it may be a trade-off between calling R or coding the algorithm yourself, and a smashing opportunity to write a module :-). Among the options for tests of normality, you might consider Shapiro-Wilk (there is a link to a Fortran version of the algorithm in the Wikipedia article) and the popular Kolmogorov-Smirnov test. The K-S is generalizable to many distributions, but may be more of a pain to implement than Shapiro-Wilk. I would recommend steering clear of the Anderson-Darling test, as it is overly sensitive with sample sizes greater than about 25 (as mentioned in Wikipedia).

    If you do roll your own, PDL is great for stats.
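For a sense of how much work rolling your own would be, here is a rough sketch of the Kolmogorov-Smirnov statistic against a normal distribution, in plain Perl rather than PDL. It assumes Perl 5.22 or later, where `POSIX` exports `erfc`. `normal_cdf` and `ks_normal_statistic` are hypothetical names made up for this example. The mean and standard deviation are estimated from the sample, which makes this the Lilliefors variant of the test.

```perl
use strict;
use warnings;
use List::Util qw(sum max);
use POSIX qw(erfc);    # assumption: Perl 5.22 or later

# Standard normal CDF via the complementary error function.
sub normal_cdf {
    my ($z) = @_;
    return 0.5 * erfc(-$z / sqrt(2));
}

# Hypothetical helper: largest gap between the empirical CDF of the sample
# and a normal CDF with the sample's own mean and standard deviation.
sub ks_normal_statistic {
    my @x = sort { $a <=> $b } @_;
    my $n = @x;
    die "need at least two values\n" if $n < 2;

    my $mean = sum(@x) / $n;
    my $var  = sum(map { ($_ - $mean) ** 2 } @x) / ($n - 1);
    die "data has zero variance\n" if $var == 0;
    my $sd = sqrt($var);

    my $d = 0;
    for my $i (0 .. $n - 1) {
        my $f = normal_cdf(($x[$i] - $mean) / $sd);
        # The empirical CDF steps from i/n to (i+1)/n at x[i]; check both sides.
        $d = max($d, ($i + 1) / $n - $f, $f - $i / $n);
    }
    return $d;
}

my @data = (4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.4, 5.1);
printf "D = %.4f (n = %d)\n", ks_normal_statistic(@data), scalar @data;
```

This only produces the statistic D. Because the parameters come from the data, the standard K-S critical values are too lenient, so D should be compared against Lilliefors critical values instead.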

Re: Stats: Testing whether data is normally (Gaussian) distributed
by DigitalKitty (Parson) on Apr 27, 2007 at 22:10 UTC
    Hi andye.

    I'm not sure if you will derive significant benefit from my suggestion but I felt compelled to offer:

    The OmegaHat Project

    If you find that your data is not normally distributed, you could use a non-parametric test (e.g. Kruskal-Wallis, Wilcoxon-Mann-Whitney or Kolmogorov-Smirnov). Don't hesitate to ask for help with the statistical tests if the need arises.

Re: Stats: Testing whether data is normally (Gaussian) distributed
by jbullock35 (Hermit) on Apr 28, 2007 at 10:35 UTC

    If you know R, it's pretty easy to call it from Perl and get the results that you need. You don't need Statistics::R. It's much easier to just adapt the code that tmoertel provided in this post -- I've done similar things several times.
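As a rough idea of what calling R directly can look like, here is a minimal sketch. It is not tmoertel's code. It assumes an `Rscript` executable on the PATH and uses R's built-in `shapiro.test`, which accepts between 3 and 5000 values. `shapiro_r_code` and `shapiro_wilk` are hypothetical helper names.

```perl
use strict;
use warnings;

# Hypothetical helper: build a one-line R program that runs shapiro.test()
# on the given numbers and prints the W statistic and the p-value.
sub shapiro_r_code {
    my @x = @_;
    my $vec = join ',', map { sprintf '%.17g', $_ } @x;
    return "r <- shapiro.test(c($vec)); cat(r\$statistic, r\$p.value)";
}

# Hypothetical helper: run that program through Rscript (assumed to be on
# PATH) and return (W, p-value). The list form of open avoids shell quoting.
sub shapiro_wilk {
    my @x = @_;
    open my $r, '-|', 'Rscript', '-e', shapiro_r_code(@x)
        or die "cannot start Rscript: $!\n";
    my $out = do { local $/; <$r> };
    close $r or die "Rscript failed (exit status $?)\n";
    my ($w, $p) = split ' ', $out;
    return ($w, $p);
}

# Show the generated R code; the actual call needs R installed:
#   my ($w, $p) = shapiro_wilk(4.1, 5.3, 4.8, 5.9, 5.0);
print shapiro_r_code(4.1, 5.3, 4.8, 5.9, 5.0), "\n";
```

Passing the data on the command line keeps the sketch short, but a large dataset could exceed the operating system's argument length limit; writing the values to a temporary file and reading it from R would be the safer route there.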

Re: Stats: Testing whether data is normally (Gaussian) distributed
by andye (Curate) on Apr 30, 2007 at 14:22 UTC
