PerlMonks  

Stats: Testing whether data is normally (Gaussian) distributed

by andye (Curate)
on Apr 27, 2007 at 19:12 UTC (#612443)
andye has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I have some data. I'd like to check whether each of my datasets is normally distributed.

Because of virtues 1 and 2, I'd like to find a module that implements one of the normality test algorithms, rather than doing it myself.

I've looked on CPAN but can't find anything suitable. The R library has some suitable routines in nortest (PDF) but I'm not looking forward to trying to use those from Perl (Statistics::R) or writing a C wrapper to use them from PDL (PDL::PP).

Or of course I could just code the routine myself... my stats is pretty rusty though and there's a fair chance I'd screw it up somehow.

Does anyone have any easier options for me?

Best wishes, and happy weekend to one and all,
andye

Re: Stats: Testing whether data is normally (Gaussian) distributed
by lin0 (Curate) on Apr 27, 2007 at 20:11 UTC
      Currently the ChiSquare module assumes that the data you're testing is meant to be evenly distributed. I've wanted to make that configurable for a long time, so if anyone can come up with a nice way of doing it, I would be delighted to apply your patch.

      I also need to hack on it to support more degrees of freedom.
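One way around the even-distribution assumption described above is to bin the data by quantiles of a fitted normal, so that every bin has the same expected count if the data really are normal. Below is a minimal sketch of that idea, written in Python rather than Perl purely for illustration. The function name `chi_square_normal` and the default of five bins are assumptions, not part of any module mentioned in this thread. Taking the degrees of freedom as bins minus three (one for the total, two for the estimated mean and standard deviation) is the usual convention and is assumed here.

```python
from statistics import NormalDist, mean, stdev

def chi_square_normal(data, n_bins=5):
    """Chi-square statistic for binned data against a fitted normal.

    Hypothetical helper: bins are chosen so that each one has equal
    probability under the fitted normal, hence equal expected counts.
    Needs n_bins >= 4 for a positive number of degrees of freedom.
    """
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    # Interior bin edges at the 1/n_bins, 2/n_bins, ... quantiles.
    edges = [fitted.inv_cdf(i / n_bins) for i in range(1, n_bins)]
    observed = [0] * n_bins
    for x in data:
        # The bin index is the number of edges lying strictly below x.
        observed[sum(1 for e in edges if x > e)] += 1
    expected = n / n_bins
    stat = sum((o - expected) ** 2 / expected for o in observed)
    dof = n_bins - 3  # assumed convention: two parameters were estimated
    return stat, dof
```

The returned statistic still has to be compared against a chi-square table for the given degrees of freedom, and the expected count per bin should not be too small (a common rule of thumb is at least five).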

Re: Stats: Testing whether data is normally (Gaussian) distributed
by moklevat (Priest) on Apr 27, 2007 at 21:25 UTC
    Hi andye,

    It looks like it may be a trade-off between calling R and coding the algorithm yourself, and a smashing opportunity to write a module :-). Among the options for tests of normality you might consider Shapiro-Wilk (there is a link to a Fortran version of the algorithm in the Wikipedia article) and the popular Kolmogorov-Smirnov test. The K-S test generalizes to many distributions, but may be more of a pain to implement than Shapiro-Wilk. I would recommend steering clear of the Anderson-Darling test, as it is overly sensitive with sample sizes greater than about 25 (as mentioned in Wikipedia).

    If you do roll your own, PDL is great for stats.
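For anyone rolling their own, the Kolmogorov-Smirnov statistic itself is short: it is the largest gap between the empirical distribution function of the sample and the CDF of the reference distribution. Here is a minimal sketch in Python (not Perl, for illustration only), assuming the normal's mean and standard deviation are estimated from the same sample. The function name `ks_statistic_normal` is hypothetical.

```python
from statistics import NormalDist, mean, stdev

def ks_statistic_normal(data):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(data)
    n = len(xs)
    # Assumption: parameters are estimated from the sample itself.
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check
        # the gap on both sides of the step.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d
```

Because the parameters come from the data, the standard K-S critical values are too lenient. The Lilliefors variant of the test uses the same statistic with adjusted critical values, so the result should be compared against a Lilliefors table rather than the plain K-S one.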

Re: Stats: Testing whether data is normally (Gaussian) distributed
by DigitalKitty (Parson) on Apr 27, 2007 at 22:10 UTC
    Hi andye.

    I'm not sure if you will derive significant benefit from my suggestion but I felt compelled to offer:

    The OmegaHat Project

    If you find that your data is not normally distributed, you could use a non-parametric test (e.g. Kruskal-Wallis, Wilcoxon-Mann-Whitney, or Kolmogorov-Smirnov). Don't hesitate to ask for help with the statistical tests if the need arises.

    Thanks,
    Katie
Re: Stats: Testing whether data is normally (Gaussian) distributed
by jbullock35 (Hermit) on Apr 28, 2007 at 10:35 UTC

    If you know R, it's pretty easy to call it from Perl and get the results that you need. You don't need Statistics::R. It's much easier to just adapt the code that tmoertel provided in this post -- I've done similar things several times.
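The general pattern of shelling out to R is small enough to sketch. The example below is in Python rather than Perl, purely to illustrate the shape of the approach, and it is not the code from the post referred to above. It assumes the `Rscript` executable is on the PATH and uses R's built-in `shapiro.test`. The helper names `shapiro_r_command` and `shapiro_p_value` are hypothetical.

```python
import shutil
import subprocess

def shapiro_r_command(values):
    """Build an Rscript command line that prints a Shapiro-Wilk p-value."""
    vector = ", ".join(repr(float(v)) for v in values)
    expr = f"cat(shapiro.test(c({vector}))$p.value)"
    return ["Rscript", "-e", expr]

def shapiro_p_value(values):
    """Run the test in R; returns None when Rscript is not installed."""
    if shutil.which("Rscript") is None:
        return None
    result = subprocess.run(
        shapiro_r_command(values),
        capture_output=True,
        text=True,
        check=True,  # raise if R exits with an error
    )
    return float(result.stdout)
```

The same idea carries over to Perl by building the R expression as a string and reading the output of the `Rscript` process. Note that R's `shapiro.test` only accepts samples of between 3 and 5000 values.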

Re: Stats: Testing whether data is normally (Gaussian) distributed
by andye (Curate) on Apr 30, 2007 at 14:22 UTC
