Re: standard deviation accuracy question
by NetWallah (Canon) on Jul 22, 2005 at 13:47 UTC

 [reply] 

Just to get this info in PM for future lookups in case that link dies:
1. The binomial standard deviation
The binomial standard deviation applies to events with two outcomes: win or lose. For example, betting on heads in coin tossing can lead to win (the appearance of heads) or loss (the appearance of the opposite; tails, in this case). The binomial standard deviation is calculated by the following formula:
Standard deviation = SQR{(N*p*(1p)}
That is, the square root of: the number of trials (events) N, multiplied by the probability p, multiplied by the opposite probability (or 1 minus p).
(where p is the probability of appearance and N represents the number of trials).
Suppose we toss a coin 100 times (N=100). The probability of heads is p=1/2=0.5. The standard deviation is SQR{100 * 0.5 * 0.5} = SQR(100 * .25) =SQR(25) = 5. The expected number of heads in 100 tosses is 0.5 * 100 = 50. The rule of normal probability proves that in 68.2% of the cases, the number of heads will fall within one standard deviation from the number of expected successes (50). That is, if we repeat 1000 times the event of tossing a coin 100 times, in 682 cases we'll encounter a number of heads between 45 and 55.
2. The statistical standard deviation
There is no formula to calculate the statistics standard deviation directly (?) That's what they told you in school. That's what they say in other public places with the selfproclaimed goal of education. Only an algorithm can lead to the standard deviation of a data series. Indeed, the algorithm is always available. The following are the steps of the algorithm implemented in my freeware Super Formula. Sum up data; calculate the mean average (sum total divided by the number of elements); deduct each element of the collection from the average; raise each difference to the power of 2; add up the squared differences; divide the new sum total by the number of elements in the data series; the result represents the variance; the square root of the variance represents the famous standard deviation.
A data series like 1, 2, 3, 6 has a mean average (mu) equal to:
? = (1+2+3+6)/4=3.
The differences from the mean are: 2, 1, 0, +3. The variance (sigma squared) is the measurement of the squared deviations. The variance is calculated as:
?˛ = {(2)2 + (1)2 + 0 + 32}/4=14/4=3.5.
Finally, the standard deviation (sigma) is equal to the positive square root of the variance:
? = SQR(3.5)=1.87.
Nevertheless, there are formulae (plural, indeed) to calculate the statistical deviation in advance. There is a dominant deviation parameter in all the stochastic (probabilistic) events. In fact, all events are stochastic, since randomness is present in everythingthereis. Nothingthereis can exist with absolute certainty (see the mathematics of the absurdity of absolute certainty: www.saliu.com/formula.htm). The elements of a stochastic phenomenon deviate from one another following mathematical rules. The difference is in the probability of the event (phenomenon). The probability then determines subsequent parameters, such as volatility, standard deviation, FFG deviation, etc.
In 2003 I announced that I had discovered a formula for a very important measure in the fluctuation of probability events: FFG deviation. See “New pairing research”. Soon thereafter I have been bombarded with requests to present the formula for FFG deviation and the statistical standard deviation. Of course, I was asked (in strong terms sometimes) to release also free software to accompany the formulae calculations. The requests were also presented in public forums, sometimes strongly worded.
At this time, I do not publish the formulae to calculate the FFG deviation and the statistical standard deviation. Such an act would serve people I do not want to serve. They belong to the following categories: gambling developers and high rollers; lottery systems and software developers; stock traders. I know exactly whom I am talking about. I have received many a message from them. They inundated my hou' with correspondence, including postal mail. They would be the ones that would charge serious money out of my effort. The vast majority of people do not really need to know exactly all the formulas involved in standard deviation calculations. Suffice to say that my software does incorporate standard deviation calculations. Also, the greatest random number/combination generator — IonSaliuGenerator — makes extraordinarily good usage of the standard deviation.
 [reply] 
Re: standard deviation accuracy question
by thor (Priest) on Jul 22, 2005 at 13:52 UTC

Your function is doing it right. From reading the documentation on Excel's STDEV function, it says that it "estimates the standard deviation based on a sample"...whatever the hell that means. In perusing the functions a little further, there's a function called STDEVP whose description is "calculates the standard deviation based on the entire population given as arguments". Seems like that should be the standard one (no pun intended), but that's the way it goes.
thor
Feel the white light, the light within
Be your own disciple, fan the sparks of will
For all of us waiting, your kingdom will come
 [reply] 

The sample standard deviation divides by N1. The population standard deviation divides by N. The N  1 version is used more often, because in real life you are working with samples. That's why it is the default in Excel.
 [reply] 

Maybe I'm being dense, but if you you know all of the members in a set, why would you want to use a number that's nonrepresentative of that set? If you wouldn't mind, a concrete example might be of assistance. I really want to understand this...
thor
Feel the white light, the light within
Be your own disciple, fan the sparks of will
For all of us waiting, your kingdom will come
 [reply] 




Re: standard deviation accuracy question
by dave_the_m (Monsignor) on Jul 22, 2005 at 13:33 UTC

When you say you get different results, do you mean slightly different results, or wildly different results?
Or to put it another way, is this thread to be a discussion of the details of accumulated rounding errors and limits of precision, or a discussion of what is the correct algorithm to calculate std deviations?
Can you supply a short list of sample data and the output you get with your script and with excel?
Dave.  [reply] 
Re: standard deviation accuracy question
by jpeg (Chaplain) on Jul 22, 2005 at 14:06 UTC

How are you calcing standard deviation in Excel? There's STDEV, STDEVA, STDEVP, STDEVPA, off the top of my head.
Excel's STDEVP uses the n1 as the denominator, so you should probably use
my $elements = $#ar;
just to compare apples and apples.
 [reply] [d/l] 
Re: standard deviation accuracy question
by tye (Sage) on Jul 22, 2005 at 13:46 UTC

It would probably help a lot if you didn't refuse to tell us what the different values are.
Try using my $elements = @$arr  1;. I believe I've seen "N1" optionally used in the demoninator. I also suspect that $_*$_ will be more accurate than $_**2, but I doubt that would matter enough to cause you to write a node.
 [reply] [d/l] 
Re: standard deviation accuracy question
by davidrw (Prior) on Jul 22, 2005 at 14:36 UTC

Instead of writing your own, can you fix or work around the ppm problem? Perhaps packaging with PAR? Or can you at least just copy the code from Math::StdDev and include it in your source?  [reply] 
Re: standard deviation accuracy question
by Anonymous Monk on Jul 22, 2005 at 15:02 UTC

condition 1 49.56
condition 2 43.31
condition 3 50.85
condition 4 55.09
the values excel gave
condition 1 50.73
condition 2 44.33
condition 3 52.05
condition 4 56.39
the raw data
cond1 cond2 cond3 cond4
397 360.5 372 324
322 318.5 354 305.5
350 356 358 364
346 316.5 311 320
338 330 336.5 327.5
324 296.5 297 288.5
461 444 485 432.5
477 474 461 434
432 416.5 407.5 447
373 355 365 369
386 320 342 315
412 401.5 428 397.5
475 363.5 494 515
360.5 332 343.5 339
384 363 372 405
324 324 324 330
424 388 379.5 396
319 312 345 319
454.5 344 389 345
412 372 384 381.5
387 347.5 363 340
370 338 346 317
 [reply] [d/l] [select] 

And furthering the work of esteemed monk tlm above, the difference is explained completely by whether or not your final step divides by (N1) or by N.
Data Set Div(N1) Div N
  
condition1 50.73 49.56
condition2 44.34 43.32
condition3 52.05 50.85
condition4 56.39 55.10
Update: Which is exactly what the following posts (at least) had attempted to tell you: Re^2: standard deviation accuracy question Re: standard deviation accuracy question and Re^2: standard deviation accuracy questionScott
 [reply] [d/l] 

condition1 50.73
condition2 44.34
condition3 52.05
condition4 56.39
 [reply] [d/l] [select] 
Re: standard deviation accuracy question
by tlm (Prior) on Jul 22, 2005 at 16:07 UTC

The code you posted does not compute the sample standard deviation, which is what Excel is computing. You need to make the N – 1 correction. (I.e. before taking the square root, multiply everything by N/(N–1)).
Or you can use the code I posted in my other reply in this thread (though yours, with the correction, is probably more efficient).
 [reply] 
Re: standard deviation accuracy question
by Anonymous Monk on Jul 22, 2005 at 14:54 UTC

ok
i am trying to compute the sample standard deviation and i am comparing to excels STDEV function (which divides by N  1 giving the samples standard deviation)
my values aren't masssivly different from those given by excel but enough to be worrying
i believe the algorithm is right i'm just supprised if a precision difference could cauze the differences i'm seeing i'll post the data in a sec
probably getting round the broken ppm is the answer there are so many other things i need to install modules for anyway but i got used to writing my own bits for portability around this crazy network  [reply] 

the formula you're using in the original post divides by N. Of course you'll be getting different results.
 [reply] 
Re: standard deviation accuracy question
by radiantmatrix (Parson) on Jul 22, 2005 at 17:40 UTC

As far as nonfunctional PPMs go, I would suggest finding a dev machine with ActiveState Perl and a functioning PPM, then installing Statistics::Basic::StdDev and its dependencies. You can then copy the C:\Perl\site\lib\ file tree to your target machine. All the modules in question are pure Perl, so you shouldn't have any issues doing this.
Once you have that module in place, you'll be in a better position to determine the significance of differences between Excel and Perl's output.
<radiant.matrix>
Larry Wall is Yoda: there is no try{} (ok, except in Perl6; way to ruin a joke, Larry! ;P)
The Code that can be seen is not the true Code
"In any sufficiently large group of people, most are idiots"  Kaa's Law
 [reply] 
Re: standard deviation accuracy question
by Anonymous Monk on Jul 22, 2005 at 15:18 UTC

does anyone know why i might be getting different results to these commercially available applications?
Are you suffering from numerical roundoff problems? How many elements are in the array? How big are the individual elements compared to $sum and $sumsq? One way to test for numerical errors is to plot a graph of the standard deviation for various numbers of $elements. If the graph looks like it converges to a value, you're OK, but if the answer plot starts to "get fuzzy" (i.e. more random) you probably have problems.
 [reply] 
Re: standard deviation accuracy question
by spiritway (Vicar) on Jul 22, 2005 at 20:29 UTC

I had a similar problem and it drove me nuts... Your code looks OK. You didn't specify what the discrepancies were, but one thing to look into is whether your standard deviation is for a sample, or for the population. There is a difference. For a sample, you divide by (n1); for a population, you divide by n (where n = the number of elements).
 [reply] 
Re: standard deviation accuracy question
by Anonymous Monk on Jul 22, 2005 at 15:10 UTC

condition1
397
322
350
346
338
324
461
477
432
373
386
412
475
360.5
384
324
424
319
454.5
412
387
370
condition2
360.5
318.5
356
316.5
330
296.5
444
474
416.5
355
320
401.5
363.5
332
363
324
388
312
344
372
347.5
338
condition3
372
354
358
311
336.5
297
485
461
407.5
365
342
428
494
343.5
372
324
379.5
345
389
384
363
346
condition4
324
305.5
364
320
327.5
288.5
432.5
434
447
369
315
397.5
515
339
405
330
396
319
345
381.5
340
317
 [reply] [d/l] 