http://www.perlmonks.org?node_id=773831

Update: Fixed to address the actual test used, which I originally misread. What they calculate turns out to be numerically accurate.

A Washington Post article has garnered a lot of attention with a statistical argument that the elections in Iran were fraudulent. I replicate part of it below, along with my critique of what it all means.

Their argument goes like this: a random draw from the digits 0-9 yields a 10% probability of picking any single digit. In the election results, the digit 5 occurred as the last digit 4% of the time, while the digit 7 occurred as the last digit 17% of the time. (Apparently this also had some psychological significance.)

"Fewer than four in a hundred non-fraudulent elections would produce such numbers."

A testable assertion! On to the Perl (note that the election results had 116 observations):

#!/usr/bin/perl
use strict;
use warnings;
use Statistics::Descriptive;

# 1. Simulate 10,000 draws of 116 obs from a random distribution between 0 and 9.
# 2. Calculate:
#    - the odds one digit occurs 5 or less times (4% of 116)
#    - the odds one digit occurs 20 or more times (17% of 116)
#    - the mean and sd -> test 5 and 20 are outside the 95% CI
#    - the odds both occur

my $RUNS = 10_000;
my ($FIVES,$TWENTIES,$BOTH) = (0,0,0);
my @SAMPLE;
my $stat = Statistics::Descriptive::Full->new();

# Collect
for my $i ( 1..$RUNS ) {
    my %h;
    $h{int(rand(10))}++ for (1..116);           # tally 116 random last digits
    my ($old5,$old20) = ($FIVES,$TWENTIES);
    for ( values %h ) {
        # $stat->add_data($_);
        push @SAMPLE, $_;
        $FIVES++    if $_ <= 5;                 # counted once per qualifying digit
        $TWENTIES++ if $_ >= 20;
    }
    # Both events happened in this run if both counters moved
    $BOTH++ if $old5!=$FIVES and $old20!=$TWENTIES;
}
$stat->add_data(@SAMPLE);

# Analyze
printf "Mean:\t\t\t%.2f\nSD:\t\t\t%.3f\n",$stat->mean,$stat->standard_deviation;
printf "Odds of 5 or less:\t%.3f\n",$FIVES/$RUNS;
printf "Odds of 20 or higher:\t%.3f\n",$TWENTIES/$RUNS;
# Note: 2.96 is the multiplier that produced the interval in the output;
# the usual normal multiplier for a 95% interval is 1.96.
printf "95 percent CI:\t\t%.3f --- %.3f\n",
    $stat->mean - 2.96 * $stat->standard_deviation,
    $stat->mean + 2.96 * $stat->standard_deviation;
printf "Odds of both:\t\t%.3f\n",$BOTH/$RUNS;

Typical Output:

Mean:                   11.60
SD:                     3.231
Odds of 5 or less:      0.204
Odds of 20 or higher:   0.112
95 percent CI:          2.037 --- 21.163
Odds of both:           0.037

So, across 116 random draws from a uniform distribution over 0-9, you would expect to find a digit occurring 4% of the time or less in over 20% of cases. A digit occurring 17% of the time or more is about half as frequent (11%), yet still inside the 95% confidence interval reported by the script. We fail to reject the null hypothesis in either individual test at the 5% level, thereby "disproving" the "proof". The odds of both happening simultaneously, however, are 3.7%, which does reject the null hypothesis of a random draw at the 5% level. Throwing in their last test of adjacent numbers (not coded) moves the frequency to 0.5%.
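
As a cross-check on the simulation, the same tail frequencies can be computed exactly from the binomial distribution. This is only a sketch under the post's own assumptions (116 observations, each last digit an independent draw with probability 0.1); the helper names binom_pmf and binom_range are mine and are not part of the original script. Because the simulation counts every digit that crosses a threshold, its "odds" correspond to roughly ten times the per-digit tail probability, so the scaled values here should land close to the simulated 0.204 and 0.112.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumptions (from the post): n = 116 observations, and each last digit
# is an independent uniform draw, so a given digit has p = 0.1 per draw.

# P(X = k) for X ~ Binomial(n, p), built in log space to avoid overflow.
sub binom_pmf {
    my ($n, $k, $p) = @_;
    my $log_coef = 0;
    # log of C(n, k) = sum over i = 1..k of log((n - k + i) / i)
    $log_coef += log($n - $k + $_) - log($_) for 1 .. $k;
    return exp($log_coef + $k * log($p) + ($n - $k) * log(1 - $p));
}

# P(lo <= X <= hi) for X ~ Binomial(n, p).
sub binom_range {
    my ($n, $lo, $hi, $p) = @_;
    my $total = 0;
    $total += binom_pmf($n, $_, $p) for $lo .. $hi;
    return $total;
}

my ($n, $p) = (116, 0.1);
my $p_low  = binom_range($n, 0, 5, $p);     # a given digit appears <= 5 times
my $p_high = binom_range($n, 20, $n, $p);   # a given digit appears >= 20 times

printf "Mean: %.2f  SD: %.3f\n", $n * $p, sqrt($n * $p * (1 - $p));
printf "P(given digit <= 5):  %.4f  (x10 digits: %.3f)\n", $p_low,  10 * $p_low;
printf "P(given digit >= 20): %.4f  (x10 digits: %.3f)\n", $p_high, 10 * $p_high;
```

The exact calculation avoids run-to-run noise, but it does not give the joint probability of both events in the same election, since the ten digit counts are not independent of each other. For that figure the simulation remains the simpler tool.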

The fact remains they used arbitrary tests to arrive at this number - you would have to believe each psychological justification to say it bears any significance. It also reeks of data mining - they omit to tell us if they tested other bits of psychological trivia that happened to turn out non-significant. If they did then their final likelihood assessment - 1 in 200 - is invalid, and they should have instead pooled all of their tests, significant or not.
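
To make the data-mining concern concrete, here is a small sketch of how quickly false positives accumulate when several tests are tried. It assumes the tests are independent and each is run at the 5% level; the article does not say how many tests were tried, so the counts below are purely illustrative, and the function name familywise_rate is my own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Chance that at least one of $k independent tests comes up "significant"
# on perfectly clean data, when each test has false-positive rate $alpha.
sub familywise_rate {
    my ($alpha, $k) = @_;
    return 1 - (1 - $alpha) ** $k;
}

# Illustrative test counts only; not taken from the article.
for my $k (1, 2, 5, 10, 20) {
    printf "%2d tests: %.3f\n", $k, familywise_rate(0.05, $k);
}
```

The more pieces of trivia that are tested, the less a single significant result means, which is why the untested or unreported alternatives matter.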

Election fraud is a serious charge and one that should be made with stronger evidence than a few minor statistical anomalies based on flimsy ad-hoc reasoning. Analyses based on exit polling data, for example, are much more sound - if systematic anomalies are observed, you have to either reject the polling methodology (sample bias, e.g.) or question the election results.