Assessing a statistical argument on the fraudulence of the Iranian elections

Update: Fixed to address the actual test used, which I originally misread. With the correct test, what they calculate is numerically accurate.

The Washington Post article making a statistical argument that the elections in Iran were a fraud has garnered a lot of attention. I replicate part of it below, with my critique of what it all means.

Their argument goes like this: A random draw from the digits 0-9 yields a 10% probability of picking any single digit. In the election results the digit 5 occurred as the last digit 4% of the time, while the digit 7 occurred as the last digit 17% of the time. (Apparently this also had some psychological significance.)

"Fewer than four in a hundred non-fraudulent elections would produce such numbers."

A testable assertion! On to the Perl (note that the election results had 116 observations):

#!/usr/bin/perl
use strict;
use warnings;
use Statistics::Descriptive;

# 1. Simulate 10,000 draws of 116 obs from a random distribution between 0 and 9.
# 2. Calculate:
#   - the odds one digit occurs 5 or less times (4% of 116)
#   - the odds one digit occurs 20 or more times (17% of 116)
#   - the mean and sd -> test 5 and 20 are outside the 95% CI
#   - the odds both occur

my $RUNS = 10_000;
my ($FIVES,$TWENTIES,$BOTH) = (0,0,0);
my @SAMPLE;
my $stat = Statistics::Descriptive::Full->new();

# Collect
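for my $i ( 1..$RUNS ) {
    # Tally the last digits of one simulated election (116 results).
    my %h;
    $h{int(rand(10))}++ for (1..116);
    # Remember the running totals so we can tell whether this run added to both.
    my ($old5,$old20) = ($FIVES,$TWENTIES);
    for ( values %h ) {
        push @SAMPLE, $_;
        # Every qualifying digit is counted, so one run can add more than 1.
        $FIVES++ if $_ <= 5;
        $TWENTIES++ if $_ >= 20;
    }
    $BOTH++ if $old5!=$FIVES and $old20!=$TWENTIES;
}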
$stat->add_data(@SAMPLE);

# Analyze
printf "Mean:\t\t\t%.2f\nSD:\t\t\t%.3f\n",$stat->mean,$stat->standard_deviation;
printf "Odds of 5 or less:\t%.3f\n",$FIVES/$RUNS;
printf "Odds of 20 or higher:\t%.3f\n",$TWENTIES/$RUNS;
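# Note: 2.96 is the multiplier that produced the output shown below; the
# usual two-sided 95% multiplier for a normal distribution is 1.96.
printf "95 percent CI:\t\t%.3f --- %.3f\n",
    $stat->mean - 2.96 * $stat->standard_deviation,
    $stat->mean + 2.96 * $stat->standard_deviation;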
printf "Odds of both:\t\t%.3f\n",$BOTH/$RUNS;

Typical Output:

Mean:                   11.60
SD:                     3.231
Odds of 5 or less:      0.204
Odds of 20 or higher:   0.112
95 percent CI:          2.037 --- 21.163
Odds of both:           0.037

So, in 116 random draws from a uniform distribution over 0-9, you would expect to find a digit occurring 4% of the time or less in roughly 20% of cases. A digit occurring 17% of the time or more is about half as frequent (roughly 11%), and 20 occurrences still falls inside the interval the script prints (which uses a multiplier of 2.96; the usual 95% multiplier is 1.96). Both individual frequencies are above 5%, so we fail to reject the null hypothesis for either individual test at the 5% level, ~~therefore "disproving" the "proof"~~ but the odds of both happening simultaneously are 3.7%, which rejects the null hypothesis of a random draw at the 5% level. Throwing in their last test of adjacent numbers (not coded) moves the frequency to 0.5%.
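As a cross-check on the simulation, the single-digit tail probabilities can also be computed exactly: under the null hypothesis, the count of any one fixed digit among 116 last digits is Binomial(116, 0.1). The following is a minimal sketch and not part of the original script; binom_pmf is just an illustrative helper name. Multiplying a single-digit tail by 10 gives the expected number of digits per election that land in that tail, which is the quantity $FIVES/$RUNS and $TWENTIES/$RUNS estimate.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Null hypothesis: each of the 116 last digits is uniform on 0-9, so the
# count of one fixed digit is Binomial(n = 116, p = 0.1).
my ($n, $p) = (116, 0.1);

# Binomial probability mass at $k, computed in log space to avoid overflow.
sub binom_pmf {
    my ($k) = @_;
    my $log_choose = 0;
    $log_choose += log($n - $_ + 1) - log($_) for 1 .. $k;   # log C(n, k)
    return exp($log_choose + $k * log($p) + ($n - $k) * log(1 - $p));
}

# Lower tail: the digit shows up 5 times or fewer (about 4% of 116).
my $low = 0;
$low += binom_pmf($_) for 0 .. 5;

# Upper tail: the digit shows up 20 times or more (about 17% of 116).
my $high = 0;
$high += binom_pmf($_) for 20 .. $n;

printf "P(fixed digit <= 5):\t\t\t%.4f\n", $low;
printf "P(fixed digit >= 20):\t\t\t%.4f\n", $high;
printf "Expected digits <= 5 per election:\t%.3f\n", 10 * $low;
printf "Expected digits >= 20 per election:\t%.3f\n", 10 * $high;
```

The joint event (some digit at 5 or fewer and some digit at 20 or more in the same election) has no equally simple closed form because the ten digit counts are dependent, which is why simulation is the easier route for that number.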

The fact remains that they used arbitrary tests to arrive at this number - you would have to believe each psychological justification to say it bears any significance. It also reeks of data mining - they do not tell us whether they tested other bits of psychological trivia that happened to turn out non-significant. If they did, then their final likelihood assessment - 1 in 200 - is invalid, and they should instead have pooled all of their tests, significant or not.
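To see why unreported tests matter, here is a rough sketch of how the chance of at least one spurious "significant" result grows with the number of tests tried. It assumes the tests are independent, which the article's tests need not be; the test counts in the loop are hypothetical, and fwer is just an illustrative name. The Bonferroni column shows the per-test threshold that would keep the overall false-positive rate at or below 5%.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Chance that at least one of $k independent tests, each run at level
# $alpha, comes out "significant" even though the null hypothesis is true.
sub fwer {
    my ($alpha, $k) = @_;
    return 1 - (1 - $alpha) ** $k;
}

my $alpha = 0.05;

# Hypothetical numbers of tests tried; the article does not say how many.
for my $k (1, 2, 5, 10, 20) {
    printf "%2d tests:\tP(at least one hit) = %.3f\tBonferroni level = %.4f\n",
        $k, fwer($alpha, $k), $alpha / $k;
}
```

With 20 independent tests at the 5% level, the chance of at least one false positive is about 64%, so a single reported hit means little unless we know how many tests were run.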

Election fraud is a serious charge, and one that should be made with stronger evidence than a few minor statistical anomalies based on flimsy ad-hoc reasoning. Analyses based on exit polling data, for example, are much more sound - if systematic anomalies are observed you either have to reject the polling methodology (sample bias, e.g.) or question the election results.
