sourcecode
tlm
<code>
=head1 NAME
Statistics::FET - Fisher's Exact Test statistic (2x2 case)
=head1 SYNOPSIS
use Statistics::FET 'fishers_exact';
=head1 DESCRIPTION
This module exports only one function, C<fishers_exact>, which
computes the one- or two-sided Fisher's Exact Test statistic for the 2
x 2 case. In the following documentation I will be referring to the
following family of 2 x 2 contingency tables
* | * | r1 = a+b
--------+-------+----------
* | * | r2 = c+d
--------+-------+----------
c1 | c2 | N
= a+c | = b+d | = a+b+c+d
The *'s mark the cells, N is the total number of points represented by
the table, and r1, r2, c1, c2 are the marginals. As suggested by the
equalities, the letters a, b, c, d refer to the various cells (reading
them left-to-right, top-to-bottom). Depending on context, the letter
c (for example) will refer *either* to the lower left cell of the
table *or* to a specific value in that cell. Same for a, b, and d.
In what follows, H(x) (or more precisely, H(x; r1, r2, c1)) refers
to the hypergeometric expression
r1!*r2!*c1!*c2!
-------------------------------------
(r1+r2)!*x!*(r1-x)!*(c1-x)!*(r2-c1+x)!
(I omit c2 from the parametrization of H because c2 = r1 + r2 - c1.)
=head1 FUNCTION
=over 4
=item fishers_exact( $a, $b, $c, $d, $two_sided )
The paramater C<$two_sided> is optional. If missing or false
C<fishers_exact> computes the one-sided FET p-value. More
specifically, it computes the sum of H(x; a+b, c+d, a+c) for x = a to
x = min(a+b, a+c). (If you want the sum of H(x; a+b, c+d, a+c) for x
= max(0, a-d) to x = a, then compute C<fishers_exact( $b, $a, $d, $c )> or
C<fishers_exact( $c, $d, $a, $b )> (these two are equivalent).)
If C<$two_sided> is true, the returned p-value will be for the two-sided
FET.
=back
=cut
use strict;
use warnings;
package Statistics::FET;
use Exporter 'import';
use Math::Pari ();
my $Tolerance = 1;
$Tolerance /= 2 while 1 + $Tolerance/2 > 1;
@Statistics::FET::EXPORT_OK = 'fishers_exact';
sub fishers_exact {
my ( $a, $b, $c, $d, $ts ) = @_;
my $test = $a*( $a + $b + $c + $d ) - ( $a + $b )*( $a + $c );
return 1 if $test < 0 and $ts;
# below here, $test < 0 implies !$ts;
my $p_val;
if ( $test < 0 ) {
if ( $d < $a ) {
$p_val = _fishers_exact( $d, $c, $b, $a, $ts, 1 );
}
else {
$p_val = _fishers_exact( $a, $b, $c, $d, $ts, 1 );
}
}
else {
if ( $b < $c ) {
$p_val = _fishers_exact( $b, $a, $d, $c, $ts, 0 );
}
else {
$p_val = _fishers_exact( $c, $d, $a, $b, $ts, 0 );
}
}
return $p_val;
}
sub _fishers_exact {
my ( $a, $b, $c, $d, $ts, $complement ) = @_;
die "Bad args\n" if $ts && $complement;
my ( $aa, $bb, $cc, $dd ) = ( $a, $b, $c, $d );
my $first = my $delta = _single_term( $aa, $bb, $cc, $dd );
my $p_val = 0;
{
$p_val += $delta;
last if $aa < 1;
$delta *= ( ( $aa-- * $dd-- )/( ++$bb * ++$cc ) );
redo;
}
if ( $ts ) {
my $m = $b < $c ? $b : $c;
($aa, $bb, $cc, $dd ) = ( $a + $m, $b - $m, $c - $m, $d + $m );
$delta = _single_term( $aa, $bb, $cc, $dd );
my $bound = -$Tolerance;
while ( $bound <= ( $first - $delta )/$first && $aa > $a ) {
$p_val += $delta;
$delta *= ( ( $aa-- * $dd-- )/( ++$bb * ++$cc ) );
}
}
elsif ( $complement ) {
$p_val = 1 - $p_val + $first;
}
return $p_val;
}
sub _single_term {
my ( $a, $b, $c, $d ) = @_;
my ( $r1, $r2 ) = ($a + $b, $c + $d);
my ( $c1, $c2 ) = ($a + $c, $b + $d);
my $N = $r1 + $r2;
return exp( _ln_fact( $r1 ) + _ln_fact( $r2 ) +
_ln_fact( $c1 ) + _ln_fact( $c2 ) -
_ln_fact( $N ) -
( _ln_fact( $a ) + _ln_fact( $b ) +
_ln_fact( $c ) + _ln_fact( $d ) ) );
}
{
my $two_pi;
my $pi_over_3;
my $half;
BEGIN {
$two_pi = Math::Pari::PARI( 2 * atan2 0, -1 );
$pi_over_3 = Math::Pari::PARI( atan2( 0, -1 )/3 );
$half = Math::Pari::PARI( 0.5 );
}
sub _ln_fact {
my $n = Math::Pari::PARI( shift() );
die "Bad args to _ln_fact: $n" if $n < 0;
my $ln_fact;
eval {
$ln_fact = log Math::Pari::factorial( $n );
};
if ( $@ ) {
die $@ unless $@ =~ /\QPARI: *** exponent overflow/;
# Gosper's approximation; from
# http://mathworld.wolfram.com/StirlingsApproximation.html
$ln_fact = $half * log( $two_pi*$n + $pi_over_3 )
+ $n * log( $n )
- $n;
}
return $ln_fact;
}
}
__END__
</code>
<p>See documentation with the code.</p>
<p>Note that the code computes factorials in only one place, the internal subroutine <tt>_single_term</tt>, which computes the hypergeometric probability function for a 2 x 2 contingency table (specified through its cell values, a, b, c, and d). Furthermore, the only dependance of the main routine (<tt>fishers_exact</tt>) on the Math::Pari module is through <tt>_single_term</tt>. If you prefer a library other than Math::Pari or don't like my implementation of <tt>_single_term</tt>, just replace it with your own. The rest of the code should work fine.</p>
<p>Also note that <tt>_single_term</tt> is called at most twice for every call to <tt>fishers_exact</tt>, so relegating this computation to a separate function doesn't significantly add to <tt>fishers_exact</tt>'s overhead.</p>
<p>The computation of factorials attempts to use [mod://Math::Pari]::factorial, and switches over to Gosper's approximation if that fails. Although, in principle, this maximizes the algorithm's accuracy, the performance cost may not justify this policy. For example, for 10_000!, the relative difference between the two methods is < 1e-10, but Gosper's is much faster:
<c>
Rate pari gospers
pari 604/s -- -94%
gospers 9657/s 1500% --
</c>
The cost/performance only gets worse for larger factorials.</p>
<p>On the other hand, if you find yourself working in this regime, ask yourself whether χ<sup>2</sup> isn't more suited to your needs than than the FET.</p>
<p>Needless to say, I am providing this code "as is", without any expressed or implied warranties whatsoever. Use at your own risk.</p>
<small><p><b>Update:</b> Deleted a few vestigial lines of code. Thanks to [BrowserUk]++ for pointing them out.</p></small>
Miscellaneous
/msg me