http://www.perlmonks.org?node_id=466322

Commander Salamander has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone, I'm rather new to perl, and have been trying to impelement a calculation of hypergeometric distribution probabilities into some code. For those of you that aren't familiar, these statistics address the following type of question:

"If I have a box with 300 white balls and 700 black balls and I draw 100 balls from the box, what are the odds that I draw 40 or more white balls".

I've encountered two (related) problems:

1 These calculations require very large numbers, apparently necessitating the use of a module such as BigInt or BigFloat (though I don't fully understand at what size number these modules become necessary). By playing with someone else's publicly available code and clumsily implementing this functionality, I was able to get things to work... however:

2 Especially when I'm using very large samples, the process of calculating the probability of is excruciatingly slow due to the large number of factorial calculations necessary.

I would greatly appreciate advice on which "big number" module is the most efficient for factorials, and whether there is any conventional wisdom as to how I could best tackle this problem.

Thanks!
-ollie-
• Comment on Fastest way to calculate hypergeometric distribution probabilities (i.e. BIG factorials)?

Replies are listed 'Best First'.
Re: Fastest way to calculate hypergeometric distribution probabilities (i.e. BIG factorials)?
by Fletch (Chancellor) on Jun 14, 2005 at 00:04 UTC

There's an interface to the GNU Scientific Library, Math::Gsl, which has several distribution functions. The underlying C library is (I think; I may be recalling anecdotal evidence here) supposedly fairly good.

--
We're looking for people in ATL

Also, the Math::Gsl::Sf has a gamma function, which may speed up calculations (I'm not sure if it would or not, I haven't actually studied the inner workings of the gamma function). The gamma function of any whole number is equal to that number's factorial, ie gamma(300) == 300!, which is why I think this function may be faster than a factorial, since there probably aren't 300 multiplications, I bet it's just some sort of integral, which may have quite a few calculations, but not 300 large multiplications.

-Bryan

Good idea, but no cigar.

$perl -MMath::Gsl::Sf=:Gamma -e'print gamma(300)' gsl: gamma.c:1111: ERROR: overflow Default GSL error handler invoked. Aborted (core dumped)$
[download]
The GSL doesn't handle large numbers.

There's also a math error to correct: gamma($n) == factorial($n - 1)

After Compline,
Zaxo

Hi again

I'm having serious trouble installing Math::Gsl through CPAN. If this isn't too far afield for this board, I'd really appreciate your thoughts on where I went astray. Before I get into those error messages, I'd like to provide some details on how I installed gsl itself (I am rather ignorant about compiling programs and I may have messed up at this stage).

I installed gsl-1.4 in my user directory (while logged in as root) of a machine running Fedora Core 3 by blindly following the instructions in the install file and executing the following commands:

1 gzip -d < gsl-1.4.tar.gz | tar xf -
2 ./configure ### the readout here was essentially gibberish to me... it took quite some time
3 make
4 make check < log2<&1 ### I checked the log file, and all tests appeared to PASS
5 make install
As far as I know... this worked, but I don't know how to explicitly test this.

Upon entering CPAN and attempting an install of Math-Gsl-0.8, I encounter the following errors during the make test:

1 t/1....Can't load '/root/.cpan/build/Math-Gsl-0.08/blib/arch/auto/Math/Gsl/Gsl.so' for module Math::Gsl: l ibgsl.so.0: cannot open shared object file: No such file or directory at /usr/lib/perl5/5.8.5/i386-linux-t hread-multi/DynaLoader.pm line 230. at t/1.t line 7 Compilation failed in require at t/1.t line 7.

2 t/2....Can't load '/root/.cpan/build/Math-Gsl-0.08/blib/arch/auto/Math/Gsl/Polynomial/Polynomial.so' for m odule Math::Gsl::Polynomial: libgsl.so.0: cannot open shared object file: No such file or directory at /us r/lib/perl5/5.8.5/i386-linux-thread-multi/DynaLoader.pm line 230. at t/2.t line 3 Compilation failed in require at t/2.t line 3.

3 Failed Test Stat Wstat Total Fail Failed List of Failed
---------------------------------------------------------------------
t/1.t 255 65280 10 20 200.00% 1-10
t/2.t 255 65280 21 42 200.00% 1-21
Failed 2/2 test scripts, 0.00% okay. 31/31 subtests failed, 0.00% okay.
make: *** test_dynamic Error 255
/usr/bin/make test -- NOT OK

I'm guessing that the installation is unable to access Gsl, but I have no idea how to remedy the problem. Any suggestions would be much appreciated.
Thanks!

First thing to check is did perl Makefile.PL complain about a missing library? Going by the paths this is some flavour of Linux, so the next steps to check are:

• Presuming the usual defaults for something using autoconf, gsl probably installed itself under /usr/local; check that /usr/local/lib contains a libgsl.so.0
• Try explicitly setting export LD_LIBRARY_PATH=/usr/local/lib in your shell and then see if the tests run
• If that works, check that /usr/local is in /etc/ld.so.conf and rerun ldconfig -v and that should pick up the new library (even if it is, if you haven't regenerated the cache ld.so could miss the new library)

If that doesn't get you going, you're at the point where you need to bring in your friendly neighborhood sysadmin and bribe him with some combination of pizza|skittles|beer.

--
We're looking for people in ATL

Re: Fastest way to calculate hypergeometric distribution probabilities (i.e. BIG factorials)?
by chas (Priest) on Jun 14, 2005 at 00:42 UTC
If you just want a reasonable approximation (and not all the decimal places) you could use Stirling's approximation:
n! ~ (n^n)e^{-n}(\sqrt{(2)(pi)(n)})
This is an asymptotic formula (i.e. the ratio of n! to the Stirling approximation tends to 1 as n tends to infinity - it isn't true that n! is actually close to this approximation) but used on ratios of factorials of the type you need it actually gives good approximations of the values. That's certainly what I'd do if I were writing such code.
(Even if you could get an exact value, do you realy need 173 decimal places, or whatever...in many cases I would guess 2 decimal places might be pretty reasonable...for a probability.)
chas
(Update: Of course, in cases where the calculation doesn't really involve a very large number of multiplications, I would do the arithmetic directly.)
Re: Fastest way to calculate hypergeometric distribution probabilities (i.e. BIG factorials)?
by BrowserUk (Pope) on Jun 14, 2005 at 01:05 UTC

Do you need total precision? Using one of the 'BigMath' modules would allow you to get complete accuracy, but calculating large factorials will consume large amounts of memory and take a long time. Using one of the C-based math packages will calculate a lot faster but will discard some accuracy. By way of example, the following both produce the same results to 30+ significant digits, which as your probabilities will be in the range 0->1 or percentanges, you're probably only interested in 2 or 3 digits of accuracy in the final result.

Using Math::Pari, factorial( 19937 ) takes a second or so:

[ 1:32:28.64] P:\test>perl -MMath::Pari=factorial -wle"print factorial
+( 19937 )"
2.174930234150431374566653166E77066

[ 1:32:30.58] P:\test>
[download]

Using Math::BigInt takes over 100 seconds to do the same thing:

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
Re: Fastest way to calculate hypergeometric distribution probabilities (i.e. BIG factorials)?
by kaif (Friar) on Jun 14, 2005 at 01:34 UTC

One more thing you may wish to consider is using algorithms that do not rely on large integers. For example, instead of computing ncr( $n,$r ) ("n choose r") by the formula $n! / ($n! * ($n-$r)!), use the following approach:

sub ncr {
my( $n,$r ) = @_;
my( $i,$result );

+
+
$result = 1; for$i ( 1 .. $r ) {$result *= $n -$r + $i;$result /= $i; } return$result;
}

+
+
print "ncr( 6, 3 ) = ", ncr( 6, 3), "\n";
# ncr( 6, 3 ) = 20
[download]

That is, arrange your multiplications and divisions in such a way that at any given point of time you have an integer not much larger than the output.

Re: Fastest way to calculate hypergeometric distribution probabilities (i.e. BIG factorials)?
by aquarium (Curate) on Jun 14, 2005 at 00:27 UTC
i can't remember the module name, but what it does is remember previous results....this greatly speeds up calculations of factorials. e.g. when you are working out 7 factorial, it already remembers the result of 6!, 5!, 4!, 3!, 2!, 1!. look around in the math modules on CPAN
the hardest line to type correctly is: stty erase ^H
Memoize

thor

Feel the white light, the light within
Be your own disciple, fan the sparks of will
For all of us waiting, your kingdom will come

Re: Fastest way to calculate hypergeometric distribution probabilities (i.e. BIG factorials)?
by tlm (Prior) on Jun 14, 2005 at 01:25 UTC

In this node I posted some code to compute a one-tailed Fisher's Exact test, which uses a hypergeometric distribution. The implementation uses Math::Pari.

BTW, do you really mean "the odds"? Or do you want the probability p? If the latter is the case, then this probability is equivalent to the P-value obtained from the one-tailed Fisher's Exact test computed with the code referred to above. (The odds are (1– p)/p.)

the lowliest monk

BTW, do you really mean "the odds"? Or do you want the probability p? [...] (The odds are (1– p)/p.)

Hmm, if the probability is 25%, I'd read that as implying "the odds are 3". But that isn't right: the odds are "3 to 1 against", which means the same as 25% - in fact I'd consider it reasonable even to say "the odds are 25%".

I'm not sure if there is a name for what (1-p)/p represents, but I don't think "the odds" is it.

Hugo

In the example you gave the odds are 3:1 (or 3, if one follows the common practice viewing this as a ratio), but decidedly not 25%. Granted, in colloquial speech, "odds" and "probability" are often used interchangeably, but technically odds and probabilities are "interchangeable" only in the limited sense that one can derive one from the other. They mean different things. See here.

Update: In response to chas's comment, I replaced the link to MathWorld's definition (which was, indeed, well, odd) with a link to a clearer explanation of odds vs. probabilities.

the lowliest monk

Probability p is expressed in terms of odds is "(1-p)/p to 1 against". Notice that if p=1/4 this gives "3 to 1 against." One can also compute "odds in favor" of an event, although "odds against" seems more common. (I'm not sure why that is, but suspect it may be that gamblers often like to bet on something with big odds against which then has a large payoff.) But you are correct that "the odds are 3" isn't really meaningful although I understood what tlm meant.
chas
Re: Fastest way to calculate hypergeometric distribution probabilities (i.e. BIG factorials)?
by tall_man (Parson) on Jun 14, 2005 at 15:19 UTC
You should compute log gamma. There are formulas that compute it very quickly and accurately, for example: Malloc with Inline::C. By the way gamma(n+1) = n! for positive integer n. (Some responders said gamma(n) = n!, which is wrong).

Update: Here is some code (perl version of gammln is from Re^4: Challenge: Chasing Knuth's Conjecture):

#!/usr/bin/perl -w
use strict;

sub logfact {
return gammln(shift(@_) + 1.0);
}

sub hypergeom {
# There are m "bad" and n "good" balls in an urn.
# Pick N of them. The probability of i or more successful selection
+s:
# (m!n!N!(m+n-N)!)/(i!(n-i)!(m+i-N)!(N-i)!(m+n)!)
my ($n,$m, $N,$i) = @_;

my $loghyp1 = logfact($m)+logfact($n)+logfact($N)+logfact($m+$n-$N) +; my$loghyp2 = logfact($i)+logfact($n-$i)+logfact($m+$i-$N)+logfact(
+$N-$i)+logfact($m+$n);
return exp($loghyp1 -$loghyp2);
}

sub gammln {
my $xx = shift; my @cof = (76.18009172947146, -86.50532032941677, 24.01409824083091, -1.231739572450155, 0.12086509738661e-2, -0.5395239384953e-5); my$y = my $x =$xx;
my $tmp =$x + 5.5;
$tmp -= ($x + .5) * log($tmp); my$ser = 1.000000000190015;
for my $j (0..5) {$ser += $cof[$j]/++$y; } -$tmp + log(2.5066282746310005*$ser/$x);
}

print hypergeom(300,700,100,40),"\n";
[download]
I only just ran across this node, thanks everyone!

For pure speed then you can inline the gammln code, unroll the loop into one big equation, and move the constants into the equation instead of referencing them indirectly in the array. I also got rid of $y for a very small speedup. This code runs about 135% faster than logfact above (i.e. over twice as fast). I renamed the function factln to be consistent with gammaln. BTW this code appears to originate in Numerical Recipes but no credit was given in Re^4: Challenge: Chasing Knuth's Conjecture, referenced in the parent. All of the function names in that book are 6 characters long, because there's a Fortran (F77) version of the book too. Thus "gammln" instead of "gammaln". sub factln { my$x = (shift) + 1;
my $tmp =$x + 5.5;
$tmp -= ($x + .5) * log($tmp); my$ser = 1.000000000190015
+ 76.18009172947146    / ++$x - 86.50532032941677 / ++$x
+ 24.01409824083091    / ++$x - 1.231739572450155 / ++$x
+  0.12086509738661e-2 / ++$x - 0.5395239384953e-5 / ++$x;
return log(2.5066282746310005*$ser/($x-6)) - $tmp; } [download] Wow, thanks to all of you for the incredible amount of advice. I'll work my way through your suggestions today. Thanks again! I could be mistaken but I think this calculates the probability of i successful selections, not i or more successful selections as claimed above. For the cdf, prob of i or more successes, you need to do the following: my$hypercdf = 0;
for (my $iref=$i; $iref < min($N,$n);$iref++)
{
$hypercdf += hypergeom($n,$m,$N,$iref); } print$hypercdf;
[download]
You may need a less than or equal to in the condition of the for loop. This probably won't make a difference in most cases, as the final probability is usually very small.
my $hypercdf = 0; for (my$iref=$i;$iref <= min($N,$n); $iref++) {$hypercdf += hypergeom($n,$m,$N,$iref);
}
print $hypercdf; [download] Re: Fastest way to calculate hypergeometric distribution probabilities (i.e. BIG factorials)? by jonadab (Parson) on Jun 14, 2005 at 13:01 UTC though I don't fully understand at what size number these modules become necessary *Much* smaller numbers than the ones you're dealing with. The details are platform-dependent; for instance, a 64-bit platform can handle, without a big-number library, larger numbers than a 32-bit platform. But nobody is ever going to manufacture a computer that can natively handle numbers of the size you're talking about. Based on Moore's law, you couldn't expect such a computer to be available while Earth is still inhabitable. You may want to look into ways to approximate factorials. Irrespective of what number library you use, I'm not sure it's possible to calculate the exact value of the factorial of 700 in reasonable time on a household computer. Alternatively, look for ways to shortcut the problem, by simplifying before you multiply. You may find that some things cancel out in ways that save a lot of time. Also, if this is for a math class, you may want to check, as some math teachers are quite happy to receive answers in expression form, with not all of the calculations performed, because they're more interested in whether you understand how to work the problem than in whether you can multiply together a few hundred numbers. "In adjectives, with the addition of inflectional endings, a changeable long vowel (Qamets or Tsere) in an open, propretonic syllable will reduce to Vocal Shewa. This type of change occurs when the open, pretonic syllable of the masculine singular adjective becomes propretonic with the addition of inflectional endings." — Pratico & Van Pelt, BBHG, p68 What do you consider "a reasonable time"? P:\test>perl -mTime::HiRes=time -MMath::BigInt -wle"print time;$n=Math::BigInt->new( 700 ); print $n->bfac; print ti +me;" 1118754720.06049 2422040124750272179867875093812352218590983385729207299450679664929938 1602156474204445190516664848192493214566714970498423275250938748173438 3839375763145922845082849997227127414016031105783055846363633712407933 2447820739281101037112665387537180790257577919273108262916904750405235 0550600840122194928923756351362966220200231781096198180461799068974504 2054891261087058908805650391358456221103769328878296090019507413099979 9035970711436279339094292032866260496375825461427727555710003007752906 1414706395743900249885149142644498650064588732269519418995459703339103 5158855923294082956927698608022220028996612834393163002878920338265474 9603473516314765262772257171154686716862814184728741187147936349501653 1974574556604131345060491220449470526233846820888647906733095692923842 1561178801427495490591414836230322620024681644130193484608025499864732 5270606104512088058712293349862185399243309054299576381718806247238195 2326046426143298940706361637536720912327516123783482738407578735677175 3287924251833711954060294360941162934900956604372083673740109088239297 5031224612531245642687296717053747734506443314924558119560479901478736 2095569251615177371103997547305518540663284200147286578962869365237870 8020647632715713644131877343275100726310805695825169381128095724320246 0157111778617472683761623869704457588005158037495665069625778930898095 7257947107016392382315281155796191202873786892389343351985086659339172 5714397527770759059751198934506870173594016967256186471310711501674736 8992690116082633762172346688969840862517264384000000000000000000000000 0000000000000000000000000000000000000000000000000000000000000000000000 0000000000000000000000000000000000000000000000000000000000000000000000 0000000000 1118754720.13248 [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow. or when absolute precission is not required, and in pure perl: #!/usr/bin/perl use Time::HiRes 'time'; my$next = 1;
my @log;
my @fact=(0);

sub more_logs {
my $n = shift; for ($next..$n) { my$log =log($_);$log[$_] =$log;
$fact[$_] = $fact[$_-1]+$log; }$next = $n+1; } more_logs(10); my$l10 = $log[10]; my$il10 = 1.0/$l10; for (@ARGV) { my$start = time;
more_logs($_) if$_ >= $next; my$fact = $fact[$_];
my $exp = int$fact*$il10; my$man = exp($fact-$exp*$l10); my$end = time;
printf("%d! = %0.10fE%d (time: %f-%f)\n", $_,$man, $exp,$start,
+$end) } __END__$ perl /tmp/bigfact.pl 1000 700 600 2000 5000
1000! = 4.0238726008E2567 (time: 1118757636.805467-1118757636.810242)
700! = 2.4220401248E1689 (time: 1118757636.810742-1118757636.810792)
600! = 1.2655723162E1408 (time: 1118757636.810928-1118757636.810945)
2000! = 3.3162750925E5735 (time: 1118757636.811015-1118757636.816799)
5000! = 4.2285779269E16325 (time: 1118757636.816961-1118757637.117728)
[download]
Re: Fastest way to calculate hypergeometric distribution probabilities (i.e. BIG factorials)?
by Anonymous Monk on Jun 14, 2005 at 05:54 UTC
Math::GMP might also be worth a look