http://www.perlmonks.org?node_id=158763

in reply to Rolling a biased die

Caveat I am not a mathematician but this excited me!

I also was thinking of normalizing all n weights and partioning the number line into n contiguous segments from 0 to 1, and then of course limiting decimal precision of the weights would allow the array-based algorithm mentioned above.

I don't know how intensive random number generation is (maybe use a preprepared list..) but I got to thinking about defining a set of geometric regions, each the area of which is one of your weights. This was borne out by my looking at documentation for the recently updated GNU Scientific Library.

You can pick random numbers and ignore those which fall outside the region. For example you can choose a random unit vector by picking random numbers until you get two which fall inside a unit circle. It would seem you could do the same thing by either partitioning the circle into regions with an appropriate area, for example into pie slices or maybe swiss-cheese holes, which would let you provide a couple preset weights and a default "otherwise" case which is the cheese. Of course then you get into circle packing problems.. My wierder visualization was to start with a unit circle, or regular polygon, and add triangles around the circumference for the floating point. But obviously you could just use a circle and pick good angles to cut the pizza pie. (Sorry my pies are pizza pies..)

Likewise you can divide a unit square into areas that match your weights.

Then thought, perhaps the weights you desire fit the mathematical curve of a known distribution, which might already exist as C code. So..

The recently updated GNU Scientific Library has a huge list of wonderfully named distributions (wish I had enough math or time to learn math to know them) which might be useful for you. But check out General Discrete Distributions on preprocessing the probability list into a lookup table.

First they discuss making a cumulative probability table where C[0] = 0; C[k+1] = C[k]+P[k]; C[k] = 1 which requires log K steps per random number generation.

Then a very clever method by Marsaglia (1963) is mentioned briefly which for large K (K being the number of discrete events with different probabilities) required a huge lookup table. They finally recommend an efficient procedure devised in 1977 by Alistair J. Walker and covered by Knuth. Since I can't improve on their explanation, here it is:

This requires two lookup tables, one floating point and one integer, but both only of size K. After preprocessing, the random numbers are generated in O(1) time, even for large K. The preprocessing suggested by Walker requires O(K^2) effort, but that is not actually necessary, and the implementation provided here only takes O(K) effort. In general, more preprocessing leads to faster generation of the individual random numbers, but a diminishing return is reached pretty early. Knuth points out that the optimal preprocessing is combinatorially difficult for large K.

It would be really neat if you could provide us with a Perl interface (maybe exists?) to the relevant GSL functions, which would seem to be the four gsl_ran_discrete_xxx() functions. It takes an arbitrary length array of positive numbers of double precision and you can just hand it your weights and it normalizes for you. You build a table first which then is used to generate random numbers which exhibit the weighted probability you described. Math::Gsl exists, should be able to access through that.. though some things may not be implemented it seems to say.

Or maybe just snip the relevant code and use Inline::C? Would like to see what that looks like.

Incidentally the GSL section on Shuffling and Sampling would seem to be quite useful to Perl people.

Maybe math people will be interested in an interesting-looking talk Donald Knuth is going to give at Stanford U. April 23 on a very efficient method of topological sorting. He is providing a preview as 64 pages of gzipped postscript which I was able to view on a windows computer with an online viewer, cached (I hope) here. Knuth says this method has a big impact on computer science. Perhaps a math type would like to mention a couple salient points in it for the rest of us?

There are other packages out there especially R, listed on the Goose page

Cheers,
Matt

Note: The CPAN link above may work for you, I can't see it since CPAN has blackholed this cafe. Would like to hear from anyone who has used Perl bindings to the GSL.