http://www.perlmonks.org?node_id=158482

MeowChow has asked for the wisdom of the Perl Monks concerning the following question:

Suppose you have a hash which lists relative frequencies of occurance for various possible die rolls:
my %bias = ( 1 => 3.1, 2 => 2.0234, 3 => 1.7 4 => 1.542232, 5 => 1.321249563, 6 => 1.0142, );

So, for example, the ratio of 1's to 3's in a properly randomized output set of sufficient size would approach 3.1/1.7. Well, I've thought of a few ways to get this result, but none of them seem quite right; they trade off either efficiency, elegance, or correctness.

One method is to build a large array, with contents having frequencies approximating those above, and then randomly choose elements from this array. This, however, is imprecise, because of the rounding error from using the array length as your divisor. You can sacrifice memory for extra precision, but only to a point. Precision beyond 5 decimal digits quickly becomes a practical impossibility.

Another alternative is to create a list which gives the cumulative probability at each number:

my @distribution = ( 3.1, 5.1234, 6.8234, ... and so on );
Then, you would do a binary search to find which element is just above rand sum(@distribution). Yuck! I think we deserve an O(1) algorithm here, don't you?

So, this is my conundrum of the moment. Either I'm missing something obvious, or I'm banging my head against Von Neumann's headstone. I'd like to know which it is :)

MeowChow
s aamecha.s a..a\u\$&owag.print

Replies are listed 'Best First'.
Re: Rolling a biased die
by I0 (Priest) on Apr 12, 2002 at 05:39 UTC
my %bias = ( 1 => 3.1, 2 => 2.0234, 3 => 1.7, 4 => 1.542232, 5 => 1.321249563, 6 => 1.0142, ); my \$sum = 0; while( my(\$k,\$v) = each %bias ){ \$rand = \$k if rand(\$sum+=\$v) < \$v; }
Update:I'm totally wrong! Please ignore me! And take my advice: try the code before you claim it's broken :) Sorry, IO.

I don't think you want to put the rand() inside the loop. I think to get the correct results, you need to choose one random number outside the loop, and test it vs. the sum as you go along.

Try out a simple case by hand:

my %bias = (1 => 1, 2 => 1);
Assuming your first iteration picks up (1 => 1), you have:
\$rand = \$k if rand(\$sum += \$v) <= \$v; # This simplifies to: # \$rand = 1 if rand(1) <= 1 # this is always true. That's wrong.
What you really want is something like this:
my \$sum = 0; \$sum += \$_ foreach (values %bias); my \$target = rand(\$sum); \$sum = 0; while (my (\$k, \$v) = each %bias) { if (\$target <= (\$sum += \$v)) { \$rand = \$k; last; } }
I'm sure there's a golfier way to do it, but this demonstrates the basic idea.

Update: Okay, after actually trying the code, I believe I'm totally wrong. I think that what I thought was a combination of two bugs may actually be a clever solution. Though I'm still not sure I believe it produces the correct result distribution. IO, would you care to describe how it works? Sorry about that.

Alan

Cool algorithm. It's like a king of the hill match.

1 starts as king of the hill. (\$rand = 1)

2 comes along and challanges it. Whoever wins stays on top (is assigned to \$rand).

Just like 2, everyone else (3, 4, 5 and 6) gets a chance.

Whoever is left on top (\$rand) is declared the winner. :)

To understand why the probabilities work you have to step through the algorithm backwards.

ie. What is the chance that 6 (the final iteration) is going to win it's match against the king of the hill? \$bias{6} / sum(values %bias), which is obvious.

Now - consider the second last iteration (5). Given that 6 is going to have it's chance in a minute, and hence does not need to be included, what is the chance that 5 will win it's match? \$bias{5} / (sum(values %bias) - \$bias{6}). We remove 6 from the running by excluding it's weighting from the total.

Update: This explanation is awful. :)

How about precaching a hash called %biasbase like this?

my \$sum = 0; \$biasbase{\$_} = (\$sum += \$bias{\$_}) foreach (reverse keys %bias);

And then iterate like this:

while( my(\$k,\$v) = each %bias and not defined \$rand){ \$rand = \$k if rand(\$biasbase{\$k}) < \$v; }

Does that make sense? That way you can bail out early.

Re: Rolling a biased die
by dws (Chancellor) on Apr 12, 2002 at 05:40 UTC
Yuck! I think we deserve an O(1) algorithm here, don't you?

With non-integer probabilities, you can do O(log2(n)), but given small n, O(n) is probably a draw (and the code will be simpler).

I suggest normalizing the biases so that they sum to 1.0. Then you can rand(1.0), rather than having to either store or calculate a rand limit.

Re: Rolling a biased die
by ViceRaid (Chaplain) on Apr 12, 2002 at 10:26 UTC

Hi

This one's bugged me before, but I think you can do it with full accuracy by creating a hash where the keys represent the bounds between different probabilities. For speed, I'm checking the most likely results first in the second iteration:

sub weightedprob { my %bias = @_; my (\$total, %boundaries); # prepare the boundary map foreach ( sort { \$bias{\$b} <=> \$bias{\$a} } keys %bias ) { \$total += \$bias{\$_}; \$boundaries{\$total} = \$_; } # get a random place on the boundary map, look it up my \$random = rand(\$total); foreach ( sort { \$a <=> \$b } keys %boundaries ) { return \$boundaries{\$_} if \$random < \$_; } } my \$result = weightedprob( 1 => 3.1, 2 => 2.0234, 3 => 1.7, 4 => 1.542232, 5 => 1.321249563, 6 => 1.0142, );

I'm not sure if there's a neater way of doing it without having to iterate the hash twice; maybe not, as you need to know the aggregate value of all the values of the weighted die first.

Update: yep, there is a way, as IO ably demonstrated above. I didn't quite see how his/her algorithm was intended. It's shorter, and faster (by about three times, by my Benchmark). I'll doze off again.

//=\\
Re: Rolling a biased die
by abell (Chaplain) on Apr 12, 2002 at 10:39 UTC
Say you have n values with probabilities d_1, ..., d_n. Normalize your values so that all d_i's are at least 1. Then you can do with O(sum(d_i)) storage and O(1) time. For each integer l less than sum(d_i) there is at most one transition from one d_i to the next one between l and l+1. For each l, just store either the outcome which covers all the l-th unit interval or the two outcomes and the transition point.
It's not the most elegant solution and in case you have very small values you may think of some variation, but in your case it should work quite well.
Best regards
Re: Rolling a biased die
by giulienk (Curate) on Apr 12, 2002 at 21:35 UTC
I'm using the code i stole from the Perl CookBook, Recipe 2.10 "Generating Biased Random Numbers". It goes like this:
#code from the Perl Cookbook: see the link above for credits sub weight_to_dist { my %weights = @_; my %dist = (); my \$total = 0; my (\$key, \$weight); local \$_; foreach (values %weights) { \$total += \$_; } while ( (\$key, \$weight) = each %weights ) { \$dist{\$key} = \$weight/\$total; } return %dist; } sub weighted_rand { my %dist = @_; my (\$key, \$weight); while (1) { # to avoid floating point inaccuracies my \$rand = rand; while ( (\$key, \$weight) = each %dist ) { return \$key if (\$rand -= \$weight) < 0; } } }
Use weight_to_dist to obtain the HASH with distribution and then give it to weighted_rand for a random biased value.

\$|=\$_='1g2i1u1l2i4e2n0k',map{print"\7",chop;select\$,,\$,,\$,,\$_/7}m{..}g

Re: Rolling a biased die
by hossman (Prior) on Apr 12, 2002 at 22:11 UTC
While IO's solution is really cool, and really simple, it's run time is O(n) (where n = the number of distinct die values). More specificly it is EXACTLY n (i don't remember what that's called ... Theta? Omega?) -- it can never do better, because it allways iterates over all the possible die values. If you are rolling the same die over and over and over that may not be good enough for you (you asked for O(1) ... let's see if we can get there.)

abell's approach is a lot closer to a true O(1) solution. It requires some initialization cost and more space to build a lookup table -- but depending on the how you rank memory/speed it may be worth it. (especially if all of your frequencies are 'relatively' close, requiring you to have 'relatively' few buckets for each value after you've multipled the frequencies by the neccessary factor to get them above 1. The problem is: It's still not truely O(1). For a lot of distrobutions, it will be really close, but you still have to pay an extra cost to resolve th conflicts on all the transition points -- it may seem like that would be inconsequential to the run time, but it really depends on what is important to you. (Example: If you roll the die m times, and your frequencies are such that 1 roll out every k requires a transition point, and picking the riht number at a transtion requires t time, then the run time is O(1 + t(m/k)) ... even if t is 1 (constant time) that's breaks down to O(m/k) which is probably worse then O(n))

(UPDATE: abell made me see the light, his way is O(1) ... and requires a lot less space then the one below, even if the frequencies require that (almost) every bucket be a transition point.)

The only approach I know of that's truely O(1) for every lookup requires that all of the frequency values be integers, such that you can build an array (similiar to the one abell suggested) where each die value has the same number of buckets as it's frequency, and you just pick a random integer value less then the size of the array to use as an index for looking up your die value. This will allways be a true O(1) lookup time cost, but the memory costs can be seriously prohibitive. In MeowChow's orriginal six sided example, converting the frequencies to interger values requires an array of 1322822219 (31 + 20234 + 17 + 1542232 + 1321249563 + 10142) buckets. This appraoch is hardless worth using unless the frequencies are all small integer values and n is "big but not too big" (ie: n ~ 100, and most of the frequencies are 1 with a few 2s 3s, and other small integers)

All in all, IO's solution is the best when you have no idea what n will be, or what the distrobution of frequencies will be. BUT! ... If you know at coding time what n & the frequencies are, and n is relatively small, you can probably get the fastest possible lookup time by hardcodding in a 'switch statement'. If the frequencies are heavily skewed towards a few values, this should definitely give you a performance boost for the "common roll".

The problem is: It's still not truely O(1). For a lot of distrobutions, it will be really close, but you still have to pay an extra cost to resolve th conflicts on all the transition points -- it may seem like that would be inconsequential to the run time, but it really depends on what is important to you. (Example: If you roll the die m times, and your frequencies are such that 1 roll out every k requires a transition point, and picking the riht number at a transtion requires t time, then the run time is O(1 + t(m/k)) ... even if t is 1 (constant time) that's breaks down to O(m/k) which is probably worse then O(n))
Actually, O(1)=O(k) for any constant k, because the O(*) expression "eats" all constant multiplicative factors. Furthermore, you can't put the number of tries in the complexity count and expect it to remain constant. Even if a single computation requires just one CPU cicle, the complexity for computing m is O(m). I believe MeowChow asked for a solution with constant execution time regardless of the number of faces of the die and not of the number of rolls.

Best regards

Antonio

My bad ... you're totally right.

I don't know what i was thinking when i did that calculation. Even if every bucket is a transition point, resolving the transtion only requires that you pick between the two possible (weighted) values for that bucket.

O(1 + k) = O(1)

Re: Rolling a biased die
by mattr (Curate) on Apr 13, 2002 at 12:44 UTC
Caveat I am not a mathematician but this excited me!

I also was thinking of normalizing all n weights and partioning the number line into n contiguous segments from 0 to 1, and then of course limiting decimal precision of the weights would allow the array-based algorithm mentioned above.

I don't know how intensive random number generation is (maybe use a preprepared list..) but I got to thinking about defining a set of geometric regions, each the area of which is one of your weights. This was borne out by my looking at documentation for the recently updated GNU Scientific Library.

You can pick random numbers and ignore those which fall outside the region. For example you can choose a random unit vector by picking random numbers until you get two which fall inside a unit circle. It would seem you could do the same thing by either partitioning the circle into regions with an appropriate area, for example into pie slices or maybe swiss-cheese holes, which would let you provide a couple preset weights and a default "otherwise" case which is the cheese. Of course then you get into circle packing problems.. My wierder visualization was to start with a unit circle, or regular polygon, and add triangles around the circumference for the floating point. But obviously you could just use a circle and pick good angles to cut the pizza pie. (Sorry my pies are pizza pies..)

Likewise you can divide a unit square into areas that match your weights.

Then thought, perhaps the weights you desire fit the mathematical curve of a known distribution, which might already exist as C code. So..

The recently updated GNU Scientific Library has a huge list of wonderfully named distributions (wish I had enough math or time to learn math to know them) which might be useful for you. But check out General Discrete Distributions on preprocessing the probability list into a lookup table.

First they discuss making a cumulative probability table where C = 0; C[k+1] = C[k]+P[k]; C[k] = 1 which requires log K steps per random number generation.

Then a very clever method by Marsaglia (1963) is mentioned briefly which for large K (K being the number of discrete events with different probabilities) required a huge lookup table. They finally recommend an efficient procedure devised in 1977 by Alistair J. Walker and covered by Knuth. Since I can't improve on their explanation, here it is:

This requires two lookup tables, one floating point and one integer, but both only of size K. After preprocessing, the random numbers are generated in O(1) time, even for large K. The preprocessing suggested by Walker requires O(K^2) effort, but that is not actually necessary, and the implementation provided here only takes O(K) effort. In general, more preprocessing leads to faster generation of the individual random numbers, but a diminishing return is reached pretty early. Knuth points out that the optimal preprocessing is combinatorially difficult for large K.

It would be really neat if you could provide us with a Perl interface (maybe exists?) to the relevant GSL functions, which would seem to be the four gsl_ran_discrete_xxx() functions. It takes an arbitrary length array of positive numbers of double precision and you can just hand it your weights and it normalizes for you. You build a table first which then is used to generate random numbers which exhibit the weighted probability you described. Math::Gsl exists, should be able to access through that.. though some things may not be implemented it seems to say.

Or maybe just snip the relevant code and use Inline::C? Would like to see what that looks like.

Incidentally the GSL section on Shuffling and Sampling would seem to be quite useful to Perl people.

Maybe math people will be interested in an interesting-looking talk Donald Knuth is going to give at Stanford U. April 23 on a very efficient method of topological sorting. He is providing a preview as 64 pages of gzipped postscript which I was able to view on a windows computer with an online viewer, cached (I hope) here. Knuth says this method has a big impact on computer science. Perhaps a math type would like to mention a couple salient points in it for the rest of us?

There are other packages out there especially R, listed on the Goose page

Cheers,
Matt

Note: The CPAN link above may work for you, I can't see it since CPAN has blackholed this cafe. Would like to hear from anyone who has used Perl bindings to the GSL.