http://www.perlmonks.org?node_id=230620


in reply to Drawing samples from a set

If your data set is hundreds of items (i.e. not thousands), you don't really have an efficiency problem. Select $r, then start through your list, adding the probability of each item to a total until the total > $r, at which point you have your choice. Linear time. For N items, worst case is N checks, average case N/2.

I'm not a math major, but I don't really see an alternative if your probabilities are not predictable in advance. If they are, you can tackle an algorithm without going through the list, but you imply this is not the case.

Here is one idea though. This is off the top of my head, doubtless it needs some refinement:

If your probabilities don't change often, you could pre-sort the list into "chunks" of recorded sizes (but near a target, frex 0.1) which would make your select time near constant to select the correct chunk, then linear inside that chunk (where you have notably fewer items). If your probabilities change often though, this sorting may not be worthwhile.

Example: a b c d e f g h i j with probabilities 0.02, 0.05, 0.20, 0.11, 0.1, 0.14, 0.08, 0.08, 0.09, 0.11)

We can sort this into a data structure like:

[ { Total => 0.07, Set => [ { a=> 0.02, b => 0.07}] }, { Total => 0.27, Set => [ { c => 0.20}] }, { Total => 0.27, #repeat since it span's two "subsets" Set => [ { c=> 0.20 }] }, { Total => 0.38, Set => [ { d => 0.11 } ] }, #etc ]
So the end result is that you can examine $sorted->[int($r*10)], move up down inside sorted (usually no more than one element, depending on the differences between your probablities), then go through the subset in that chunk linearly.

But again, this sorting will be much more expensive than a single linear search, so you'll have to decide if it's worth it.

Replies are listed 'Best First'.
Re: Re: Drawing samples from a set
by John M. Dlugosz (Monsignor) on Jan 28, 2003 at 17:10 UTC
    If he creates another list of n items listing the cumulative probability to that point, then he can do a binary search. That's more general then the pre-computed "chunks".