Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid

Re: Drawing samples from a set

by swiftone (Curate)
on Jan 28, 2003 at 15:44 UTC ( #230620=note: print w/replies, xml ) Need Help??

in reply to Drawing samples from a set

If your data set is hundreds of items (i.e. not thousands), you don't really have an efficiency problem. Select $r, then start through your list, adding the probability of each item to a total until the total > $r, at which point you have your choice. Linear time. For N items, worst case is N checks, average case N/2.

I'm not a math major, but I don't really see an alternative if your probabilities are not predictable in advance. If they are, you can tackle an algorithm without going through the list, but you imply this is not the case.

Here is one idea though. This is off the top of my head, doubtless it needs some refinement:

If your probabilities don't change often, you could pre-sort the list into "chunks" of recorded sizes (but near a target, frex 0.1) which would make your select time near constant to select the correct chunk, then linear inside that chunk (where you have notably fewer items). If your probabilities change often though, this sorting may not be worthwhile.

Example: a b c d e f g h i j with probabilities 0.02, 0.05, 0.20, 0.11, 0.1, 0.14, 0.08, 0.08, 0.09, 0.11)

We can sort this into a data structure like:

[ { Total => 0.07, Set => [ { a=> 0.02, b => 0.07}] }, { Total => 0.27, Set => [ { c => 0.20}] }, { Total => 0.27, #repeat since it span's two "subsets" Set => [ { c=> 0.20 }] }, { Total => 0.38, Set => [ { d => 0.11 } ] }, #etc ]
So the end result is that you can examine $sorted->[int($r*10)], move up down inside sorted (usually no more than one element, depending on the differences between your probablities), then go through the subset in that chunk linearly.

But again, this sorting will be much more expensive than a single linear search, so you'll have to decide if it's worth it.

Replies are listed 'Best First'.
Re: Re: Drawing samples from a set
by John M. Dlugosz (Monsignor) on Jan 28, 2003 at 17:10 UTC
    If he creates another list of n items listing the cumulative probability to that point, then he can do a binary search. That's more general then the pre-computed "chunks".

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://230620]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others studying the Monastery: (4)
As of 2018-02-20 08:49 GMT
Find Nodes?
    Voting Booth?
    When it is dark outside I am happiest to see ...

    Results (268 votes). Check out past polls.