|Just another Perl shrine|
Sampling from Combination Spaceby AdriftOnMemoryBliss (Acolyte)
|on Jul 18, 2005 at 04:50 UTC||Need Help??|
AdriftOnMemoryBliss has asked for the
wisdom of the Perl Monks concerning the following question:
I hope this is the right place to ask this question -- it may be more algorithmic than language-specific.
Okay, what I'm trying to do is to sample as randomly as possible from a very large population of combinations. In particular I'm interested in things like:
At first I thought I would do so using Math::Combinatorics with something like this:
This approach would skip an average of $sampling_density / 2 combinations (the two comes in because rand should be approximating a uniform distribution), then process a combination and recompute a new number to skip.
This seems viable for things like 158 choose 4 but it will still be impractical for larger populations like 158 choose 8 because even if I generate the combinations at a rate of 10,000 per second I'll still take a decade to get through the combinations, never mind any time taken by processing.
My timing estimations indicate that I can get through about 640,000 records in two weeks, and that's about how long I have to sample each population. I can accept that the sampling is going to be highly sparse, but I would still like to make it as evenly distributed as possible. Given that the above approach (which essentially guarantees even-ness of the sampling) is impractical are there any other options?
I've considered storing all combinations in a matrix/database and randomly sampling from that, but of course that's going to eat up something on the order of 1e12 bytes of RAM or disk-space, which works out to something like 1e3 (1000) GB.
I could also (try to) re-implement the combination-generation portion of Math::Combinatorics to allow for random-sampling without generation of intermediates.
I've also considered randomly sampling n times without replacement from the original 158 and simply ignoring the fact that I may draw the same set of n twice by chance. I am fairly confident this is a realistic assumption, given that my sampling density is on the order of 0.0001% for the 158-choose-7 case but I'm wondering if there's a cleaner alternative.
Is there an easier way to randomly sample from a space of combinations without repetition? Or a more efficient sampling algorithm that the one I proposed above using Math::Combinatorics -- preferrably one that doesn't require calculation of the intervening combinations