P is for Practical  
PerlMonks 
Sampling from Combination Spaceby AdriftOnMemoryBliss (Acolyte) 
on Jul 18, 2005 at 04:50 UTC ( #475642=perlquestion: print w/ replies, xml )  Need Help?? 
AdriftOnMemoryBliss has asked for the
wisdom of the Perl Monks concerning the following question:
Hello, I hope this is the right place to ask this question  it may be more algorithmic than languagespecific. Okay, what I'm trying to do is to sample as randomly as possible from a very large population of combinations. In particular I'm interested in things like:
At first I thought I would do so using Math::Combinatorics with something like this:
This approach would skip an average of $sampling_density / 2 combinations (the two comes in because rand should be approximating a uniform distribution), then process a combination and recompute a new number to skip. This seems viable for things like 158 choose 4 but it will still be impractical for larger populations like 158 choose 8 because even if I generate the combinations at a rate of 10,000 per second I'll still take a decade to get through the combinations, never mind any time taken by processing. My timing estimations indicate that I can get through about 640,000 records in two weeks, and that's about how long I have to sample each population. I can accept that the sampling is going to be highly sparse, but I would still like to make it as evenly distributed as possible. Given that the above approach (which essentially guarantees evenness of the sampling) is impractical are there any other options? I've considered storing all combinations in a matrix/database and randomly sampling from that, but of course that's going to eat up something on the order of 1e12 bytes of RAM or diskspace, which works out to something like 1e3 (1000) GB. I could also (try to) reimplement the combinationgeneration portion of Math::Combinatorics to allow for randomsampling without generation of intermediates. I've also considered randomly sampling n times without replacement from the original 158 and simply ignoring the fact that I may draw the same set of n twice by chance. I am fairly confident this is a realistic assumption, given that my sampling density is on the order of 0.0001% for the 158choose7 case but I'm wondering if there's a cleaner alternative. Is there an easier way to randomly sample from a space of combinations without repetition? Or a more efficient sampling algorithm that the one I proposed above using Math::Combinatorics  preferrably one that doesn't require calculation of the intervening combinations
Back to
Seekers of Perl Wisdom

