Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?

Re: Sampling from Combination Space

by blokhead (Monsignor)
on Jul 18, 2005 at 05:48 UTC ( #475645=note: print w/replies, xml ) Need Help??

in reply to Sampling from Combination Space

What is generally done to sample large spaces of combinatorial objects is to use a ranking scheme. You create an easy-to-compute, 1-to-1 mapping between a set of objects and natural numbers 1 to N (where N is the number of objects). Then, since it is very easy to uniformly sample from a set of numbers, just do that as your sampling operation and use the ranking scheme to get the object that corresponds to the sampled number.

For these kinds of problems, I always consult my favorite online book, Combinatorial Generation by Frank Ruskey. On page 87 of the PDF, it derives convenient ranking and unranking algorithms for combinations. Here they are in Perl:

use Math::Pari qw[:int binomial]; use Memoize; use List::Util 'sum'; memoize 'binomial'; sub rank { my @comb = sort { $a <=> $b } @_; sum map { binomial( $comb[$_]-1, $_+1 ) } 0 .. $#comb; } # $k is the comination size # $r is the rank sub unrank { my ($k, $r) = @_; my @comb; for my $i (reverse 1 .. $k) { my $p = $i-1; $p++ until binomial($p,$i) > $r; $r -= binomial($p-1, $i); push @comb, $p; } return @comb; }
The rank function takes a combination of size k from {1 .. N} into an integer in {0 .. C(N,k)-1}. The unrank function goes the other direction. In fact, N does not even need to be known to these algorithms (and k only to the unrank algorithm).

Since it makes a lot of use of computing binomial coefficients, I've used Math::Pari's version and also memoized it for speed. However, I haven't gotten the algorithm to work with large Pari numbers (I can't remember how to work with Pari objects, and it's getting late).

Now, assuming you can get these algorithms to work on the large numbers involved, and you can sample from the appropriately large range of integers, you can perform any sampling method you want on {0 .. C(N,k)-1} and apply it to the set of combinations.


Having directly addressed your question, let me say that I find it hard to imagine that the naïve sampling approach (repeatedly choosing a uniformly random k-subset) would significantly skew any statistical properties. The sample space is so large and the chance of collision is so small. Are you absolutely sure that all this effort to avoid repeated samples isn't overkill?


Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://475645]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (4)
As of 2022-08-10 23:44 GMT
Find Nodes?
    Voting Booth?

    No recent polls found