Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??
What is generally done to sample large spaces of combinatorial objects is to use a ranking scheme. You create an easy-to-compute, 1-to-1 mapping between a set of objects and natural numbers 1 to N (where N is the number of objects). Then, since it is very easy to uniformly sample from a set of numbers, just do that as your sampling operation and use the ranking scheme to get the object that corresponds to the sampled number.

For these kinds of problems, I always consult my favorite online book, Combinatorial Generation by Frank Ruskey. On page 87 of the PDF, it derives convenient ranking and unranking algorithms for combinations. Here they are in Perl:

use Math::Pari qw[:int binomial]; use Memoize; use List::Util 'sum'; memoize 'binomial'; sub rank { my @comb = sort { $a <=> $b } @_; sum map { binomial( $comb[$_]-1, $_+1 ) } 0 .. $#comb; } # $k is the comination size # $r is the rank sub unrank { my ($k, $r) = @_; my @comb; for my $i (reverse 1 .. $k) { my $p = $i-1; $p++ until binomial($p,$i) > $r; $r -= binomial($p-1, $i); push @comb, $p; } return @comb; }
The rank function takes a combination of size k from {1 .. N} into an integer in {0 .. C(N,k)-1}. The unrank function goes the other direction. In fact, N does not even need to be known to these algorithms (and k only to the unrank algorithm).

Since it makes a lot of use of computing binomial coefficients, I've used Math::Pari's version and also memoized it for speed. However, I haven't gotten the algorithm to work with large Pari numbers (I can't remember how to work with Pari objects, and it's getting late).

Now, assuming you can get these algorithms to work on the large numbers involved, and you can sample from the appropriately large range of integers, you can perform any sampling method you want on {0 .. C(N,k)-1} and apply it to the set of combinations.

....

Having directly addressed your question, let me say that I find it hard to imagine that the naïve sampling approach (repeatedly choosing a uniformly random k-subset) would significantly skew any statistical properties. The sample space is so large and the chance of collision is so small. Are you absolutely sure that all this effort to avoid repeated samples isn't overkill?

blokhead


In reply to Re: Sampling from Combination Space by blokhead
in thread Sampling from Combination Space by AdriftOnMemoryBliss

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others musing on the Monastery: (16)
    As of 2014-07-25 16:47 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      My favorite superfluous repetitious redundant duplicative phrase is:









      Results (174 votes), past polls