Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine

Sampling from Combination Space

by AdriftOnMemoryBliss (Acolyte)
on Jul 18, 2005 at 04:50 UTC ( #475642=perlquestion: print w/replies, xml ) Need Help??
AdriftOnMemoryBliss has asked for the wisdom of the Perl Monks concerning the following question:


I hope this is the right place to ask this question -- it may be more algorithmic than language-specific.

Okay, what I'm trying to do is to sample as randomly as possible from a very large population of combinations. In particular I'm interested in things like:

158 choose 5 = ~7.7e8 158 choose 6 = ~2.0e10 158 choose 7 = ~4.3e11 158 choose 8 = ~8.0e12

At first I thought I would do so using Math::Combinatorics with something like this:

use Math::Combinatorics my @data = ('a', 'b', 'c', 'd', 'e', 'f', 'g'); my $nmer = 3; my $sampling_density = 2; my $combinations = Math::Combinatorics->new( count => $nmer, data => [ @data ], ); my $to_skip = 0; while (my @subset = $combinations->next_combination()) { if (0 == $to_skip) { $to_skip = int(rand($sampling_density)); Process(@subset); } else { --$to_skip; next(); } }

This approach would skip an average of $sampling_density / 2 combinations (the two comes in because rand should be approximating a uniform distribution), then process a combination and recompute a new number to skip.

This seems viable for things like 158 choose 4 but it will still be impractical for larger populations like 158 choose 8 because even if I generate the combinations at a rate of 10,000 per second I'll still take a decade to get through the combinations, never mind any time taken by processing.

My timing estimations indicate that I can get through about 640,000 records in two weeks, and that's about how long I have to sample each population. I can accept that the sampling is going to be highly sparse, but I would still like to make it as evenly distributed as possible. Given that the above approach (which essentially guarantees even-ness of the sampling) is impractical are there any other options?

I've considered storing all combinations in a matrix/database and randomly sampling from that, but of course that's going to eat up something on the order of 1e12 bytes of RAM or disk-space, which works out to something like 1e3 (1000) GB.

I could also (try to) re-implement the combination-generation portion of Math::Combinatorics to allow for random-sampling without generation of intermediates.

I've also considered randomly sampling n times without replacement from the original 158 and simply ignoring the fact that I may draw the same set of n twice by chance. I am fairly confident this is a realistic assumption, given that my sampling density is on the order of 0.0001% for the 158-choose-7 case but I'm wondering if there's a cleaner alternative.

Is there an easier way to randomly sample from a space of combinations without repetition? Or a more efficient sampling algorithm that the one I proposed above using Math::Combinatorics -- preferrably one that doesn't require calculation of the intervening combinations

Replies are listed 'Best First'.
Re: Sampling from Combination Space
by blokhead (Monsignor) on Jul 18, 2005 at 05:48 UTC
    What is generally done to sample large spaces of combinatorial objects is to use a ranking scheme. You create an easy-to-compute, 1-to-1 mapping between a set of objects and natural numbers 1 to N (where N is the number of objects). Then, since it is very easy to uniformly sample from a set of numbers, just do that as your sampling operation and use the ranking scheme to get the object that corresponds to the sampled number.

    For these kinds of problems, I always consult my favorite online book, Combinatorial Generation by Frank Ruskey. On page 87 of the PDF, it derives convenient ranking and unranking algorithms for combinations. Here they are in Perl:

    use Math::Pari qw[:int binomial]; use Memoize; use List::Util 'sum'; memoize 'binomial'; sub rank { my @comb = sort { $a <=> $b } @_; sum map { binomial( $comb[$_]-1, $_+1 ) } 0 .. $#comb; } # $k is the comination size # $r is the rank sub unrank { my ($k, $r) = @_; my @comb; for my $i (reverse 1 .. $k) { my $p = $i-1; $p++ until binomial($p,$i) > $r; $r -= binomial($p-1, $i); push @comb, $p; } return @comb; }
    The rank function takes a combination of size k from {1 .. N} into an integer in {0 .. C(N,k)-1}. The unrank function goes the other direction. In fact, N does not even need to be known to these algorithms (and k only to the unrank algorithm).

    Since it makes a lot of use of computing binomial coefficients, I've used Math::Pari's version and also memoized it for speed. However, I haven't gotten the algorithm to work with large Pari numbers (I can't remember how to work with Pari objects, and it's getting late).

    Now, assuming you can get these algorithms to work on the large numbers involved, and you can sample from the appropriately large range of integers, you can perform any sampling method you want on {0 .. C(N,k)-1} and apply it to the set of combinations.


    Having directly addressed your question, let me say that I find it hard to imagine that the naïve sampling approach (repeatedly choosing a uniformly random k-subset) would significantly skew any statistical properties. The sample space is so large and the chance of collision is so small. Are you absolutely sure that all this effort to avoid repeated samples isn't overkill?


Re: Sampling from Combination Space
by Zaxo (Archbishop) on Jul 18, 2005 at 06:02 UTC

    You can get a random subset without replacement using List::Util::shuffle().

    use List::Util qw/shuffle/; sub pick { return if @_ != 2; my ($num, $count) = @_; (shuffle 1 .. $num)[0 .. $count-1]; }
    To build a sample of picks with repetition allowed:
    my ($top, $ct, @samples) = (158, 8); my $size = 50,000; $#samples = $size; $_ = [pick( $top, $ct)] for @samples;
    You can keep them unique by storing the samples as keys of a hash,
    my %sample; $sample{join ',', pick($top, $ct)} = undef while $size > keys %sample;
    If you need a memory-breaking large sample, look to a file or database to hold it. Tie::File would be simple to use for that.

    Does order matter in a sample? If not, then sort them to get a canonical order before storage.

    After Compline,

Re: Sampling from Combination Space
by pg (Canon) on Jul 18, 2005 at 05:40 UTC

    I have an interesting suggestion.

    To generate a sampling, you just repeat the following steps:

    1. set $sum to 0, $count to 0
    2. generate a random number $r between 1 and m
    3. add this random number to the $sum += $r
    4. if $sum <= 158, $count ++, else quit the loop

    After this process, the variable $count contains the number of random number you used to reach 158, which can be considered as number of selections.

    Repeat above process and you will get the satistics. Refine this a little bit and pick the right m, will give you soem good enough simulation. You have to refine this.

    Here is some code to illustrate this:

    use strict; use warnings; my @counts; while (1) { my $count = 0; my $sum = 0; while (1) { $sum += rand(rand(rand(rand(158))));#need to refine print "$sum\n"; ($sum <= 158) ? $count ++ : last; } $counts[$count] ++; print "choose $count\n"; sleep 1; }
Re: Sampling from Combination Space
by tlm (Prior) on Jul 18, 2005 at 12:30 UTC

    blokhead's approach is certainly the most general, and I'm glad he posted it. My naive approach for this case, however, would have been to use a simple modification of the Fisher-Yates shuffling algorithm (untested and not optimized):

    sub random_pick { my ( $n, $arr ) = @_; my @idx = 0..$#$arr; my $i = 0; while ( $i < $n ) { my $j = $i + rand( @idx - $i ); @idx[ $i, $j ] = @idx[ $j, $i ]; ++$i; } return @{ $arr }[ sort { $a <=> $b } @idx[ 0 .. $n-1 ] ]; }
    This samples randomly with replacement, but you can keep track of the picks (in a hash), and discard any repeats. As blokhead noted, for the numbers you're looking at this would happen very rarely, and probably never during your use of the algorithm.

    Zaxo's approach also uses FY, but doing a full shuffle each time. On the one hand this is more work than is necessary, but on the other List::Util::shuffle is a compiled extension, so it may end up being faster than the solution above.

    the lowliest monk

Re: Sampling from Combination Space (lcd)
by tye (Sage) on Jul 18, 2005 at 16:04 UTC

    Repeatedly randomly sampling without replacement is equivalent to shuffling the list and just splitting the result into groups of the desired size. So do that but don't start over after that...

    First, shuffle the list.

    Now you can group the list into groups of 5 (or 8 or whatever) items and you've got int(158/5) unique samples. Pick a number that is relatively prime with your list size and you've got another int(158/5) samples, none of which match any of your other samples. Repeat as needed. If you run out of relatively prime numbers before you get enough samples, then you can reshuffle the list at that point (risking repeating a sampling but with even lower odds than those you theorized would be acceptable).

    #!/usr/bin/perl -w use strict; # Here are some stubs so you can test my code. # Replace this with a Fisher Yates shuffle: sub PermuteList { my $av= shift @_; print "Not shuffling ( $av->[0] .. $av->[-1] )\n"; } # Replace these with your own code: sub GetPopulation { return( 100..(99+$ARGV[0]) ); } sub ProcessSample { print "( @_ )\n"; } sub Lcd { my( $x, $y )= @_; while( 1 ) { if( $y < $x ) { ( $x, $y )= ( $y, $x ); } if( $x == 0 ) { return $y; } $y %= $x; } } sub GenNextSample { my( $size, @list )= @_; my $step = @list; my( $start, $offset ); return sub { while( 1 ) { if( @list/2 <= $step ) { $step= 1; PermuteList( \@list ); $offset= $start= int rand @list; } if( 1 == Lcd( $step, 0+@list ) ) { my @sample; do { return @sample if $size == @sample; push @sample, $list[$offset]; $offset= ( $offset + $step ) % @list; } while( $offset != $start ); } $step++; } }; } my $size= $ARGV[1]; my $iter= GenNextSample( $size, GetPopulation() ); my $samples = 0; while( $samples++ < 64000 ) { my @sample = $iter->(); ProcessSample( @sample ); }

    You can test the code like:

    > perl 16 3 | more Not shuffling ( 100 .. 115 ) ( 112 113 114 ) ( 115 100 101 ) ( 102 103 104 ) ( 105 106 107 ) ( 108 109 110 ) ( 112 115 102 ) ( 105 108 111 ) ( 114 101 104 ) ( 107 110 113 ) ( 100 103 106 ) ( 112 101 106 ) ( 111 100 105 ) ( 110 115 104 ) ( 109 114 103 ) ( 108 113 102 ) ( 112 103 110 ) ( 101 108 115 ) ( 106 113 104 ) ( 111 102 109 ) ( 100 107 114 ) Not shuffling ( 100 .. 115 ) ( 102 103 104 ) ...

    - tye        

      Repeatedly randomly sampling without replacement is equivalent to shuffling the list and just splitting the result into groups of the desired size. ...

      I am afraid such a method would seriously alter the outcome of any statistical analysis. For instance, if the number of elements in each combination is a divisor of the number of elements you pick from (in this case 158), then all (158) elements would appear the exactly same number of times in your sample.



      The stupider the astronaut, the easier it is to win the trip to Vega - A. Tucket

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://475642]
Approved by jbrugger
Front-paged by jbrugger
and the sunlight beams...

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (5)
As of 2018-06-20 07:31 GMT
Find Nodes?
    Voting Booth?
    Should cpanminus be part of the standard Perl release?

    Results (116 votes). Check out past polls.