Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??
Those are very interesting algorithms, which I was not familiar with. Thank you very much for implementing them.

However, no offense to Mr. Knuth or his S algorithm, but selecting n unique elements from an array containing N elements only needs to consider n elements. The solution below is based on a linear shuffle algorithm I saw on perlmonks. Instead of shuffling all N elements of the array, I instead shuffle the first n elements and return only those shuffled.

The code:

sub selection_sample { my ($array, $n) = @_; my @result; die "Too few elements (". scalar @$array .") to select $n from\n" unless $n <= @$array; my %i; for (0..$n - 1) { my $r = int($_ + rand(@$array - $_)); push @result, $array->[defined($i{$r}) ? $i{$r} : $r]; $i{$r} = defined($i{$_}) ? $i{$_} : $_; } return \@result; }
The basic operation of the linear shuffle is: For each element in the array, randomly choose any element from this element up to the end of the array, and swap the two elements. Please see the original linear shuffle code for a clear demonstration of how to shuffle an entire array.

In the code above, I use a shadow hash %i to store the indices of the shuffled elements of $array. I do this so I can choose a random set from $array without changing the contents of $array, and without copying the entire array. I use a hash %i and defined() tests instead of having to initialize an entire array @i with (0..@$array-1).

I'm not sure of the original requirements of the algorithm, but even using a shadow hash instead of copying the array, my code does use more memory than Knuth's S algorithm. This may not be acceptable in all cases, but if it isn't, then you probably shouldn't be using perl :)

Depending on your requirements, it may be faster or more efficient to chop out the optimizations, which only make sense for a large $array compared to $n (and/or, if you don't want the array to actually be shuffled).

Thanks for the good and inspiring post!

Alan

Update: Ah, your point is well taken: In the context of any data storage which allows only linear access to the records, Knuth's algorithm is definitely the better solution. I'm sorry I seemed to have missed this point.

I do very much like the second algorithm you presented, probably because I didn't miss the point of that one :) I think that most of the time that would be the more appropriate algorithm for my uses when I'm dealing with file access. "Select N random lines from a file" seems like a common task which this would solve efficiently. Most of the time I'm not going to know how many lines I have in the file without counting, which defeats the purpose.


In reply to Re: Random Sampling by ferrency
in thread Random Sampling by demerphq

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others pondering the Monastery: (5)
    As of 2014-09-24 01:23 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      How do you remember the number of days in each month?











      Results (244 votes), past polls