http://www.perlmonks.org?node_id=618798

scoobyrico has asked for the wisdom of the Perl Monks concerning the following question:

I have a piece of code that has worked a long time, but I know there has got to be a better way to do this. I dislike gotos but that was the best I could come up with at the time. Any help would be much appreciated! Thanks in advance! My goal is to get the most random selection of items from the @weighteddiv array to the @selected array without duplicating common numbers, i.e. the @selected array can not have a duplicate number. The values are int 4 digits, like 2345. The @weighteddiv will have 1...N numbers at various frequencies, like 2200 of 2161, 1854 of 3222, 50 of 2344 and so forth. Hopefully this clears up more of what this code is suppose to do.
use List::Util shuffle; @weighteddiv = shuffle @weighteddiv; @selected = (); foreach $i (0..$maxclients-1) { xyz: if ($#weighteddiv > = 0) { $idx=int(rand($#weighteddiv)); $selected[$i] = $weighteddiv[$idx]; if ($i>0) { foreach $j (0..$i-1) { if ($selected[$i] == $selected[$j]) { goto xyz } } } } }

Replies are listed 'Best First'.
Re: removing the goto
by kwaping (Priest) on Jun 01, 2007 at 17:23 UTC
    I had the benefit of a direct discussion with the OP about this, which brought much clarity as to his intentions. He has an array of ~4000 elements and wants to select five non-duplicates out of there. So basically, his code can be reduced to this:
    use List::Util 'shuffle'; my %unique = map { $_ => 1 } @weighteddiv; my @shuffled = shuffle(keys %unique); @selected = @shuffled[0..4];
    I noticed he was randomizing the array, then selecting random indices from there, which was not only reduntdant but created the issue of potentially selecting the same element.

    Update: The big array is *not* already internally unique after all, so I had to add the line to unique it.

    Update 2: My solution is no longer relevant, after the OP updated with more information. However, I'm leaving my code as-is because it might help a future monk looking for a similar solution.

    ---
    It's all fine and dandy until someone has to look at the code.
      use List::Util 'shuffle'; use List::MoreUtils 'uniq'; my @shuffled = shuffle( uniq @weighteddiv ); @selected = @shuffled[0..4];
        It might be more appropriate to do the shuffle before the uniq if the OP wants the duplicate values to have a greater chance of being selected - as suggested by the name @weighteddiv.
        use List::Util 'shuffle'; use List::MoreUtils 'uniq'; use strict; my @weighteddiv = ( (1..50), (1..10) x 10, ); my @shuffle1 = shuffle( uniq @weighteddiv ); my @shuffle2 = uniq( shuffle @weighteddiv ); printf "shuffle1: %s\n", join(', ', sort {$a <=> $b} @shuffle1[0..10]) +; printf "shuffle2: %s\n", join(', ', sort {$a <=> $b} @shuffle2[0..10]) +;
        Output:
        shuffle1: 2, 5, 16, 17, 23, 25, 26, 27, 39, 42, 43 shuffle2: 1, 3, 4, 5, 6, 8, 9, 10, 13, 40, 50
        This approach is appropriate for the current implementation of uniq, which preserves the original order while removing duplicates.
Re: removing the goto
by shmem (Chancellor) on Jun 01, 2007 at 17:01 UTC
    Quick fix - use redo:
    use List::Util shuffle; @weighteddiv = shuffle @weighteddiv; @selected = (); xyz: foreach $i (0..$maxclients-1) { # foreach $i (0..$maxclients-1) { # no need for an extra bl +ock # xyz: { # see below if ($#weighteddiv > = 0) { $idx=int(rand($#weighteddiv)); $selected[$i] = $weighteddiv[$idx]; if ($i>0) { foreach $j (0..$i-1) { if ($selected[$i] == $selected[$j]) { redo xyz; } } } # } } }

    update: Removed the extra block - reading the docs helps, really...

    Thanks to TimToady (privmsg), ysth and imp for their effort convincing the stubborn

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
      No need for the extra block:
      use List::Util shuffle; @weighteddiv = shuffle @weighteddiv; @selected = (); xyz: foreach $i (0..$maxclients-1) { if ($#weighteddiv > = 0) { $idx=int(rand($#weighteddiv)); $selected[$i] = $weighteddiv[$idx]; if ($i>0) { foreach $j (0..$i-1) { if ($selected[$i] == $selected[$j]) { redo xyz; } } } } }
        Erm... wasn't the goto label placed below the foreach in the OP, and a goto dinn' advance the iterator?

        --shmem

        _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                      /\_¯/(q    /
        ----------------------------  \__(m.====·.(_("always off the crowd"))."·
        ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: removing the goto
by GrandFather (Saint) on Jun 01, 2007 at 23:26 UTC

    I'd refactor the loop to eliminate the goto, the label, most of the nesting, and to make the intent a little clearer:

    foreach my $i (0..$maxclients-1) { $selected[$i] = $weighteddiv[int rand @weighteddiv]; next if $i == 0; # Check for duplicates redo if grep {$selected[$i] == $selected[$_]} 0 .. $i-1; }

    Update: s/next if/last if/ - bug pointed out by TimToady

    Update: removed while loop - issue pointed out by TimToady
    fixed bug due to next in foreach modified statement
    fixed domain for rand - issue pointed out by TimToady


    DWIM is Perl's answer to Gödel
Re: removing the goto
by jettero (Monsignor) on Jun 01, 2007 at 16:59 UTC
    xyz: foreach () { next xyz; # UPDATE: it has been brought to my attention that # shmem's "redo" below (and right here) is a better choic +e. # Alas, I pictured this upside down. }

    That definitely isn't quite the same, but it's possibly less objectionable.

    Personally, I find the goto statement perfectly acceptable, when not abused. Call it a longjump if you like, sometimes it just makes sense to use it.

    -Paul

Re: removing the goto
by gaal (Parson) on Jun 02, 2007 at 07:14 UTC
    You already have good solutions, I'll just suggest an approach that works very similarly but is spelled out in different style.

    # I'll call @weighteddiv "@input" here, and $maxclients "$choose_count +". { my %selected_set; my $choose_one = sub { $selected_set{ rand(@input) } = 1) }; $choose_one->() while keys %selected_set < $choose_count; my @selected = @input[ keys %selected_set ]; }

    This uses a new variable to remember which indexes had already been selected. It doesn't shuffle the input so it's less expensive up front but risks running forever if the random selections are unlucky (or indeed if you ask for more selections than the input contains). In practical terms, this means you should choose this strategy when @input is large and $choose_count is small.

    You can of course inline the choose_one function, since it's so simple and only used once. I put it there for clarity.

    Note that your rand never selected the last element of the input.

      Hmm, that's not quite right. You've not eliminated the chance of duplicate values in the output. You want something more like:
      { my %selected_set; my $choose_one = sub { $selected_set{ @input[rand @input] } = 1) } +; $choose_one->() while keys %selected_set < $choose_count; my @selected = keys %selected_set; }
      The problem of non-termination is indeed something that will bite you when you least expect it. I believe it can only be solved probabilistically in the absence of a complete scan of the input, either by shuffle or by calculating a histogram of the set of input values somehow. In a workflow situation I'd probably try to get the histogram precalculated for me, and then you can actually use the numerical weights to make your selection, since this scales better to large weights than duplicating input.

      So in the absence of that kind of knowledge, I see two ways of reducing the probability of a hang. First way is to use a shuffle for small datasets and random selection for large datasets, where small/large division can be arbitrary, or determined dynamically by scanning the front of the dataset to make sure there are "enough" different values.

      The second probabilistic method is to count how many times you've made a random selection, and give up if the number of attempts far outweighs the number of desired values (and maybe print a warning, so you know why your program now takes five seconds to run instead of five microseconds). But running for five seconds and producing some output is a lot better than running forever and producing no output.

        I was under the misapprehension that the input elements were guaranteed to be distinct. But seeing your reply I reread the OP and saw that was indeed not the case. Oops! In the absence of such a guarantee of course I agree with all your observations. Now, since my suggestion was in the first place more a stylistic one and not an algorithmic one, I'm interested in following up on how to code higher-level stuff that observes the running program and aborts and possibly switches attempted computations, how to do this with the minimum uglification of existing implementations, and reasonably efficiently.

        Are there patterns here that you as language designer can see? For example, in a pure world, it's quite reasonable to say "fork off a thread with strategy #1, and give it $x seconds to run. If it fails emit the warning and move off to strategy #2". This is what you suggested except in a pure language what "fork off a thread" means under the hood kind of doesn't matter. It may be an OS thread, or something else -- you know the side effects are contained, and the computation can be killed at any time. Does it make sense to tag a function with "this should never take more than [some amount of CPU / RAM / other resource]"?

        I also wonder if there's any commonality to expressing "tips" about the data, but I guess there isn't. (Like for example if here whoever gave us the data knew there were duplicates, but only very few, how would they express that.)

        Sorry for rambling a bit, I should get some sleep...

Re: removing the goto
by BrowserUk (Patriarch) on Jun 03, 2007 at 04:08 UTC

    You have answers that do what you set out to do. However, I wonder if you realise that what you are doing probably isn't giving you the results that you might be after?

    Reading between the lines of you question and the variable names in your code, you appear to be wanting to pick values 'randomly', but according to some predetermined distribution. The trouble is, that your distribution will go out of the window because you are requiring unique picks.

    To demostrate what I mean, imagine that you want to pick from 'A' through 'F', with 'B' having double the chance of being picked than 'A'; and 'C' double the chance of being picked than 'B'; and so on up the list. The following code, which is just a variation on the other solutions posted, will attempt to do that.

    #! perl -slw use strict; our $ITERS ||= 1000; our $PICKS ||= 4; sub genPicker { my $n = shift; my @indices = 0 .. $n - 1; return sub { return unless @indices; my $index = int rand @indices; my $choice = $indices[ $index ]; $indices[ $index ] = $indices[ $#indices ]; --$#indices; return $choice; }; } my $n = 1; my @data = map{ ( $_ ) x ( $n *=2 ) } 'A' .. 'F'; print "Required frequencies"; for my $value ( 'A' .. 'F' ) { printf "$value : %.1f\n", grep( { $_ eq $value } @data ) / @data * + 100; } my %chosen; for ( 1 .. $ITERS ) { my $picker = genPicker ( scalar @data ); my %picks; while( keys %picks < $PICKS ) { my $pick = $picker->(); my $value = $data[ $pick ]; $picks{ $value }++ ; } $chosen{ $_ }++ for keys %picks; } print "\nActual frequencies:"; printf "$_ : %.1f\n", $chosen{ $_ } / ($ITERS * $PICKS) * 100 for sort{ $chosen{ $a } <=> $chosen{ $b } } keys %chosen;

    And if you only pick one unique value, the frequencies of choices match those required within the bound of random variation:

    C:\test>618798 -ITERS=1e4 -PICKS=1 Required frequencies A : 1.6 B : 3.2 C : 6.3 D : 12.7 E : 25.4 F : 50.8 Actual frequencies: A : 1.7 B : 3.1 C : 6.4 D : 12.4 E : 25.3 F : 51.1

    However, when we pick 2 values repeatedly, the uniqueness criteria starts to distort the actual frequencies:

    C:\test>618798 -ITERS=1e4 -PICKS=2 Required frequencies A : 1.6 B : 3.2 C : 6.3 D : 12.7 E : 25.4 F : 50.8 Actual frequencies: A : 2.1 B : 4.2 C : 8.1 D : 15.4 E : 29.5 F : 40.8

    And as we increase the number picked, so the distortion grows rapidly:

    C:\test>618798 -ITERS=1e4 -PICKS=3 Required frequencies A : 1.6 B : 3.2 C : 6.3 D : 12.7 E : 25.4 F : 50.8 Actual frequencies: A : 3.0 B : 5.7 C : 11.3 D : 20.2 E : 28.0 F : 31.7 C:\test>618798 -ITERS=1e4 -PICKS=4 Required frequencies A : 1.6 B : 3.2 C : 6.3 D : 12.7 E : 25.4 F : 50.8 Actual frequencies: A : 4.5 B : 8.9 C : 16.1 D : 21.6 E : 24.0 F : 24.9

    Until, by the time you are picking one of each value, the distribution is flat. (as it has to be!):

    C:\test>618798 -ITERS=1e4 -PICKS=6 Required frequencies A : 1.6 B : 3.2 C : 6.3 D : 12.7 E : 25.4 F : 50.8 Actual frequencies: F : 16.7 A : 16.7 D : 16.7 C : 16.7 E : 16.7 B : 16.7

    In order for the distribution to remain as you require it, it would be necessary to use a different set of data and distribution values at each level of pick.

    Maybe you know this already, or maybe it doesn't matter for your application, but as I noticed the problem when attempting to verify my iterative picker, I thought it worth pointing out.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: removing the goto
by TilRMan (Friar) on Jun 03, 2007 at 06:05 UTC

    redo LABEL is the answer to your question, but here are a couple of "better" ways to select $maxclients unique elements from a bag (@weighteddiv).

    This approach has the advantage of being simple, but it is relatively slow for large datasets and does not preserve the order in which the numbers were picked:

    my @weighteddiv = (('1234') x 7, ('9876') x 42, ('1123') x 13); my %selected; my $maxclients = 2; until (keys %selected == $maxclients or @weighteddiv == 0) { my $i = rand(+@weighteddiv); # Not $# my $pluck = splice @weighteddiv, $i, 1; ++$selected{$pluck}; } my @selected = keys %selected; print "Selected: @selected\n";

    You can preserve the order with a little more code in the loop to create @selected incrementally:

    if (++$selected{$pluck} == 1) { push @selected, $pluck; }

    Finally, a complicated but efficient solution involving probabilities:

    my @weighteddiv = (('1234') x 7, ('9876') x 42, ('1123') x 13); my @selected; my $maxclients = 2; my %occurrence; foreach (@weighteddiv) { $occurrence{$_}++; } my $total = @weighteddiv; until (@selected == $maxclients or keys %occurrence == 0) { my $i = rand($total); keys %occurrence; # reset hash iterator while (my ($value, $count) = each %occurrence) { if ($i < $count) { push @selected, $value; delete $occurrence{$value}; $total -= $count; last; } $i -= $count; } } print "Selected: @selected\n";

    Caveat: All solutions untested.

Re: removing the goto
by carmen (Initiate) on Jun 04, 2007 at 19:16 UTC
    I know you've gotten lots of answers, but (without getting into what you were trying to do) couldn't the "goto xyz" be replaced with "last"? That should take you to the end of the "foreach $j ...." loop and back to the "foreach $i" loop. Then get rid of the xyz label? -- Carmen
      A 'last' would resume the loop and advance the foreach-iterator. I guess that's what the OP wanted to avoid by placing the goto-label... hence the 'redo'. :-)

      --shmem

      _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                    /\_¯/(q    /
      ----------------------------  \__(m.====·.(_("always off the crowd"))."·
      ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: removing the goto
by Anonymous Monk on Jun 02, 2007 at 18:01 UTC
    use Python or LISP.
      Or use Perl 6:
      @results = @input.pick(*).uniq.[^$numwanted];
      or some such...

      Now that I think of it, if the shuffler (.pick(*)) is sufficiently lazy, that gets around all the problems of the other solutions. It doesn't need to visit all the values, but it'll still visit all the values if you ask for more values than exist uniquely, so it's also guaranteed to terminate.

A reply falls below the community's threshold of quality. You may see it by logging in.