Algorithm RFC: fast (pseudo-)random shuffle with no repetitionby Anonymous Monk
|on Sep 22, 2023 at 21:20 UTC||Need Help??|
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
The was the SO question recently, and as it sometimes happens, when I think "oh, this can be fun to play with, for better algorithm", it brewed for itself somewhere in subconscious, until the eureka moment a couple days ago (I'm in no hurry :-)): "Of course! Be greedy, demand twice as needed!"
There was advice to use brute force, at least as accepted answer in linked question (from 2021) there. Not sure if I got it right, I don't read Ruby, and didn't try other answers. Both brute subroutines below aren't actually used: they are totally unusable for lists with ~15 unique strings or more, plus any decent amount of duplicates. I wrote 2nd one because I couldn't believe it at first (accepted solution??). They are left just in case anyone wants to try (or point at my mistakes in implementation?), and can be ignored.
Back to answers at SO, there are 2 Perl solutions. One (a) doesn't compile; (b) if fixed, emits a warning for un-initialized value; (c) if fixed (or ignored), it seems to work OK. But for (corner-case, of course) input of (b,a,a), it gives answer (b,a). I didn't look further.
Another solution (by esteemed monk) fails randomly for e.g. corner-case (a,a,a,b,b) -- the only answer can be (a,b,a,b,a), of course. Why does it fail? Output list is initialized to e.g. (a,b). If 1st key to iterate is "a", then one of 2 remaining "a"'s is added to give (a,b,a) with no place for 2nd remaining "a". So, easy fix would be kind of "breadth-first" hash consumption. I'm sorry if code I had to add looks ugly to the author.
This fixed version will serve as reference to compare my solution to, it generates truly random lists.
With algorithm I suggest -- ask for twice as many random indexes from remaining pool, then simply reject (comb out) half of them. It guarantees there will be no consecutive dupes (and of course doesn't mean "only odd or even indexes for this value").
One obvious compromise on randomness will be "dupes are never placed at both head and tail" -- except corner-cases such as 'aba' or 'abaca', of course. There are actually 3 cases, depending on size of remaining pool. Cases "2" and "3" restrict randomness further. E.g., for 'aaaabbbcc', the 'c' is never placed at indexes 0 or 1 -- unlike the "reference SO implementation with true randomness".
However, lines with "die" in them can be un-commented (and they were un-commented during benchmarking) if input is not an artificial corner-case -- this code is never reached with realistic data. I mean, other than corner-cases and head/tail restriction, my algorithm seems to produce random enough result.
(In fact, one of "requests" of "RFC" is how to estimate randomness (entropy) for multiple runs of subroutine. Didn't look into that yet.)
Further "requests" are: can it be improved? Both List::MoreUtils::samples and e.g. (unused) Math::Prime::Util::randperm return their result shuffled, which I don't need and have to sort back to order! And more, e.g. samples takes random samples and therefore should know which items were unselected, but I have no better way to find out "which" except with more work using singleton. It feels like huge amount of unnecessary work I do (though it's still much faster than "SO reference solution"). Or maybe, perhaps, someone would suggest even faster solution?
(+ I understand there's sloppiness on my side in e.g. $uniq variable name doesn't actually mean number of unique items which fake_data returns. I hope this (and similar) can be forgiven.)