in reply to Random shuffling
The "element" I am identifying in the genome has two ends, each around 50-100 characters long. The separation between these ends is between 200 and 20,000. Neither the length of the ends nor the separation between them is known a priori. So I wonder if there are two size ranges or periodicities whose non-randomness needs to be broken down, with two successive shuffles: one in a ~100-length window and again in a ~1000-length window? Or some other window lengths?
Let's start with the minimum length:
50 + 200 + 50 = 300 characters. There are 300! ≈ 3.06e+614 possible outcomes from the shuffle.
For one of your 50-char headers or trailers to have survived the shuffle intact -- that is, to have remained where it was, or to have been 'reconstructed' at some other position -- the calculation goes something like this: at any given position, there are 50! ≈ 3.04e+64 possible ways the original 50 chars could have been rearranged, which makes the probability that an exact 50-char sequence reappears somewhere after one shuffle roughly 50! / 300! ≈ 9.94e-551. That is as close to 'impossible' as you could wish for.
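You can sanity-check those magnitudes without bignum arithmetic by working in logarithms. A quick sketch (Python used here purely as a calculator; `lgamma(n + 1)` gives the natural log of n!):

```python
from math import lgamma, log

def log10_factorial(n: int) -> float:
    # lgamma(n + 1) == ln(n!); divide by ln(10) for a base-10 logarithm
    return lgamma(n + 1) / log(10)

log_p = log10_factorial(50) - log10_factorial(300)   # log10(50! / 300!)
print(f"log10(50!)      = {log10_factorial(50):.2f}")    # ~64.48
print(f"log10(300!)     = {log10_factorial(300):.2f}")   # ~614.49
print(f"log10(50!/300!) = {log_p:.2f}")                  # ~-550.00
```

Those agree with the long decimal expansions above: 50! ≈ 3.04e+64, 300! ≈ 3.06e+614, and the ratio is around 1e-550.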
And that's for the shortest lengths. For your average lengths, you are talking about all the computers that have ever existed, do exist, or will ever exist, processing flat out until the heat death of the Universe.
Of course, it could also happen tomorrow -- that's the nature of randomness -- but repeating the shuffle doesn't make it any more or less likely to happen.
in reply to "understand any differences between List::Util qw(shuffle) and Array::Shuffle qw(shuffle_array)"
They both use the Fisher-Yates method. But ...
List::Util::shuffle takes the array, flattens it to a list on the stack, shuffles that list on the stack, and returns it; the result is then assigned back to an array.
Array::Shuffle takes a reference to an array and shuffles that in-place.
The latter is faster because it avoids the array-to-list and list-to-array conversions at either end.
Either -- over a suitable sample size -- will produce similar results. The latter gets you there more quickly.
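For reference, the Fisher-Yates pass both modules perform is a single in-place sweep that makes every permutation equally likely. Here is a minimal sketch (Python for illustration only; it is not the modules' actual C source, but the logic is the same):

```python
import random

def fisher_yates(items: list) -> None:
    """Shuffle `items` in place: walk backwards, swapping each slot with a
    uniformly chosen slot from the not-yet-fixed prefix (including itself)."""
    for i in range(len(items) - 1, 0, -1):
        j = random.randint(0, i)              # 0 <= j <= i, inclusive
        items[i], items[j] = items[j], items[i]

data = list(range(10))
fisher_yates(data)
print(data)  # some permutation of 0..9
```

Note the in-place swaps: no copy of the array is ever made, which is exactly why Array::Shuffle's shuffle-by-reference approach wins on large arrays.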