Limbic~Region has asked for the
wisdom of the Perl Monks concerning the following question:
All,
I popped into #perl today to catch the tail end of a question with Dominus and another chatter regarding randomizing a list of 100_000 elements. Dominus stated he would be surprised if any language had any builtin random number generator suitable for randomizing 100K elements because of the amount of entropy needed.
Well I wanted to be argumentative as it had been weeks since I used IRC. Once the other chatter was satisfied with pseudorandomness, I said that repeating the randomization on larger and smaller scales should result in an overall random list. I was thinking a reverse bin sort and here is the process I outlined. (Assume we have 64 elements in our list but only enough entropy to truly randomize 8).
A = 0108, B = 0916, C = 1724, D = 2532
E = 3340, F = 4148, G = 4956, H = 5764
Shuffle A  H
In turn, shuffle all 8 elements of each group as though it were a sing
+le list.
Wash, rinse, repeat
At this point, I was no longer interested in arguing for the sake of arguing (too bad my high school didn't have a debate team). I conceded the point of "truly" random but asked Dominus if he had any way of proving his assertion. I figured that it would be nearly impossible to tell the difference between a "truly" randomization of the list and one that resulted from many of my reorderings. In other words, I was happy to trade linear time and only processing the list once for missing bits of randomness. Unfortunately, he didn't know of one at that moment. So I have two questions.
 How many iterations of my process would it take before you had an acceptable^{*} fake?
 Using code, how can you determine the amount of randomness of a given list?
This entry on shuffling describes a process when using cards but I would like to see it applied to a list.
*  For some definition of acceptable
Re: Random Math Question by blokhead (Monsignor) on Oct 10, 2005 at 22:22 UTC 
I figured that it would be nearly impossible to tell the difference between a "truly" randomization of the list and one that resulted from many of my reorderings.
Looks like you and Dominus have bumped into the very large open problem in theoretical computer science of pseudorandom generation. Essentially, a pseudorandom generator (PRG) is an algorithm that takes a small "seed" of real randomness (say, n bits) and outputs a longer string (say, 2n bits) that "looks sufficiently random." Usually the definition of "looking sufficiently random" means that no polynomialtime algorithm can distinguish the output of the PRG from truly random bits (with a certain probability). Update: In layman's terms, the question is essentially: "Is it possible to tell (in a reasonable amount of time) how much randomness an algorithm uses just by looking at its output?"
Note that this is similar, but different from the notion of pseudorandom number generators (PRNGs) that you find in Perl and elsewhere. For these, "pseudorandom" means "they seem hard to predict so we hope it's pseudorandom in the above sense." ;)
PRGs are absolutely essential for provably secure cryptography. Before anyone asks, modern cryptographic algorithms (like RSA) are definitely not provably secure. Their security is based on widelyaccepted (but unproven) assumptions that certain problems (in this case, discrete logarithm & factoring) are sufficiently difficult to compute.
That PRGs actually exist is even a stronger assumption than P ≠ NP, so this is a very difficult question. Most researchers believe that they do exist on some level. There is a great wealth of research dealing with (assuming PRGs do exist, of course) how much they can expand their random seeds, how simple the PRG algorithms can be, etc.
Anyway, be encouraged: you are in good company thinking that there may be algorithms that don't use much randomness but whose outputs are impossible to distinguish from those that use lots of randomness. On the other hand, since this is an extremely hard problem, don't hold it against Dominus that he wasn't able to come up with a distinguishing algorithm off the top of his head. For that reason, be discouraged as well! Both sides of the problem are quite difficult ;)
Using code, how can you determine the amount of randomness of a given list?
An individual "shuffled" list is neither random or nonrandom. Randomness is a property of the process, not the individual outputs. When talking about an algorithm which tries to distinguish between a PRG and a truly random source, we (usually) allow for it to take multiple samples from the source before saying which kind of source it thinks it's getting. Even then, we allow for some probabilistic error in its decision.
Regarding the question of randomizing a list of 100_000 elements, you need to sample uniformly from the space of permutations of 100_000 elements. This requires log2(factorial(100000)) = 1516705 bits because that is the entropy of the random variable you wish to sample.
 [reply] 

 [reply] 

At the risk of devolving into a purely theoretical, impractical exercise (if it's not already too late (which it is)), here goes nothing ;)
There are two cases...

If pseudorandom generation is impossible, then we can tell (by sampling its output) how much true randomness any algorithm uses (call this the true entropy). In this case, the Mersenne Twister is nowhere near big enough. The MT has 2^19937 configurations, so a single MT has at most 19937 bits of entropy. This is nowhere near the 1.5 million bits required to sample the space in question. There would be a polynomialtime algorithm that would be able to tell (by sampling its output) whether or not your algorithm was using MT.

On the other hand, if the MT is really pseudorandom in the strong sense of my previous comment, then we can talk about not only its true entropy but also its computational entropy, that is, the amount of entropy it can "fool" all polynomialtime algorithms into thinking it uses.
From what I recall, if pseudorandom generation turns out to be possible in this strong sense, it is quite reasonable for a function's computational entropy to be much higher (say, by a squared factor) than its true entropy. In this case, MT could be sufficient to sample the desired space.
Essentially, if pseudorandom generation is possible, then bits from the pseudorandom generator are completely interchangable with truly random bits in the polynomialtime realm. If there is ever a case where it made a (statistically) significant difference in an algorithms output, then already that gives you a distinguishing algorithm that contradicts the definition of the pseudorandom generator! Neat, huh?
 [reply] 


blokhead,
Thanks for all the wonderful explanations.
Let's for this discussion assume that all we need is an algorithm that can potentially produce every p! permutation of a sequence (1..p)
I was wondering whether a two dimensional shuffling (not sure if that is the right term) would help us achieve such a suffle without having to resort to a random variable with entropy 1516705 bits. Here are the steps i have in mind 
1. Partition the sequence into klists (each list containing nelements) and that each list can be truly randomized. Also WLOG let's assume k <= n
2. For all i 1 to n and j = 1 to k,
a. generate integers (X,Y) ~ U from the planar region bounded by x = n and y = k.
b. Now we swap element(i,j) with element(X,Y).
Since each (i,j) is equally likely => i*n+j is equally likely. Is there anything wrong with this approach?
Thanks,
cheers
SK
 [reply] [d/l] [select] 
Re: Random Math Question by Dominus (Parson) on Oct 11, 2005 at 00:58 UTC 
Said Limbic~Region:
I figured that it would be nearly impossible to tell the difference
between a "truly" randomization of the list and one that resulted from many of my reorderings.
... Unfortunately, he didn't know of one at that moment.
Something I meant to say on IRC, but didn't get to, was that I don't think your question here
was exactly wellposed. Suppose someone asked you if you would be able to distinguish a true random number from one
that was only pseudorandom. Of course, you can't; the question isn't really sensical. The question they need to be asking in that case is whether you
would be able to tell a sequence of random numbers from a sequence of pseudorandom numbers.
And in that case the answer is that yes, methods for doing that are wellstudied.
So the question I think you need to ask here is whether a sequence of arrays shuffled by your method
would be distinguishable from a sequence of arrays shuffled into truly random orders. And having thought about it a little more
(but just a little) I think the answer is probably yes; you could apply analogs of
many of the usual statistical tests for randomness to such arrays, and find out that actually
the method you were using wasn't very random at all.
Section 3.3 of Knuth's The Art of Computer Programming goes into
these tests in great detail.
Setion 3.4.2 discusses random shuffling methods, including the FisherYates algorithm, and then says:
R. Salfi has pointed out that Algorithm P cannot possibly generate more than
m distinct permutations when we obtain the uniform U's with a linear congruential sequence
of modulus m, or indeed whenever we use a recurrence U_{n+1} =
f(U_{n}) for which U_{n} can take only
m different values, because the final permutation in such cases is entirely determined by the value of the first U
that is generated. Thus for example, if m = 2^{32}, certain
permutations of 13 elements will never occur, since 13! ≅ 1.42 × 2^{32}.
This was exactly my concern when I originally said that generating truly random permutations
was impossible with the pseudorandom generators supplied with most programming languages, including Perl.
The space of permutations is just too big. However, Knuth continues:
Section 3.5 shows that we need not despair about this.
I haven't read section 3.5 in a long time, and I don't have time to
pore over it now to figure out what Knuth meant by that, unfortunately.
So I really don't know what the situation is.
 [reply] 

Dominus,
So the question I think you need to ask here is whether a sequence of arrays shuffled by your method would be distinguishable from a sequence of arrays shuffled into truly random orders.
Oh, I totally agree now that I have had a night's rest and a few good replies. It was one of those things where, in trying to make a convincing argument, I lost sight of the forest through the trees. As I indicated in another reply, I considered changing the method to change what elements appeared in each group for every iteration. Without that change, it should be easy to see that the reordering isn't random.
I don't mind being wrong for the sake of a good discussion.
 [reply] 

Said Limbic~Region:
As I indicated in another reply, I considered changing the method to change what elements appeared in each group for every iteration. Without that change, it should be easy to see that the reordering isn't random.
As I kept saying on IRC, it doesn't matter how you change the method, because the results won't be random regardless of what method you choose. If you only have enough entropy to randomize 8 things, and you try to generate random permutations of 64 things, you are not going to be able to generate all possible permutations.
You asked for a method that I could use to tell that your method wasn't working. Here's one: give me a sample of 200,000 of the "randomized" 64element lists. You said "(Assume we have 64 elements in our list but only enough entropy to truly randomize 8)." Then I can tell there's something fishy because the sample you gave me will have lots of duplicate lists in it.
You have only about log(8!) = 15.3 bits of entropy, so in the 200,000 items, there will be several that appear five times each. But the probability of such an occurrence in a truly random sample of 200,000 shuffled lists is about 10^{356}. So it'll be totally obvious that you're cheating.
This will be true regardless of what clever method you are using to mix up the items. You can't get blood from a stone. As I said last night, what you're doing is like taking a 16bit PRNG and then using it to generate 32bit numbers by concatenating two 16bit numbers together. If you hand someone a list of your "random" 32bit numbers, it will be clear that they aren't really random.
 [reply] 

Re: Random Math Question by SciDude (Friar) on Oct 10, 2005 at 22:04 UTC 
If each of your groups were randomized at exactly the same time (or perhaps with the same seed) then they should be identical.
Shuffling these groups may mix them up a bit, but only provides for a folded set of groups each of which was identical.
In a perfect process, using the same seed and shuffling procedure would result in an exact copy of AH elements with each set created. I think we can all agree that the outcome of this effort is not random.
The most important reference in this area is Knuth, D. 1981. (1st ed. 1969.) The Art of Computer Programming. Volume 2: Seminumerical Algorithms. AddisonWesley.
I would also suggest a careful look at the Runs Test for Randomness  which is quite simple to understand and follow. Most test for randomness fall into the "runs" category.
If all else fails, simply relax your last requirement for suitable definition of acceptable and declare success!
SciDude
The first dog barks... all other dogs bark at the first dog.
 [reply] 
Re: Random Math Question by traveler (Parson) on Oct 10, 2005 at 21:54 UTC 
Your first question is way byond my mathematical understanding.
Regarding the second, check out random.org. They have papers on random numbers including comments on how to test for true randomness. They also generate true random numbers for you...
traveler  [reply] 
Re: Random Math Question by Roy Johnson (Monsignor) on Oct 10, 2005 at 22:12 UTC 
I may have misunderstood your suggestion, but it seems to me like your method would always keep elements in each group together. They'd be shuffled with respect to each other, but in a continuous block. To get them to migrate and intermingle, you'd want to split each group in half and shuffle the halfsize groups, then shuffle the members of (nothalfsize) groups.
I think MJD's objection to the generator is that, if you're shuffling 100,000 elements, you need a generator that can give you a random number between 1 and 100,000. If your generator doesn't have that, you can start with two random numbers, and use them as separate seeds. Each time you need a big random number, you get two small ones (which also become your two new seeds). You combine the two small random numbers into one big random number.
You can do tricks (skip a number occasionally) to keep them from having synchronized repetitions, and you can vary your method of combining them, to help eliminate patterns.
Caution: Contents may have been coded under pressure.
 [reply] 

Roy Johnson,
Since the groups themselves are getting shuffled as well, any 1 of the 100K numbers could end up in any position in the entire range. I had considered the selection of the groups each iteration to also be random but didn't think it was worth it. I didn't really think my position was correct.
I could have had the characterization of the problem wrong as I did come in on the tail end of it. My understanding was that you already had a list of 100K numbers and you needed to randomize the order. In this case, the numbers themselves don't matter just their positions. My process would be to apply the FisherYates shuffle many times but on different scales. Sometimes using chunks of the list as though they were single elements and sometimes as though each chunk were the entire list.
 [reply] 

 [reply] [d/l] 


Said Roy Johnson:
I think MJD's objection to the generator is that, if you're shuffling 100,000 elements, you need a generator that can give you a random number
between 1 and 100,000. If your generator doesn't have that, you can start with two random numbers, and use them as separate seeds.
No, that really wasn't my objection at all.
 [reply] 
Re: Random Math Question by ickyb0d (Monk) on Oct 10, 2005 at 21:43 UTC 
a little while ago i actually had to take a class on creating accurate simulations of systems... which included generating random numbers.
i believe that most computers (and their computer languages) are unable to really come up with a random set of numbers. most of the time most random number sets will require an initial seed to generate the random numbers from. if one is not provided it simply uses the time (epoch?) as the random seed. most random numbers generated from a computer are called pseudorandom because they can either be predicted from the seed, or they will begin to repeat at one point in time.
the only true way to get random numbers would be to have them generated by a physical process that is completely unpredictable (number of raindrops falling or perhaps radioactive decay). Thus it's very hard to actually determine the randomness of an algorithm, you can only hope to make it more random by perhaps randomizing the seed every time, or something to that effect.
some quick searches on comparing pseudorandom numbers to random numbers, or even looking up 'random number generators' should give you a little more insight onto your question.
 [reply] 

 [reply] 

sorry if i didn't answer your question initially. the thing is... that being random can mean, well... anything. it could mean they could all be the same number, or they could all be completely different numbers for each iteration.
my guess is that you are interested in having different numbers. in this case you would want to determine the random variance of the set of numbers. This will tell you how spread out the numbers are. You could also possibly take the correlation of all of your points (plotted on a graph). The lower the correlation the less related each iteration is to one another; but i'm not really sure if that's a good measure of randomness.
sorry if i still didn't answer your question, statistics never really was my strong point. i still think your best bet is to seek out other algorithms and see which one works best. if you don't have a lot of math background, it might be tough deciphering all the equations. but anyways, i hope this gives you some more insight about all of this.
 [reply] 
Re: Random Math Question by sfink (Deacon) on Oct 11, 2005 at 04:09 UTC 
The only precise definition of "random" that seems possible is if every possible permutation of the list is equiprobable. Which means that the minimum number of bits of entropy you need is log2(100_000!), because if you had any less then there will be at least one permutation that is not derived from one of the seeds, and thus is not equiprobable.
That's a big number. Um... quick google search on "factorial approximation" shows that ln(100_000!) is about 100_000*ln(100_000)100_000, and e is about 2, so you need about 100_000*ln(100_000) which is about 1.2 million bits.
That's really not all that much information  but way more than a computer can typically provide in a short amount of time. It's not a language issue; in fact, it's more of a hardware issue. Just like us, computers can only perceive the world through external devices (senses), and so they can't "make up" bits of entropy (aka truly random numbers) on their own. They need some sort of device that feeds in bits of entropy gradually from thermal noise or whatever.
Linux implements this, in the form of /dev/random. It pulls entropy from IRQ timings, mouse movements, keyboard event interarrival times, and more stuff that I don't understand. I see one person who made his disks go crazy to measure the rate of entropy "production", and he saw 11.5 KB/sec. So that would take 20 minutes to come up with enough bits to randomize your list of 100_000 things. Not infeasible, though most applications would rather not wait that long!
Of course, such truly random numbers generally aren't distinguishable from good pseudorandom numbers, so I wanted to mention my favorite way of gauging randomness:
In the early, early days of the web, I stumbled across a site (I think it was an HTML site; can't remember for sure) with a random number contest. People would send in their opinions of the "most random number" between 1 and 100, and after a suitable period of time waiting for all submissions, the winner would be announced  whichever number was chosen by the fewest submitters.
Works for me.
 [reply] 
Re: Random Math Question by fluxion (Monk) on Oct 11, 2005 at 04:00 UTC 
 [reply] [d/l] 
Re: Random Math Question by Moron (Curate) on Oct 11, 2005 at 11:26 UTC 
The concept of randomness carries with it a puzzling paradox. For example, whatever test you develop to test for randomness, then one can argue that if it obeyed the test, it can't be random if it conformed to a defined pattern, conclusion being that the validity of any such test is disproved by reductio ad absurdum. Localised shuffling cannot be even pseudorandom because a minimum definition for randomness has to include that all candidates have an equal chance of selection. Obviously pseudorandom doesn't quite do that, although in a way it does if you can manage to hide some of the calculation details from yourself, so that you can't make skewed predictions like the result being in a certain range.
 [reply] 
Re: Random Math Question by blazar (Canon) on Oct 11, 2005 at 08:51 UTC 
I popped into #perl today to catch the tail end of a question with Dominus and another chatter regarding randomizing a list of 100_000 elements. Dominus stated he would be surprised if any language had any builtin random number generator suitable for randomizing 100K elements because of the amount of entropy needed.
A possibly relevant link that doesn't seem to have been mentioned yet: The Marsaglia Random Number CDROM including the Diehard Battery of Tests.
 [reply] 
Re: Random Math Question by Perl Mouse (Chaplain) on Oct 11, 2005 at 13:26 UTC 
I don't have much to add after Dominus's remarks, expect perhaps rephrasing it a little.
If you have a set of N things, then there are N! possible permutations, or shufflings, of that set. A shuffle algorithm is fair (and random) if every possible permutation can be a result of the algorithm, and with an equal chance.
For instance, if you have the set "A, B, C", the algorithm should return "A, B, C" 1 out of 6 times, "A, C, B" 1 out of 6 times, "B, A, C" 1 out of 6 times, etc.
But the only difference between different runs of the program is the "state" of you random number generator, for whatever "state" is meaningful (seed for a typical PRNG, bitsequence to be read from /dev/u?random). Your sequence of random numbers must be able to distinguish between at least N!  in this case 100,000!. For a PRNG that's available in Perl or many OSses, where the sequence returned depends only on the seed, it means the seed must be large enough to have at least 100,000! different states.
As others have calculated, that requires more than a million bits. And this is only a lower bound, your algorithm must be good enough as well.
Alternatively, you must read more than a million bits from some device that generates random bits. /dev/random on Linux for instance, or a camera in front of a lava lamp. Brownian motion and radioactive decay are other good sources to generate random bits with, although I do not know of any implementations.
 [reply] 
Re: Random Math Question by tbone1 (Monsignor) on Oct 11, 2005 at 13:47 UTC 
Dominus stated he would be surprised if any language had any builtin random number generator suitable for randomizing 100K elements because of the amount of entropy needed.
If they could get our upper management in the code, they'd have enough entropy to cause the heatdeath of the universe.

tbone1, YAPS (Yet Another Perl Schlub)
And remember, if he succeeds, so what.
 Chick McGee
 [reply] 

