http://www.perlmonks.org?node_id=498965


in reply to Random Math Question

I figured that it would be nearly impossible to tell the difference between a "true" randomization of the list and one that resulted from many of my re-orderings.
Looks like you and Dominus have bumped into the very large open problem in theoretical computer science of pseudorandom generation. Essentially, a pseudorandom generator (PRG) is an algorithm that takes a small "seed" of real randomness (say, n bits) and outputs a longer string (say, 2n bits) that "looks sufficiently random." Usually the definition of "looking sufficiently random" means that no polynomial-time algorithm can distinguish the output of the PRG from truly random bits, except with negligible probability. Update: In layman's terms, the question is essentially: "Is it possible to tell (in a reasonable amount of time) how much randomness an algorithm uses just by looking at its output?"
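In code, the definition is essentially a guessing game. Here is a minimal Perl sketch of it (the toy generator and game harness are my own illustration, not a real PRG candidate):

    use strict;
    use warnings;

    my $n = 16;    # seed length in bits

    # Toy "generator": stretch an n-bit seed into 2n bits with a linear
    # congruential step. Deliberately weak -- real PRG candidates rest
    # on hardness assumptions.
    sub stretch {
        my ($seed) = @_;    # an n-bit integer
        my ($x, @bits) = ($seed);
        for (1 .. 2 * $n) {
            $x = (1103515245 * $x + 12345) % 2**31;
            push @bits, $x & 1;
        }
        return @bits;
    }

    # One round of the distinguishing game: flip a coin, hand the
    # distinguisher either 2n truly random bits or 2n stretched bits,
    # and record whether it guessed the source correctly.
    sub play {
        my ($distinguisher) = @_;
        my $use_prg = int rand 2;
        my @bits = $use_prg ? stretch(int rand 2**$n)
                            : map { int rand 2 } 1 .. 2 * $n;
        return $distinguisher->(@bits) == $use_prg ? 1 : 0;
    }

    # e.g. a naive distinguisher that just counts 1-bits:
    my $naive = sub { (grep { $_ } @_) > @_ / 2 ? 1 : 0 };
    my $wins = 0;
    $wins += play($naive) for 1 .. 10_000;
    printf "naive distinguisher wins %.1f%%\n", 100 * $wins / 10_000;

The generator is pseudorandom precisely when every polynomial-time distinguisher wins this game only negligibly more than half the time.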

Note that this is similar to, but different from, the notion of pseudorandom number generators (PRNGs) that you find in Perl and elsewhere. For these, "pseudorandom" means "they seem hard to predict, so we hope they're pseudorandom in the above sense." ;)

PRGs are absolutely essential for provably secure cryptography. Before anyone asks, modern cryptographic algorithms (like RSA) are definitely not provably secure. Their security is based on widely-accepted (but unproven) assumptions that certain problems (in this case, discrete logarithm & factoring) are sufficiently difficult to compute.

That PRGs actually exist is an even stronger assumption than P ≠ NP, so this is a very difficult question. Most researchers believe that they do exist on some level. There is a great wealth of research dealing with how much PRGs (assuming they do exist, of course) can expand their random seeds, how simple the PRG algorithms can be, etc.

Anyway, be encouraged: you are in good company thinking that there may be algorithms that don't use much randomness but whose outputs are impossible to distinguish from those that use lots of randomness. On the other hand, since this is an extremely hard problem, don't hold it against Dominus that he wasn't able to come up with a distinguishing algorithm off the top of his head. For that reason, be discouraged as well! Both sides of the problem are quite difficult ;)

Using code, how can you determine the amount of randomness of a given list?
An individual "shuffled" list is neither random nor non-random. Randomness is a property of the process, not of its individual outputs. When talking about an algorithm that tries to distinguish between a PRG and a truly random source, we (usually) allow it to take multiple samples from the source before saying which kind of source it thinks it's getting. Even then, we allow for some probabilistic error in its decision.
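To make "property of the process" concrete, here is a minimal Perl sketch of such a sampling test (List::Util's shuffle stands in for the source under test): it tallies which permutations the source produces over many runs, since no single output can convict it.

    use strict;
    use warnings;
    use List::Util 'shuffle';    # the source under test

    my %count;
    my $trials = 60_000;
    for (1 .. $trials) {
        my @p = shuffle(1 .. 3);    # only 3! = 6 possible outputs
        $count{"@p"}++;
    }

    # A uniform source produces each permutation ~1/6 of the time.
    # Large, persistent deviations across many samples -- never any
    # single output -- are what betray a non-uniform process.
    printf "%s: %.4f\n", $_, $count{$_} / $trials for sort keys %count;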

Regarding the question of randomizing a list of 100_000 elements: to do it perfectly, you need to sample uniformly from the space of all its permutations. This requires log2(factorial(100000)) = 1516705 bits, because that is the entropy of the random variable you wish to sample.
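The figure is easy to verify numerically; a quick Perl sketch that just sums logarithms:

    use strict;
    use warnings;

    # log2(100000!) = log2(2) + log2(3) + ... + log2(100000)
    my $bits = 0;
    $bits += log($_) / log(2) for 2 .. 100_000;
    printf "%.0f bits\n", $bits;    # prints 1516705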

blokhead

Re^2: Random Math Question
by BrowserUk (Patriarch) on Oct 10, 2005 at 23:53 UTC
    This requires log2(factorial(100000)) = 1516705 bits ...

    Could you elaborate a little on this please? Specifically, how do those 1516705 bits relate to the period of a PRNG like the Mersenne Twister (2^19937 - 1)?


      At the risk of devolving into a purely theoretical, impractical exercise (if it's not already too late (which it is)), here goes nothing ;)

      There are two cases...

      1. If pseudorandom generation is impossible, then we can tell (by sampling its output) how much true randomness any algorithm uses (call this its true entropy). In this case, the Mersenne Twister is nowhere near big enough: the MT has 2^19937 configurations, so a single MT has at most 19937 bits of entropy, far short of the ~1.5 million bits required to sample the space in question (a back-of-envelope sketch follows case 2). Some polynomial-time algorithm would be able to tell, by sampling its output, whether or not your algorithm was using MT.

      2. On the other hand, if the MT is really pseudorandom in the strong sense of my previous comment, then we can talk about not only its true entropy but also its computational entropy, that is, the amount of entropy it can "fool" all polynomial-time algorithms into thinking it uses.

        From what I recall, if pseudorandom generation turns out to be possible in this strong sense, it is quite reasonable for a function's computational entropy to be much higher (say, by a squared factor) than its true entropy. In this case, MT could be sufficient to sample the desired space.
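      For a sense of scale in case 1, here is a quick Perl back-of-envelope sketch (my own, reusing the entropy formula from my previous reply) that finds the largest list whose permutations all fit within 19937 bits:

          use strict;
          use warnings;

          # Largest n with log2(n!) <= 19937, i.e. the largest list a
          # source with 19937 bits of entropy could even in principle
          # shuffle to every one of its permutations.
          my ($n, $bits) = (1, 0);
          while ($bits + log($n + 1) / log(2) <= 19937) {
              $n++;
              $bits += log($n) / log(2);
          }
          print "$n\n";    # prints 2080 -- far short of 100_000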

      Essentially, if pseudorandom generation is possible, then bits from the pseudorandom generator are completely interchangeable with truly random bits in the polynomial-time realm. If there were ever a case where they made a (statistically) significant difference in an algorithm's output, that would already give you a distinguishing algorithm, contradicting the definition of the pseudorandom generator! Neat, huh?

      blokhead

        Thanks. For your amusement, I came across this paragraph whilst looking around for further reading on entropy (my emphasis):

        Another major source of confusion about entropy change as the result of simply rearranging macro objects comes from information theory "entropy" [2]. Claude E. Shannon's 1948 paper began the era of quantification of information, and in it he adopted the word "entropy" to name the quantity that his equation defined [2]. This occurred because a friend, the brilliant mathematician John von Neumann, told him "call it entropy; no one knows what entropy really is, so in a debate you will always have the advantage" [3]. Wryly funny for that moment, Shannon's unwise acquiescence has produced enormous scientific confusion due to the increasingly widespread usefulness of his equation and its fertile mathematical variations in many fields other than communications [4, 5].

Re^2: Random Math Question
by sk (Curate) on Oct 11, 2005 at 05:43 UTC
    blokhead, Thanks for all the wonderful explanations.

    For this discussion, let's assume that all we need is an algorithm that can potentially produce all p! permutations of a sequence (1..p).

    I was wondering whether a two-dimensional shuffling (not sure if that is the right term) would help us achieve such a shuffle without having to resort to a random variable with entropy 1516705 bits. Here are the steps I have in mind:

    1. Partition the sequence into k lists (each list containing n elements), and assume that each list can be truly randomized. Also, WLOG, let's assume k <= n.

    2. For all i = 1 to n and j = 1 to k:

    a. Generate integers (X, Y) uniformly at random from the grid 1 <= X <= n, 1 <= Y <= k.

    b. Now we swap element(i,j) with element(X,Y).

    Since each (i,j) is equally likely, each flattened position i*n + j is equally likely too. Is there anything wrong with this approach? (A Perl sketch of these steps follows.)
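    A minimal Perl sketch of the steps above (the k x n layout and in-place swaps are my reading of the proposal; whether this samples the permutations uniformly is exactly the question being asked):

        use strict;
        use warnings;

        my ($k, $n) = (4, 8);    # k lists of n elements each, k <= n

        # Step 1: partition the sequence 1 .. k*n into k lists of n.
        my @grid = map { [ $_ * $n + 1 .. ($_ + 1) * $n ] } 0 .. $k - 1;

        # Step 2: for each cell (i, j), draw (X, Y) uniformly from the
        # n-by-k grid and swap element(i, j) with element(X, Y).
        for my $j (0 .. $k - 1) {
            for my $i (0 .. $n - 1) {
                my ($X, $Y) = (int rand $n, int rand $k);
                ($grid[$j][$i], $grid[$Y][$X]) = ($grid[$Y][$X], $grid[$j][$i]);
            }
        }

        my @shuffled = map { @$_ } @grid;    # flatten back to one list
        print "@shuffled\n";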

    Thanks,

    cheers

    SK