Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

How likely is rand() to repeat?

by desertrat (Sexton)
on Mar 08, 2012 at 20:53 UTC ( [id://958533]=perlquestion: print w/replies, xml ) Need Help??

desertrat has asked for the wisdom of the Perl Monks concerning the following question:

For nefarious purposes of my own (actually to make non-predictable unique keys for a db table 8-) I'm doing the following to generate random 25-character strings:

my @chars=("A".."Z","a".."z",0..9); my $key = join("", @chars[ map { rand @chars } ( 1 .. 25 ) ]);

How likely is this to ever come up with the same 25 char string more than once? Can I realistically treat this as a non-repeating random function?

Replies are listed 'Best First'.
Re: How likely is rand() to repeat?
by BrowserUk (Patriarch) on Mar 08, 2012 at 21:36 UTC

    Mathmatically, I think the odds of any given pick being a duplicate of the previous pick are ~1/6.45e+44, which is vanishingly unlikely. Obviously, the more you pick the more likely a duplicate, but given the starting odds, even if you are picking trillions they are still very unlikely.

    Something like 1e12*2(*)/6.45e44 ~= 1/3.22e32. (For reference, that about a million times the number of grams in the entire Earth!)

    ((*) For the Birthday paradox)

    Of course, it also depends somewhat upon the validity of your rand(), but even using the notoriously poor rand() built-in to Win32 perl, I didn't get a single dup in 1e7 picks:

    undef %h; ++$h{ join'', map $chars[ rand @chars ], 1 .. 25 } for 1 .. 1e7; print scalar keys %h;; 10000000

    You could use a better rand, like Math::Random::MT, but it is probably unnecessary unless you are picking trillions. It is also much slower than the built-in on my machine.

    Finally, no matter how good your rand(), even with those vanishingly small odds, there are no guarantees that you won't get a duplicate. The odds may be very small against it happening, but it still can happen. You are very, very unlikely to see it happen twice in your lifetime though.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      Ok, the numbers will unlikely even get into the millions, so I'm probably good. good enough that I (or rather my users) can live with the one DB error in my lifetime...

      Or I can belt&suspender it and re-run the key creating function if my DB throws a constraint violation...

      The OS in question is 64-bit Centos 6, for what that's worth.

        Or run this till the cows come home.

        perl -MData::UUID -le 'print Data::UUID->new->create_str'

        Or write a sequence generator.

        Update: for less predictability in recreating-

        perl -MData::UUID -le 'print Data::UUID->new->create_from_name_str("ta +blename?",rand())'
        Or I can belt&suspender it and re-run the key creating function if my DB throws a constraint violation...

        Indeed a wise precaution. If performance isn't an issue, using Math::Random::MT is another.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        The start of some sanity?

Re: How likely is rand() to repeat?
by JavaFan (Canon) on Mar 08, 2012 at 21:56 UTC
    Well, there are 6225 == 645345427773512447880377451634304602899218432 different strings of length 25, over an alphabet of 62 characters. If you assume the rand() function was perfect, you want to solve the birthday paradox for a body that has 645345427773512447880377451634304602899218432 days in a year. If you generate N strings, a rough estimation of the probability that all of them are different is: e-N2/1290690855547024895760754903268609205798436864

    But that's all irrelevant. Your random number generator is likely to only have 24, 32, 48 or 64 bits of entropy. Which means that most of the possible strings will never be generated. Even with 64 bits of entropy, there are only 18446744073709551616 strings possible. Which makes is far more likely for collisions to happen. In fact, Wikipedia tells us that with a 64 bit hash, there's a one procent chance you have a collision if you have 6.1E8 strings.

    For more details, see MathWorld.

      But that's all irrelevant. Your random number generator is likely to only have 24, 32, 48 or 64 bits of entropy. Which means that most of the possible strings will never be generated. Even with 64 bits of entropy, there are only 18446744073709551616 strings possible. Which makes is far more likely for collisions to happen.

      That would only be the case if he was only calling rand() once to pick from the 6.45e44 strings. But he is calling rand 25 times. Which as each pick is 'independant' of the previous and next picks, it increases the effective entropy.

      By way of proof. If you throw a die once, you can only get 1 .. 6; but throw it twice and you can get 1 .. 36.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The start of some sanity?

        No. Perl's rand() is documented to be completely repeatable when given the same starting point (see srand). So the number of strings possible using just rand() is bound by roughly 2 ** (the number random bits) or, more correctly, the number of effectively different srand() values.

        Update: And every implementation that I've seen uses a simple LCPRNG (Linear Congruential Psuedo-Random Number Generator). So each has a fixed number of possible values and it will simply cycle through those in the exact same order each time if you actually call it 2**$bits times. The number of effectively different seed values is called the "period".

        Update: Oh, forgot to mention that some implementations intentionally use only a subset of the (upper) bits so the period can often be much higher than 2**$bits. For example, a period of 2**32 is quite common while many such implementations only use 16 bits of each result. And I'm not completely sure, in such a case, whether Perl Config reports that as 16 or 32 "random bits".

        Note: Just FYI, the above two updates were posted within a few minutes of the original submission (and thus nearly 2 hours before the first reply).

        - tye        

        Which as each pick is 'independant' of the previous and next picks, it increases the effective entropy.
        But that's exactly the problem. As I indicated, if rand() would be perfect, it will pick each of the 645345427773512447880377451634304602899218432 different possible string with an equal chance.

        But with most implementations of rand the picks are not independant; the sequence of return values of rand is completely determined by the value of the seed. If srand takes a 64 bit number as argument, then there are at most 264 sequences of return values of rand() possible. But that requires an excellent implementation of rand(), and you got to be lucky enough the mapping to 0..61 doesn't provide duplicates.

Re: How likely is rand() to repeat?
by jettero (Monsignor) on Mar 08, 2012 at 20:59 UTC

    I think that's a binomial problem … beyond that, it really depends on where your random numbers are coming from. Modern OSes have csrngs in them — and, I suppose a few shitty bits of entropy from your mouse movements, if that works at all. Does yours? Does your CPU have one of those weird intel dual-frequency hrngs? Mine has a dedicated entropy generator, but crypto nerds are quick to point out you don't need them if you have a good csrng, just a couple good seed bits and you're fine.

    I'd expect what you have to repeat eventually. Use a hash, say Digest::SHA1, and I'd expect that to be pretty damn random. Feed it the time, a rand() number, and some sequential number. I'd expect that to have a lot more bits of doesn't-repeat than your method, if only because 25 bytes isn't very many when I have databases with a few hundred thousand payment records in them. Ya know?

Re: How likely is rand() to repeat?
by aaron_baugher (Curate) on Mar 10, 2012 at 14:51 UTC

    I'm following this discussion with interest, though I don't completely understand it. But it reminds me that on the old Commodore 64, we got random values by reading a value from an 8-bit register on the sound chip after putting it into 'noise' mode. My understanding was that this was truly random, and although there were only 256 values, people recorded tens of thousands without a pattern ever repeating.

    So I'm wondering: Why don't we have something like that in systems today? That function could probably be added to a BIOS chip today for less than a penny. I suppose that reading voltage fluctuations or keyboard timings is another way to bring in outside randomness, but reading a value from a register sure was a simple way to do it.

    Aaron B.
    My Woefully Neglected Blog, where I occasionally mention Perl.

      The trouble with hardware generators is that, as there is no record of their output, you will never detect if the have gone wrong. For example, it is possible for pink noise generators to devolve into harmonic resonance -- through partial hardware failure or externally source interference -- which could combine with particular sampling rates to generate short repeating sequences.

      The problem was first recognised by Von Neumann:

      Von Neumann judged hardware random number generators unsuitable, for, if they did not record the output generated, they could not later be tested for errors. If they did record their output, they would exhaust the limited computer memories available then, and so the computer's ability to read and write numbers. If the numbers were written to cards, they would take very much longer to write and read. On the ENIAC computer he was using, the "middle square" method generated numbers at a rate some hundred times faster than reading numbers in from punched cards.

      Of course, he was working with valves (vacuum tubes), coils etc. It is probably far less likely for modern electronics to be susceptible to such interference or failure, but the underlying problem of being unable to detect failure remains.

      The repeatability of PRNGs has the distinct advantage that you can subject them to a battery of repeatable tests before using them.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The start of some sanity?

Re: How likely is rand() to repeat?
by sundialsvc4 (Abbot) on Mar 08, 2012 at 22:44 UTC

    Yeah, I routinely put the insert-queries into a short while loop, but I have never seen one actually go off.   Ever.   Still, I write them.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://958533]
Approved by jettero
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others imbibing at the Monastery: (6)
As of 2024-04-23 08:29 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found