Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Re^2: How likely is rand() to repeat?

by desertrat (Acolyte)
on Mar 08, 2012 at 21:50 UTC ( #958545=note: print w/ replies, xml ) Need Help??


in reply to Re: How likely is rand() to repeat?
in thread How likely is rand() to repeat?

Ok, the numbers will unlikely even get into the millions, so I'm probably good. good enough that I (or rather my users) can live with the one DB error in my lifetime...

Or I can belt&suspender it and re-run the key creating function if my DB throws a constraint violation...

The OS in question is 64-bit Centos 6, for what that's worth.


Comment on Re^2: How likely is rand() to repeat?
Re^3: How likely is rand() to repeat?
by Your Mother (Canon) on Mar 08, 2012 at 22:01 UTC

    Or run this till the cows come home.

    perl -MData::UUID -le 'print Data::UUID->new->create_str'

    Or write a sequence generator.

    Update: for less predictability in recreating-

    perl -MData::UUID -le 'print Data::UUID->new->create_from_name_str("ta +blename?",rand())'
Re^3: How likely is rand() to repeat?
by BrowserUk (Pope) on Mar 09, 2012 at 11:53 UTC
    Or I can belt&suspender it and re-run the key creating function if my DB throws a constraint violation...

    Indeed a wise precaution. If performance isn't an issue, using Math::Random::MT is another.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      "Just use a Mersenne Twister" doesn't solve the problem by itself. Of course, if all of the IDs are going to be generated by a single, long-running process then the typical MT would give one just under 20K bits of random state (which would solve the problem). But that seems quite unlikely, especially because such an arrangement would allow for a much simpler assignment of unique IDs.

      To solve the problem, you would (probably) have to "use a Mersenne Twister" and come up with a lot more bits of randomness for seed data each time a new process creates its first ID (and use all those random bits to initialize the MT).

      Now, "just use an MT" does help some if each process will be generating a lot of IDs. If, for example, you can convince yourself that only 2**-10 of the IDs will be "the first ID generated" by any given process, then the MT has, in effect, given you 10 more bits of randomness to help in avoiding ID collisions (up to the ~20k bits of state for the usual MT).

      So, a typical implementation of Perl with a rand() period of roughly 2**31 means base-62 IDs longer than 6 characters are just a waste. If we assume that Perl's own srand() actually is successful at picking seeds in a way that effectively covers the ~31-bit space of effective seed values (likely true on a system with /dev/urandom and a modern Perl, less true otherwise), then using long-running Perl interpreter instances to get only 2**-10 of the IDs being "first" IDs, then those 10 extra bits move us from 5.2-character IDs to 6.9-character IDs. So it would only be a waste if we produce IDs longer than 7 characters. Nowhere near making 25-character IDs effective.

      Coming up with reliably random bits is a harder problem for computers, one that is covered by /dev/random. So one could use /dev/random to get 20K (or at least 149) random bits to initialize each MT instance in order to solve the problem (if /dev/random is available). I suspect even getting the bits from /dev/urandom would solve this problem (as I think using just /dev/urandom directly to build the IDs would, in the worst case, act like using a single, shared MT instance).

      Or you could use your own persistent store like a file of "random" bytes that you carefully update each time you read seed data from it. Two processes reading the same seed data could, of course, present a problem. Blocking process initialization waiting for the seed file to be updated in single-file might present problems for scaling such a solution. So there are some minor challenges to be aware of. Or create a service to fetch IDs from.

      - tye        

        "Just use a Mersenne Twister" doesn't solve the problem by itself.

        There is no "problem" to solve.

        The OP has stated that he needs less than a million IDs and is running on a 64-bit OS.

        If he picks one starting point per run from the 2^64 that are possible -- using his built-in rand() or the MT19937-64 -- then he will be using 0.0000000000054210108624275221700372640043497% of the possibilities.

        Another way: If he picks 1 new string, from a new starting position each time, every microsecond, he can expect a duplicate sometime in the next 50,000 years.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        The start of some sanity?

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://958545]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (9)
As of 2014-09-18 22:26 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (125 votes), past polls