Re: Random data from db

The most-appropriate response depends on the volume of data. You don’t want to bring all of the keys into memory, and you don’t want to have to do a full table scan, and you do want a statistically valid random sample. Unfortunately you can’t have all these things at once. The best strategy I have come up with so far (and having done this many times) is as follows:

Obtain the number of rows in the table; call that number m.
Generate random numbers in the range [0..n) through int(rand(n)). Generate about 10% more numbers than you you will need, to allow for duplicates. (The smaller the table, the more likely there will be dupes.)
Sort this list and remove dupes with sort uniq. If you find that you have fewer numbers than you require, repeat the process (and/or increase the “10%” threshhold).
Now, you have to pay for one table-scan. Open a query of select primary_key from table; do not specify order by.
Using a counter i which you increment starting at 0 each time you loop, walk through the table. When i equals the index at the top of the sorted list, select that record and pop the list. (Obviously, if you do have at your disposal the means to “skip forward n records in the recordset,” you should redesign this algorithm to take advantage of that. Since the indexes in the array are sorted, the amount to skip is obvious.)
Repeat until the list is empty. You should not reach the end of the table first. (If you do, die because you have a bug.)

The statistically valid random selection of records and therfore primary-keys to be processed occurs in steps 2 and 3 above. Memory consumption will be for 2 to 3 times the number of integers required for the pool, but it will only be for integers.

Other strategies such as using the SQL RAND() function are extremely expensive for the server, as you can very plainly see from using explain on the queries that you have in mind. Scrolling through the dataset from the beginning, “flipping a coin each time,” will not produce an even distribution ... it will weight toward the front of the table and may never get to the rear. (If you want i random draws from a population of size m, you want to hit the random number generator about i times, not m.) In my experience, server-side stored procedure languages are usually too primitive (if they exist at all) to support this algorithm. You do unfortunately pay for “wire time” moving all those keys from one computer to another only to discard them, so consider which computer to use to do the work if that number is extremely large ... and skip through the recordset if you possibly can do so with your database system. Use explain against every query that you contemplate using, and do them on the actual database with its actual size, not a developer’s much-smaller proxy.


XP is just a number
	PerlMonks