PerlMonks
Re: Random data from db
by sundialsvc4 (Abbot)
on Mar 01, 2012 at 14:16 UTC ( [id://957228] )
The most appropriate response depends on the volume of data. You don’t want to bring all of the keys into memory, you don’t want to do a full table scan, and you do want a statistically valid random sample. Unfortunately, you can’t have all of these things at once. The best strategy I have come up with so far (having done this many times) is as follows:

1. SELECT COUNT(*) to learn the population size, m.
2. Generate random integers in the range [1, m], drawing roughly two to three times the sample size i so that duplicates can be discarded.
3. De-duplicate and keep exactly i distinct values: this is your pool of row ordinals.
4. Scroll (or, better, skip) through the primary keys in a fixed order, keeping each key whose ordinal position is in the pool, and stop once you have all i of them.
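A minimal sketch of this count/pool/scan strategy (Python here for brevity, although Perl with DBI would be the natural choice on this site; the table and column names below are hypothetical, not from the post):

```python
import random
import sqlite3

def sample_keys(conn, table, pk, i, rng=random):
    """Draw i distinct primary keys without ORDER BY RAND() and without
    holding every key in memory: only a pool of integers is kept."""
    (m,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    if i > m:
        raise ValueError("sample larger than population")
    # Hit the RNG about i times: build a pool of i distinct row ordinals.
    pool = set()
    while len(pool) < i:
        pool.add(rng.randrange(m))
    # Scan the keys in a fixed order, keeping only the chosen ordinals;
    # with a server-side cursor you would skip ahead instead of fetching all.
    keys = []
    for pos, (key,) in enumerate(
            conn.execute(f"SELECT {pk} FROM {table} ORDER BY {pk}")):
        if pos in pool:
            keys.append(key)
            if len(keys) == i:
                break
    return keys

# Demo against a throwaway in-memory table (names are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE widgets (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO widgets (id) VALUES (?)",
                 [(n,) for n in range(1, 1001)])
sample = sample_keys(conn, "widgets", "id", 10, rng=random.Random(1))
print(sorted(sample))
```

Note that the pool holds nothing but small integers, so memory cost is independent of key size, and the sample is uniform because every ordinal in [0, m) is equally likely to enter the pool.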
The statistically valid random selection of records, and therefore of primary keys to be processed, occurs in steps 2 and 3 above. Memory consumption will be two to three times the number of integers required for the pool, but it will only be integers.

Other strategies, such as ordering by the SQL RAND() function, are extremely expensive for the server, as you can plainly see by running EXPLAIN on the queries you have in mind. Scrolling through the dataset from the beginning and “flipping a coin each time” will not produce an even distribution: it will weight toward the front of the table and may never reach the rear. (If you want i random draws from a population of size m, you want to hit the random-number generator about i times, not m.)

In my experience, server-side stored-procedure languages are usually too primitive (if they exist at all) to support this algorithm. You do, unfortunately, pay “wire time” to move all those keys from one computer to another only to discard most of them, so consider which computer should do the work if that number is extremely large, and skip through the recordset if your database system allows it. Run EXPLAIN against every query you contemplate using, and do it on the actual database at its actual size, not on a developer’s much smaller proxy.
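The front-weighting of the coin-flip scan is easy to demonstrate with a quick simulation (Python for brevity; the population size and seed are illustrative assumptions, not from the post):

```python
import random

def coin_flip_sample(m, i, p=0.5, rng=random):
    """Scan rows 0..m-1 front to back, keeping each with probability p
    until i rows are kept -- the biased scheme the text warns against."""
    picked = []
    for row in range(m):
        if rng.random() < p:
            picked.append(row)
            if len(picked) == i:
                break
    return picked

rng = random.Random(42)
m, i = 100_000, 10
biased = coin_flip_sample(m, i, rng=rng)   # stops long before the rear
fair = rng.sample(range(m), i)             # ~i RNG hits, uniform over the table
print(max(biased), max(fair))
```

With a fair coin the 10th “head” arrives after roughly 20 rows, so the biased sample never sees the back 99% of the table, while the uniform sample’s largest index is typically near m.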
In Section: Seekers of Perl Wisdom