Random data from db

mendeepak has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Random data from db by Anonymous Monk on Mar 01, 2012 at 07:42 UTC
Yes, there is a way to retrieve random data from MySQL db using Perl, see Tutorials	[reply]
Re^2: Random data from db by mendeepak (Scribe) on Mar 01, 2012 at 10:41 UTC
i didn't got any answer in Tutorial that is why posted the question...	[reply]
Re^3: Random data from db by marto (Cardinal) on Mar 01, 2012 at 10:45 UTC
Using DBI you can perform SQL queries. Create a SQL query which returns the data you wish, execute the query in Perl using DBI.	[reply]
Re^3: Random data from db by Anonymous Monk on Mar 01, 2012 at 10:44 UTC
how do you mean, Database Programming?	[reply]
Re: Random data from db by rovf (Priest) on Mar 01, 2012 at 10:46 UTC
Not strictly a Perl question. A quick search using Google brought for example the following solution: http://akinas.com/pages/en/blog/mysql_random_row/ -- Ronald Fischer <ynnor@mm.st>	[reply] [d/l]
Re: Random data from db by JavaFan (Canon) on Mar 01, 2012 at 11:03 UTC
Untested: `my $dbh = ... get handle to database ...; my $sth = $dbh->prepare("SELECT * FROM TABLE"); $sth->execute; my @buffer; my $seen; while (my $row = $sth->selectrow_arrayref) { $seen++; if (@buffer < 100) { push @buffer, $row; next; } my $index = int rand $seen; if ($index < @buffer) { $buffer[$index] = $row; } }` [download]	[reply] [d/l]
Re: Random data from db by sundialsvc4 (Abbot) on Mar 01, 2012 at 14:16 UTC
The most-appropriate response depends on the volume of data. You don’t want to bring all of the keys into memory, and you don’t want to have to do a full table scan, and you do want a statistically valid random sample. Unfortunately you can’t have all these things at once. The best strategy I have come up with so far (and having done this many times) is as follows: Obtain the number of rows in the table; call that number `m`. Generate random numbers in the range `[0..n)` through `int(rand(n))`. Generate about 10% more numbers than you you will need, to allow for duplicates. (The smaller the table, the more likely there will be dupes.) Sort this list and remove dupes with `sort uniq`. If you find that you have fewer numbers than you require, repeat the process (and/or increase the “10%” threshhold). Now, you have to pay for one table-scan. Open a query of `select primary_key from table`; do not specify `order by`. Using a counter `i` which you increment starting at 0 each time you loop, walk through the table. When `i` equals the index at the top of the sorted list, select that record and pop the list. (Obviously, if you do have at your disposal the means to “skip forward n records in the recordset,” you should redesign this algorithm to take advantage of that. Since the indexes in the array are sorted, the amount to skip is obvious.) Repeat until the list is empty. You should not reach the end of the table first. (If you do, `die` because you have a bug.) The statistically valid random selection of records and therfore primary-keys to be processed occurs in steps 2 and 3 above. Memory consumption will be for 2 to 3 times the number of integers required for the pool, but it will only be for integers. Other strategies such as using the SQL `RAND()` function are extremely expensive for the server, as you can very plainly see from using `explain` on the queries that you have in mind. Scrolling through the dataset from the beginning, “flipping a coin each time,” will not produce an even distribution ... it will weight toward the front of the table and may never get to the rear. (If you want i random draws from a population of size m, you want to hit the random number generator about i times, not m.) In my experience, server-side stored procedure languages are usually too primitive (if they exist at all) to support this algorithm. You do unfortunately pay for “wire time” moving all those keys from one computer to another only to discard them, so consider which computer to use to do the work if that number is extremely large ... and skip through the recordset if you possibly can do so with your database system. Use `explain` against every query that you contemplate using, and do them on the actual database with its actual size, not a developer’s much-smaller proxy.
Re^2: Random data from db by mendeepak (Scribe) on Mar 03, 2012 at 05:54 UTC
thanks	[reply]


The stupid question is the question not asked
	PerlMonks