Beefy Boxes and Bandwidth Generously Provided by pair Networks Cowboy Neal with Hat
laziness, impatience, and hubris
 
PerlMonks  

Random data from db

by mendeepak (Beadle)
on Mar 01, 2012 at 07:02 UTC ( #957114=perlquestion: print w/ replies, xml ) Need Help??
mendeepak has asked for the wisdom of the Perl Monks concerning the following question:

Hi i would like to know is there way to retrieve 100 random data from MySQL db using Perl

Comment on Random data from db
Re: Random data from db
by Anonymous Monk on Mar 01, 2012 at 07:42 UTC

    Yes, there is a way to retrieve random data from MySQL db using Perl, see Tutorials

      i didn't got any answer in Tutorial that is why posted the question...

        Using DBI you can perform SQL queries. Create a SQL query which returns the data you wish, execute the query in Perl using DBI.

Re: Random data from db
by rovf (Priest) on Mar 01, 2012 at 10:46 UTC
Re: Random data from db
by JavaFan (Canon) on Mar 01, 2012 at 11:03 UTC
    Untested:
    my $dbh = ... get handle to database ...; my $sth = $dbh->prepare("SELECT * FROM TABLE"); $sth->execute; my @buffer; my $seen; while (my $row = $sth->selectrow_arrayref) { $seen++; if (@buffer < 100) { push @buffer, $row; next; } my $index = int rand $seen; if ($index < @buffer) { $buffer[$index] = $row; } }
Re: Random data from db
by sundialsvc4 (Monsignor) on Mar 01, 2012 at 14:16 UTC

    The most-appropriate response depends on the volume of data.   You don’t want to bring all of the keys into memory, and you don’t want to have to do a full table scan, and you do want a statistically valid random sample.   Unfortunately you can’t have all these things at once.   The best strategy I have come up with so far (and having done this many times) is as follows:

    1. Obtain the number of rows in the table; call that number m.
    2. Generate random numbers in the range [0..n) through int(rand(n)).   Generate about 10% more numbers than you you will need, to allow for duplicates.   (The smaller the table, the more likely there will be dupes.)
    3. Sort this list and remove dupes with sort uniq.   If you find that you have fewer numbers than you require, repeat the process (and/or increase the “10%” threshhold).
    4. Now, you have to pay for one table-scan.   Open a query of select primary_key from table; do not specify order by.
    5. Using a counter i which you increment starting at 0 each time you loop, walk through the table.   When i equals the index at the top of the sorted list, select that record and pop the list.   (Obviously, if you do have at your disposal the means to “skip forward n records in the recordset,” you should redesign this algorithm to take advantage of that.   Since the indexes in the array are sorted, the amount to skip is obvious.)
    6. Repeat until the list is empty.   You should not reach the end of the table first.   (If you do, die because you have a bug.)

    The statistically valid random selection of records and therfore primary-keys to be processed occurs in steps 2 and 3 above.   Memory consumption will be for 2 to 3 times the number of integers required for the pool, but it will only be for integers.

    Other strategies such as using the SQL RAND() function are extremely expensive for the server, as you can very plainly see from using explain on the queries that you have in mind.   Scrolling through the dataset from the beginning, “flipping a coin each time,” will not produce an even distribution ... it will weight toward the front of the table and may never get to the rear.   (If you want i random draws from a population of size m, you want to hit the random number generator about i times, not m.)   In my experience, server-side stored procedure languages are usually too primitive (if they exist at all) to support this algorithm.   You do unfortunately pay for “wire time” moving all those keys from one computer to another only to discard them, so consider which computer to use to do the work if that number is extremely large ... and skip through the recordset if you possibly can do so with your database system.   Use explain against every query that you contemplate using, and do them on the actual database with its actual size, not a developer’s much-smaller proxy.

      thanks

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://957114]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (12)
As of 2014-04-16 11:59 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (425 votes), past polls