Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

A series of random number and others

by lightoverhead (Monk)
on Oct 09, 2008 at 00:17 UTC ( #716106=perlquestion: print w/ replies, xml ) Need Help??
lightoverhead has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I tried to randomly select 20 million lines from a 40 million lines file.

I have two questions.

First, how to select the 20 million lines from the file without bias. I have tried to use perl. But it seems perl is not that good for such thing, so I used R to generate 20 million index to create an index file (rand_sorted.txt) and use it to print the selected lines.

Second, my code is as below:

#!/usr/bin/perl -w use strict; open RND, "rand_sorted.txt" or die "no file found"; my @rnd; @rnd = grep {chomp;} <RND>; close RND; WILLOP: while (<>){ chomp; FORLOP: foreach my $i (@rnd){ if ($i != $rnd[$#rnd]){ if ($. == $i){ print "$_\n"; shift @rnd; last FORLOP; } else { print "$_\n"; last FORLOP; } }

This works fine and fast for me, but I just hate these two loops nested each other, it looks stupid, or maybe it's stupid. Any one can give me an idea how to do it? Thanks.

Comment on A series of random number and others
Download Code
Re: A series of random number and others
by blokhead (Monsignor) on Oct 09, 2008 at 01:02 UTC
    I just posted a simple method to Randomly select N lines from a file, on the fly, as a snippet since I wasn't able to find it anywhere else here on the site. Perhaps I just missed it -- hope I haven't duplicated any effort. It is a generalization of the "usual" trick for choosing 1 random line from a file, on the fly.

    blokhead

Re: A series of random number and others
by GrandFather (Cardinal) on Oct 09, 2008 at 01:25 UTC

    Yep, that's pretty stupid code all right. Actually we seem to be running about one or two questions a week that look like that so don't feel too bad. The answer is generally the same: use a hash. However in this case there is a twist: "20 million lines from 40 million" is a fairly bif hash. But there may be a smarter solution: if you don't need exactly 20 million lines, then you can:

    rand () < 0.5 and print while <>;

    If you need the hash then your current code changes to something like:

    open my $rndLines, '<', "rand_sorted.txt" or die "Can't open rand_sort +ed.txt: $!"; my %lines = map {$_ => undef} grep {chomp; length} <$rndLines>; close $rndLines; exists $lines{$.} and print while <>;

    I find it interesting that you couldn't generate a random number list using Perl however. What was the issue that you struck? Are you aware of Math::Random::MT btw?


    Perl reduces RSI - it saves typing
Re: A series of random number and others
by drblove27 (Sexton) on Oct 09, 2008 at 01:49 UTC
    Well for a straight forward way Perl (at least to my non-expert Perl skills) way, you can do this:
    @lines_of_code # Say this is your 40,000,000 lines of code my $i; my @rlc; ### Random lines of code for ($i = 0; $i < 20000000; $i++){ $rlc[$i] = $loc[int(rand(40000000)+1)]; }
    That should do the trick, may not be the fastest thing in the world...

      Apart from not being what the OP wants and not being Perl so much a transliterated C, that code actually demonstrates the problem the OP was most likely to have had with the default Perl rand. Consider:

      use strict; use warnings; my %randLines; ++$randLines{1 + int rand 40_000_000} for 1 .. 20_000_000; my @hits = map {[$_, $randLines{$_}]} sort {$randLines{$b} <=> $randLines{$a}} keys %randLines; print scalar @hits, " different lines selected\n"; print "Line $_->[0] hit $_->[1] times\n" for @hits[0 .. 9];

      Prints:

      32768 different lines selected Line 3532715 hit 724 times Line 24512940 hit 718 times Line 20959473 hit 712 times Line 28502198 hit 705 times Line 4688721 hit 704 times Line 37175293 hit 700 times Line 26921387 hit 699 times Line 28406983 hit 699 times Line 3172608 hit 696 times Line 31093751 hit 695 times

      Note in particular that only 32768 (very close to 215 btw) different lines were selected out of the 20,000,000 the OP was after!

      The actual results will vary with the specific build of Perl with the result shown being about the worst you are likely to encounter and are because the rand for the build of Perl used to run the sample uses only a 15 bit value. Much better results are obtained if you add the line use Math::Random::MT qw(rand srand); to the sample, but be prepared to wait a while and make sure your system has lots of memory available.


      Perl reduces RSI - it saves typing

        Thanks GrandFather. Your understanding is right, these lines must not be duplicate. I just began to get used to perl,and am not familiar with these modules. It seems I have to study these modules thoroughly.

        As your suggestions of using hash or array, I have tried, but I encountered the memory problem. My machine can not even slurp 40 million lines into an array. That's why I create this index file first.

        So, under such a condition considering memory limitation (3GB). what could be the fastest resolution for such a case? Thank you.

        32768 (very close to 215 btw)

        How close do you need to be?

Re: A series of random number and others
by BrowserUk (Pope) on Oct 09, 2008 at 07:20 UTC

    The problem with Knuth method is that it requires you to store 20million lines in memory which will take pretty much the full 2GB maximum memory generally available, if your lines average a modest 50 chars in length. If they average longer, or your file is bigger, you've blown it.

    This code

    1. Generates a packed list of 40e6 integers (315MB) (40e6*4*2 (because of the way Perl initialises scalars)).
    2. F-Y shuffles that list in-place (+0MB) 2 minutes.
    3. "Sorts" the first 20e6 of the shuffled integers by building a bitvector. (+5MB) 30 seconds.
    4. Reads the file and print a line if it bit is set in the bitvector.

    In total, <320MB and ~2.5 minutes on my machine.

    #! perl -slw use strict; use Math::Random::Mt qw[ rand ]; our $N ||= 40e6; our $K ||= $N / 2; ## In-place FY shuffle of a packed ULONG array sub shuffle { my $ref = shift; my $n = length( $$ref ) >> 2; for( 0 .. $n ) { my $p = $_ + int( rand $n - $_ ); my $tmp = substr $$ref, $_*4, 4; substr $$ref, $_*4, 4, substr $$ref, $p*4, 4; substr $$ref, $p*4, 4, $tmp; } return; } ## Allocate a packed array of $N Ulongs my $index = \ (chr(0) x ( 4 * $N )); ## Populate it substr $$index, $_*4, 4, pack 'V', $_ for 0 .. $N - 1; ## Shuffle in place shuffle $index; ## Allocate a bit-vector for linear sort my $ordered = chr(0) x int( ( $N +7 ) / 8 ); ## Linear sort vec( $ordered, unpack( 'V', substr $$index, $_*4, 4 ), 1 ) = 1 for 0 .. $K - 1; ## Read the file line by line while( <> ) { ## print it if it was picked print if vec( $ordered, $., 1 ); }

    The nice thing is that it will comfortably extend linearly to handle selecting any number of lines from files of upto 250 million lines within a 2GB process.

    With a little hackery to avoid the Perl memory doubling scalar initialisation, it could be extended to handle any number from a 500 million line file.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: A series of random number and others
by cdarke (Prior) on Oct 09, 2008 at 08:30 UTC
    I admit my solution is not mathematically pure, and is similar to Grandfather's first suggestion. Rather than storing up all those numbers, my approach is to note that we have to choose half of the lines. So do we use a line or not? 50/50 chance. That could easily give a shortfall, but I have a hack for that:
    use warnings; use strict; my $limit = 40_000_000; my $how_many = $limit/2; my $hits = 0; open (my $handle, 'rand_sorted.txt') or die "Unable to open rand_sorte +d.txt: $!"; # Using a C style loop to avoid a large temp list (Grandfather) for (my $i = 0; $i < $limit && $hits < $how_many; $i++) { my $record = <$handle>; if (rand(2) > 1 || ($limit-$i) <= $hits ) { $hits++; # Do what you must to the record print "$i: $record\n"; } } close ($handle);

      Perl is smart about for (1 .. $count) and doesn't generate a list so you don't need to resort to nasty C for loops, at least in that case. ;)


      Perl reduces RSI - it saves typing
Re: A series of random number and others
by salva (Abbot) on Oct 09, 2008 at 08:46 UTC
    The rand built-in is not the only random number generator available from Perl. There are several CPAN modules implementing different algorithms, for instance: Math::GSL::RNG or Math::RngStream.
Re: A series of random number and others (10MB / 8seconds)
by BrowserUk (Pope) on Oct 09, 2008 at 10:24 UTC

    This version fairly picks (exactly) 50% from 40e6 in 8 seconds*(10MB); 50% from 100e6 in 22 seconds*(25MB); 50% from 1e9 in under 4 minutes*(1/2GB). It can handle all the way up to 25e9 using <3GB in under 2 hours*:

    (*)plus the time taken to read the file.

    #! perl -slw use strict; use Math::Random::MT qw[ rand ]; use constant FOUR_GB => 2**32; $|++; our $N ||= 40e6; our $K ||= $N / 2; ## A bit vector big enough to hold any number in the range my $vector = chr(0) x (( $N + 7 ) / 8); ## Populate it with random bit values ## As we are after 50% this will get us very close very quickly substr $vector, $_* 4, 4, pack( 'V', rand( FOUR_GB ) ) for 0 .. ( length( $vector ) / 4 ) -1; ## How many did we actually pick? my $nBits = unpack '%32b*', $vector; ## Now (fairly) adjust up or down accordingly while( $nBits++ < $K ) { ## pick bit to set my $r = int( rand $N ); ## If it is already set discount it $nBits -= vec( $vector, $r, 1 ); ## and set it anyway vec( $vector, $r, 1 )=1; } while( $nBits-1 > $K ) { ## pick bit to clear my $r = int( rand $N ); ## If it is set discount it $nBits -= vec( $vector, $r, 1 ); ## and clear it anyway vec( $vector, $r, 1 ) = 0; } ## Read the file line by line $\=undef; while( <> ) { ## print it if it was picked print if vec( $vector, $., 1 ); }

    It can also handle ratios other than 50%, but will start getting far less efficient as you move further away from 50% (up or down).


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://716106]
Approved by Lawliet
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others having an uproarious good time at the Monastery: (6)
As of 2014-12-20 07:58 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (95 votes), past polls