Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?

Re: Strategy for randomizing large files via sysseek

by TilRMan (Friar)
on Sep 10, 2004 at 03:39 UTC ( #389952=note: print w/replies, xml ) Need Help??

in reply to Strategy for randomizing large files via sysseek

Here's my divide-and-conquer approach: Split incoming lines randomly among 50 tempfiles. Recurse on the tempfiles. When the tempfiles are smaller than 1000 lines, shuffle in memory. Spit out the shuffled tempfiles sequentially.

Presumably, adjusting the knobs (higher) to match your system will improve performance.

I believe that this is a fair randomness (assuming fairness of shuffle() and rand), uses constant memory, runs in O(n log n) time, and uses "O(n)" disk. (I may be wrong about these -- comments welcome.) The big advantage is that there is no random seeking, so you get the benefit of cachesbuffers.

#!/usr/bin/perl use strict; use warnings; # Shuffle unlimited number of lines from ARGV to STDOUT # using tempfiles my $SPREAD = 50; my $TERMINATE = 1000; use File::Temp qw( tempfile ); use IO::File; use List::Util qw( shuffle ); sub shuffle_lines { my ( $infh, $outfh ) = @_; my @temp = map { scalar tempfile } 1 .. $SPREAD; my @count = (0) x $SPREAD; while (<$infh>) { my $i = int rand $SPREAD; $temp[$i]->print($_); $count[$i]++; } for my $i ( 0 .. $#count ) { my $fh = $temp[$i]; seek $fh, 0, 0; if ( $count[$i] <= $TERMINATE ) { print $outfh shuffle <$fh>; } else { shuffle_lines( $fh, $outfh ); } } } shuffle_lines( \*ARGV, \*STDOUT );

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://389952]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (4)
As of 2017-04-25 21:59 GMT
Find Nodes?
    Voting Booth?
    I'm a fool:

    Results (465 votes). Check out past polls.