http://www.perlmonks.org?node_id=389952


in reply to Strategy for randomizing large files via sysseek

Here's my divide-and-conquer approach: Split the incoming lines randomly among 50 tempfiles. Recurse on each tempfile. Once a tempfile holds 1000 lines or fewer, shuffle it in memory. Spit out the shuffled tempfiles sequentially.

Presumably, adjusting the knobs ($SPREAD and $TERMINATE) upward to match your system will improve performance.

I believe this produces a fair shuffle (assuming shuffle() and rand are themselves fair), uses constant memory, runs in O(n log n) time, and uses "O(n)" disk. (I may be wrong about these -- comments welcome.) The big advantage is that there is no random seeking, so you get the benefit of caches/buffers.
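For a rough sense of where the O(n log n) comes from: each recursion level reads and writes every line once, and the expected depth is about log_50(n / 1000) with the defaults above. For example, a billion-line input needs ceil(log_50(10^9 / 1000)) = ceil(log_50(10^6)) = 4 levels, i.e. roughly four sequential passes over the data.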

#!/usr/bin/perl
use strict;
use warnings;

# Shuffle unlimited number of lines from ARGV to STDOUT
# using tempfiles

my $SPREAD    = 50;      # number of tempfiles per level
my $TERMINATE = 1000;    # shuffle in memory at or below this many lines

use File::Temp qw( tempfile );
use IO::File;
use List::Util qw( shuffle );

sub shuffle_lines {
    my ( $infh, $outfh ) = @_;

    my @temp  = map { scalar tempfile } 1 .. $SPREAD;
    my @count = (0) x $SPREAD;

    # Deal each input line into a randomly chosen tempfile
    while (<$infh>) {
        my $i = int rand $SPREAD;
        $temp[$i]->print($_);
        $count[$i]++;
    }

    for my $i ( 0 .. $#count ) {
        my $fh = $temp[$i];
        seek $fh, 0, 0;    # rewind the tempfile before reading it back
        if ( $count[$i] <= $TERMINATE ) {
            print $outfh shuffle <$fh>;    # small enough: shuffle in memory
        }
        else {
            shuffle_lines( $fh, $outfh );  # still too big: recurse
        }
    }
}

shuffle_lines( \*ARGV, \*STDOUT );
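To try it, save the code as, say, shuffle.pl (the name is just a placeholder); it reads any files named on the command line (or STDIN) via ARGV and writes the shuffled lines to STDOUT:

    perl shuffle.pl big_input.txt > shuffled.txt
    some_producer | perl shuffle.pl > shuffled.txt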