PerlMonks  

grabbing random n rows from a file

by punkish (Priest)
on Jul 18, 2006 at 21:26 UTC ( #562135=perlquestion )

punkish has asked for the wisdom of the Perl Monks concerning the following question:

I am faced with a problem that is a variation of recipe "8.6. Picking a Random Line from a File" from the venerable cookbook ( clever use of rand($.) )

I have a file with "n" sets of "m" rows (let's assume they are sorted by the token that makes them into a set... so, if the rows have some attributes about people, the first token in each row is the name of the person, and there are "m" rows for, say 'punkish' and another "m" rows for 'paco', and so on). I want to grab random "j" rows from each set and write "n" sets of "j" rows out to another file.

I apologize that I am even unable to offer pseudo code to try and figure it out. It would be trivial to do it in a db, but I wouldn't mind knowing how to do this with just a file and the magic of Perl.

Oh! Did I mention that (n * m) is a very large number? That is, we are talking about a file with around 8 million rows.

Update: on second glance, this post should really be titled grabbing random "j" rows from a file... oh well.

Update 2: (after bonking himself on the head for not providing a "compleat" problem the first time)

  • Type of file: It is a delimited (say, CSV) file
  • Are the lines fixed length?: No, but each row has the same number of fields, just like a CSV file
  • Are there a fixed number of lones per "record"? Dunno what a "lone" is.
  • Is any of this stuff indexed? It is a text file. How could it be indexed?
  • Is this something that you need to do one off (or occasionally)? Occasionally... that is why the need for a program. But, would prefer to not use a database such as SQLite.
  • Does the "data base" change over time? Yes, periodically. But, for every run, it is one, immutable file.
  • If it changes can "records" be inserted? Can't change the input file.
  • Why isn't this in a real database? Well, too long to answer here. Eventually it ends up in a database, so this is just the preprocessing part, but it is preferred to not do the preprocessing in a database.
  • Does j change for each set, or do you want to print the same number of lines for each set? "j" doesn't change, however, the number of rows in a set may change. Incoming rows are supposed to be, say, 100 per set, and "j" is fixed at, say 90, but it is possible that a set might have only 80 rows, in which case, all 80 will be chosen. In other words, choose random "j" out of "m" if (j < m) else choose "m"
--

when small people start casting long shadows, it is time to go to bed

Replies are listed 'Best First'.
Re: grabbing random n rows from a file
by japhy (Canon) on Jul 18, 2006 at 21:57 UTC
    I think this method is accurate and fair:
    open my $some_filehandle, "<", "quotefile.txt";
    my $set_size = 3;
    my $set = random_set_of_n($some_filehandle, $set_size);

    sub random_set_of_n {
        my ($fh, $size) = @_;
        my @set;

        local ($., $_);
        seek $fh, 0, 0;

        while (<$fh>) {
            chomp;
            push @set, $_;
            last if @set == $size;
        }

        # XXX: @set *should* be shuffled now if you care about ordering

        while (<$fh>) {
            chomp;
            $set[rand @set] = $_ if $size/$. > rand;
        }

        return \@set;
    }
    I think it's a fair distribution. My tests imply it is. Update: the set should be shuffled where I've indicated. It's not necessary if you're going to be plucking elements from it at random later on, though, only if you want a randomly ordered list returned.
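
    Since the sets in the question are contiguous but vary in length, the same replace-with-probability idea could be restarted at every set boundary. The following is only a sketch under assumptions that are not in the thread: the key is taken to be the first comma-separated field, and sample_per_group and its $emit callback are hypothetical names introduced for illustration.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: streams through $fh and calls $emit->($key, \@rows)
    # once per set with at most $j randomly chosen rows of that set.
    # Assumes rows of the same set are adjacent and that the set key is the
    # first comma-separated field.
    sub sample_per_group {
        my ($fh, $j, $emit) = @_;
        my ($current, $seen, @reservoir);

        while (my $line = <$fh>) {
            my ($key) = split /,/, $line, 2;

            if (!defined $current || $key ne $current) {
                # A new set starts: hand over the previous one and reset.
                $emit->($current, [@reservoir]) if defined $current;
                $current   = $key;
                $seen      = 0;
                @reservoir = ();
            }

            $seen++;
            if (@reservoir < $j) {
                push @reservoir, $line;                        # fill phase
            }
            elsif (rand() < $j / $seen) {
                $reservoir[ int(rand(@reservoir)) ] = $line;   # replace phase
            }
        }

        # Hand over the final set, if the file was not empty.
        $emit->($current, [@reservoir]) if defined $current;
        return;
    }

    # Usage with an in-memory file: 5 rows for one person, 2 for another.
    my $data = join('', map { "punkish,$_\n" } 1 .. 5)
             . join('', map { "paco,$_\n" } 1 .. 2);
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    sample_per_group($fh, 3, sub { my ($key, $rows) = @_; print @$rows });
    ```

    Only j rows are held in memory at a time, and a set shorter than j is emitted whole, which matches the rule in "Update 2".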

    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
Re: grabbing random n rows from a file
by ikegami (Patriarch) on Jul 18, 2006 at 22:48 UTC

    The fact that your sets are grouped is a great benefit. We can work with a set at a time.

    The fact that your sets are of varying length is a great hindrance. Work needs to be done to locate the end of each set.

    I made the following assumptions:

    • m (and thus j) is rather small. Specifically, keeping m lines in memory is not a problem. ( Confirmed in "Update 2". )
    • m is not the same for every set. ( Confirmed in "Update 2". )
    • You don't want the same random j lines from every set. It's a minor change if you do.
    • You don't care if the random j lines are in their original order. It's a minor change if you do.

    My solution:

    use strict;
    use warnings;

    use List::Util qw( shuffle );

    my $j = 90;

    sub extract_id {
        my ($line) = @_;
        ...
        return ...;
    }

    my @m;
    my $id;
    my $last_id;
    for (;;) {
        my $line = <DATA>;
        $id = extract_id($line) if defined($line);

        if (@m) {
            if (!defined($line) || $id ne $last_id) {
                my $j = $j < @m ? $j : @m;
                print $m[$_] foreach (shuffle(0..$#m))[0..$j-1];
                @m = ();
            }
        }

        last if !defined($line);

        push(@m, $line);
        $last_id = $id;
    }

    Untested. (Update: Tested. Fixed. )

    Memory can be saved by storing file positions in @m instead of the actual lines, but that's not needed based on your "Update 2".

    Alternative:
    print splice(@m, rand(@m), 1) while $j--;
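
    Used on its own, the one-liner relies on $j already being capped at the set size, and it both empties @m and decrements $j. A self-contained variant might look like the sketch below; pick_random is a hypothetical helper name, not something from the post.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: returns up to $j randomly chosen elements of @$rows
    # without repeats. If the set has fewer than $j rows, all are returned.
    sub pick_random {
        my ($j, $rows) = @_;
        my @pool = @$rows;                     # copy, so the caller's array survives
        my $take = $j < @pool ? $j : @pool;    # cap at the set size
        my @picked;
        push @picked, splice(@pool, int(rand(@pool)), 1) for 1 .. $take;
        return @picked;
    }

    # Usage: two random rows out of a three-row set.
    print pick_random(2, [ "foo 1 2 3\n", "foo 2 3 4\n", "foo 4 5 6\n" ]);
    ```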

Re: grabbing random n rows from a file
by GrandFather (Saint) on Jul 18, 2006 at 21:46 UTC

    Are the lines fixed length? Are there a fixed number of lines per "record"? Is any of this stuff indexed? Is this something that you need to do one off (or occasionally)? Does the "data base" change over time? If it changes can "records" be inserted? Why isn't this in a real database?

    Without knowing the answers to any of the above here are a couple of approaches:

    • do an indexing pass through the file to build a hash keyed by "person" containing the file position of the start of the record and positions of lines in the record. Seek through the file pulling out randomly selected j lines from each record as required.
    • Scan through the file identifying record starts and create an array of lines within the current record. At the end of each record spit out j random lines as required from the array.

    The first variant is more appropriate if you want to do this multiple times without the "data base" changing between times - keep the index. The second variant is more appropriate for one-off use or if the contents of the "data base" are changing and maintaining an index is not feasible.
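
    The indexing variant could be sketched roughly as follows, using Perl's built-in tell and seek. The helper names build_index and random_rows are hypothetical, and the key is assumed to be the first comma-separated field, which the thread does not spell out. For a file of around 8 million rows the index holds one offset per row, so whether it fits comfortably in memory is also an assumption.

    ```perl
    use strict;
    use warnings;
    use List::Util qw( shuffle );

    # Hypothetical helper: one pass over $fh, mapping each key (assumed to be
    # the first comma-separated field) to the byte offsets of its rows.
    sub build_index {
        my ($fh) = @_;
        my %index;
        seek $fh, 0, 0 or die "Cannot seek: $!";
        my $pos = tell $fh;
        while (my $line = <$fh>) {
            my ($key) = split /,/, $line, 2;
            push @{ $index{$key} }, $pos;
            $pos = tell $fh;                   # start of the next row
        }
        return \%index;
    }

    # Hypothetical helper: fetches $j random rows for $key (all rows if the
    # set has fewer than $j), returned in their original file order.
    sub random_rows {
        my ($fh, $index, $key, $j) = @_;
        my @offsets = shuffle @{ $index->{$key} || [] };
        splice @offsets, $j if @offsets > $j;  # keep only the first $j offsets
        my @rows;
        for my $offset (sort { $a <=> $b } @offsets) {
            seek $fh, $offset, 0 or die "Cannot seek: $!";
            push @rows, scalar <$fh>;
        }
        return @rows;
    }

    # Usage with an in-memory file.
    my $data = join('', map { "punkish,$_\n" } 1 .. 6)
             . join('', map { "paco,$_\n" } 1 .. 2);
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    my $index = build_index($fh);
    print random_rows($fh, $index, $_, 4) for sort keys %$index;
    ```

    Sorting the chosen offsets before seeking keeps the reads moving forward through the file; drop the sort if the output order should be random as well.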

    Update: s/lones/lines/


    DWIM is Perl's answer to Gödel
Re: grabbing random n rows from a file
by Hue-Bond (Priest) on Jul 18, 2006 at 21:58 UTC

    This reads $j lines, then skips until $m is reached, then starts reading again. Prints to STDOUT, not to a new file. But that isn't your main concern, is it? ;^).

    Update: Oops, missed the "random" bit. Let's try this:

    #!/usr/bin/perl

    use warnings;
    use strict;
    use List::Util qw/shuffle/;

    my $m = 3;
    my $j = 2;

    MAIN: while (1) {
        my @set;
        for (my $i = 0; $i < $m ; $i++) {
            last MAIN unless defined ($_ = <DATA>);
            push @set, $_;
        }
        for ((shuffle @set)[0..$j-1]) {
            print;
        }
    }

    __DATA__
    foo 1 2 3
    foo 2 3 4
    foo 4 5 6
    bar a b c
    bar b c d
    bar c d e
    baz q w e
    baz w e r
    baz e r t

    --
    David Serrano

      According to the OP's "Update 2", m is neither known nor constant. japhy made the same incorrect assumption above.
Re: grabbing random n rows from a file
by perrin (Chancellor) on Jul 18, 2006 at 21:50 UTC
Re: grabbing random n rows from a file
by swkronenfeld (Hermit) on Jul 18, 2006 at 21:53 UTC
    Does j change for each set, or do you want to print the same number of lines for each set? It sounds like the number of lines for each set is the same. Assuming that you want to print the same number of lines for each set, then do the following:

    -read an entire set, let $m = the number of lines read
    -let $j = int(rand($m))+1
    -print out $j lines of the first set that you've already read
    -now read the file line by line, buffering the input, and after you've read $m lines, print out the first $j lines for each set, and throw away the remaining ($m-$j) lines. The result is j random lines from each set. Then you can throw away your buffer and start over.

    As for the size of the file, just don't slurp the entire file into memory; read it line by line.

    Have I answered your question? I have a feeling that I'm oversimplifying it. If so, please give a few more details.

    Updated
Re: grabbing random n rows from a file
by ambrus (Abbot) on Jul 19, 2006 at 08:51 UTC
Re: grabbing random n rows from a file
by Anonymous Monk on Jul 19, 2006 at 18:41 UTC
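    #!/usr/bin/perl
    use strict;
    use warnings;

    my ($fh, $href, $aref, $group, $array);

    open($fh, "<", "/tmp/kinput") or die "Cannot open /tmp/kinput: $!";
    $href = {};

    # Group the rows by their first field.
    while (<$fh>) {
        $aref = [split(/,/, $_)];
        push(@{$$href{$$aref[0]}}, $aref);
    }

    # Print one randomly chosen row from each group.
    while ( ($group, $array) = each(%$href) ) {
        # rand(@$array) covers every index; rand($#$array) never picks the last row
        $aref = $$array[rand(@$array)];
        print join(",", @$aref);
    }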
    You can reduce memory consumption by working on a "set" at a time versus building a massive hash, but you'll need to define your sets beforehand and use grep or something else to get lines from the set into the hash. This will of course drive cpu usage up.
