BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

I want to take a statistically valid random sample of records, from a huge file, containing variable length records. Thoughts?

  • Comment on Random sampling a variable record-length file.

Replies are listed 'Best First'.
Re: Random sampling a variable length file.
by ambrus (Abbot) on Dec 26, 2009 at 14:32 UTC

    Yes, it's called "random sample" indeed, you've got the right keyword so you just have had to search and you'd have found this excellent past thread: improving the efficiency of a script.

    (Short answer, because my reply there isn't clear: if you need a sample of k records, take an array holding k records, initialize it with the first k records of your file; then reading the rest of the file sequentially, and for each record, if its (zero-based one-based) index in the file is n, roll a dice of n sides, and if it lands on one of the firs k sides, replace the element of that index in your array with that record.)

    Update: sorry, above procedure is wrong, you've got to take a dice whose number of sides is the one-based index of the record in the file.

    To make this clearer, here's some code. Records are one per line, first command line argument is number of samples you need. I assumed throughout this node that you want samples without repetition and that the order of samples don't matter. (Before you ask, yes, I do know about $. and even use it sometimes.)

    perl -we 'my $k = int(shift); my @a; my $l = 0; while (<>) { if ($l++ +< $k) { push @a, $_ } else { if ((my $j = rand($l)) < $k) { $a[$j] = +$_; } } } print @a;' 3 filename

    Update: It's easy to make an error in these kinds of things, so you have to test them. Below shows that you get all 20 possible choices of 3 out of 6 with approximately equal frequency, so we can hope it's a truly uniform random choice.

      That's a neat method of picking a random selection. But for statistical purposes, if you have to visit every record in order to generate your sample, you might as well just apply the statistical process to the entire set and forget random sampling.

      If you seek to a random position within the file and then read 2 records discarding the first. the second will always be a complete record. That way, you can pick a 100 or 1000 element sample without visiting millions of records. But as the records are variable length, they won't all have the same chance of being picked.

      So then the question becomes:

      1. How much affect does the variability of length have upon the statistical validity of the sample?

        Can this be bounded such that the statistics remain (or become) valid?

      2. How many records should you pick?

        Given a huge file with variable length records, you don't know how many it contains. But as you gather the sample, you can more and more accurately estimate that number statistically.

        Can that be used to determine the size of the sample required?

      3. Can the affects of the variable length be compensated for?

        Can you use statistics applied to the sample gathered so far, to adjust the sampling process to correct its validity?

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Method #1: If there is no correlation between one record and the next, then reading from a random position and taking the next record after that should be fine.

        Method #2: If there are important correlations between one record and the next, then one way of dealing with that would be to reorder the entire file in random order. For instance, read the file once in order to count the number of records, N, and while you're at it, generate an array that has the offset to each record. Generate a random permutation of the integers from 1 to N. Read back through the file and pull out the records in that order, writing them to a new copy of the file. Now just use method #1 on the randomized version of the file.

        Is the file static, or is it changing a lot? If it's static, then method #2 should be fine. If it's changing all the time, and there are also correlations between successive records, then this becomes a more difficult problem. I think there are probably various ways to do it, but I suspect they all involve reinventing the wheel. Either you're going to reinvent filesystem-level support for random access to a file with varying record lengths, or you're going to reinvent a relational database. My suggestion would be to switch to a relational database. If that's not an option, and you really need to roll your own solution, then the optimal solution may depend on other details, e.g., do the changes to the file just involve steadily appending to it?

Re: Random sampling a variable record-length file.
by toma (Vicar) on Dec 27, 2009 at 04:31 UTC
    If you need statistically valid random numbers, don't use perl's rand(), it is not good for this purpose. It has patterns in it that will invalidate your results. I ran into this when I did a Monte Carlo simulation, and I was amazed at how bad the problem really is.

    I tried several other algorithms for generating random numbers. Randomness turns out to be harder than it looks.

    Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin. óJohn von Neumann

    See for true random numbers.

    It should work perfectly the first time! - toma

      Indeed. Especially bad is the default rand on win32 with just 15-bits and a very poor LCG. I use Math::Random::MT.

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Random sampling a variable record-length file.
by wfsp (Abbot) on Dec 27, 2009 at 13:13 UTC
    Perhaps a batch of records would be a valid sample of already random records? Wouldn't the first hundred be as valid a sample as any chosen by any other method?

    Rather than count the first hundred you could do as they allegedly do for the Labour vote in parts of South Wales - weigh them. In this case read 4KB worth.

    If you wanted more than one batch you could take a batch from the middle and the end too. You could do your stats on each, compare them and if there is a close enough correlation your're done.

    You could change the size and number of batches to suit the time available/accuracy required (start expensive and reduce as confidence is established).

    Likely not the answer you're looking for but my background in this sort of thing revolved around buckets of rivets rather than CSV files. :-)

      I follow your meaning, but I don't think it gells with the thoery of Normal distributions & Sampling

      buckets of rivets rather than CSV files. :-)

      I think that the distinction is that grabing a handful from a bucket of rivets does not imply any positional correlation between the elements of the sample--they tend to mix random(ish)ly as they fall into the bucket.

      Machine tools tend to wear with use, so its pretty standard practice to set-up the machine tool to operate at one end of the tolorance, so that as the tool wears, it slowly drifts towards the other end. If you took a sample entirely from the beginning of the run--or the end--then the sample would not be representative--in terms of average/mean/mode/variance--of the entire run.

      But grabbing a handful from the collection hopper where they will have tended to randomly mix should be representative.

      Similarly, a contiguous sequence from the beginning, end or middle of a file is probabilistically less likely to be a representative sample, than one picked at random from the entire file.

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Random sampling a variable record-length file.
by lowphive (Monk) on Dec 27, 2009 at 03:59 UTC
    This might sound stupid, but Donald Knuth achieved that for his book 3:16 by taking the 16th verse out of the 3rd chapter of every book of the bible.
Re: Random sampling a variable record-length file.
by pajout (Curate) on Dec 30, 2009 at 13:35 UTC
    I see following algorithm, assuming ~10G file of records of 50-150 bytes, which ordering/length can be somehow dependent:
    1. Estimate count of records in result sample: $n = 1000; 2. Estimate count of records in "primar" sample: $p = $n * 100; 3. Generate array of byte positions @pos, uniformly distributed in int +erval [1, length of file], satisfying condition $p == @pos; 4. If it is worth for seeking in the file, sort @pos. 5. my @res; my %res; 6. foreach @pos 6.1 read the record starting behind $_-th byte and remember $start, wh +ich is position of the first byte of the record in the file 6.2 next if exists $res{$start}; 6.3 $res{$start} = 1; 6.4 my $rand = some_uniform_rand()/($start - $_). It is crucial step - + IF ordering of records is somehow dependent on record length, it can + normalize it. Records following longer records have lowered probabil +ity to be inserted into @res. 6.5 if $n < @res insert {weight => $rand, rec => $record} into @res, keeping asc +ending order by weight 6.6 elsif $rand > $rec[0]{weight} shift @rec and insert {weight => $rand, rec => $record} into @r +es, keeping ascending order by weight 7. you have @res
    UPDATE: 6.5 if $n > @res