Re: Random sampling a variable length file.

Yes, it's indeed called a "random sample". You've got the right keyword, so a search would have found this excellent past thread: improving the efficiency of a script.

(Short answer, because my reply there isn't clear: if you need a sample of k records, take an array holding k records and initialize it with the first k records of your file. Then read the rest of the file sequentially, and for each record, if its (~~zero-based~~ one-based) index in the file is n, roll a die with n sides; if it lands on one of the first k sides, replace the array element at the index the die shows with that record.)

Update: sorry, the above procedure was wrong; you've got to take a die whose number of sides is the one-based index of the record in the file.
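As a sanity check on the corrected procedure, here is a short derivation. The symbol N (total number of records in the file) is introduced here for illustration and does not appear in the post. It shows only that each record ends up in the sample with probability k/N, which is necessary for a uniform choice but does not by itself prove that every set of k records is equally likely.

```latex
% N = total number of records, k = sample size, i = one-based index of a record.
% At step n > k, a given array slot is overwritten with probability (k/n)(1/k) = 1/n.
\[
P(\text{record } i \le k \text{ survives to the end})
  = \prod_{n=k+1}^{N}\left(1-\frac{1}{n}\right)
  = \prod_{n=k+1}^{N}\frac{n-1}{n}
  = \frac{k}{N}
\]
\[
P(\text{record } i > k \text{ is in the final sample})
  = \frac{k}{i}\prod_{n=i+1}^{N}\left(1-\frac{1}{n}\right)
  = \frac{k}{i}\cdot\frac{i}{N}
  = \frac{k}{N}
\]
```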

To make this clearer, here's some code. Records are one per line, and the first command-line argument is the number of samples you need. I assumed throughout this node that you want samples without repetition and that the order of the samples doesn't matter. (Before you ask, yes, I do know about $. and even use it sometimes.)

perl -we 'my $k = int(shift); my @a; my $l = 0; while (<>) { if ($l++ < $k) { push @a, $_ } else { if ((my $j = rand($l)) < $k) { $a[$j] = $_; } } } print @a;' 3 filename

Update: It's easy to make an error in these kinds of things, so you have to test them. The run below shows that you get all 20 possible choices of 3 out of 6 with approximately equal frequency, so we can hope it's a truly uniform random choice.

$ cat a
one
two
three
four
five
six
$ (for x in {1..33333}; do perl -we 'my $k = int(shift); my @a; my $l = 0; while (<>) { if ($l++ < $k) { push @a, $_ } else { if ((my $j = rand($l)) < $k) { $a[$j] = $_; } } } print @a;' 3 a | sort | tr \\n \  ; echo; done) | sort | uniq -c | sort -rn
   1747 five four three 
   1736 five one six 
   1735 five three two 
   1725 four three two 
   1707 one six three 
   1695 five four two 
   1685 five six three 
   1684 five six two 
   1678 one three two 
   1666 five four six 
   1663 four six two 
   1663 four one six 
   1663 five four one 
   1645 four one two 
   1640 four six three 
   1637 five one three 
   1616 five one two 
   1592 six three two 
   1578 one six two 
   1578 four one three 
$
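The same check can be run inside a single Perl process instead of spawning perl once per trial. This is only a sketch: the subroutine name reservoir_sample and the trial count of 20,000 are choices made for this example, and the sampling logic is the one-liner above rewritten as a subroutine.

```perl
use strict;
use warnings;

# Pick $k items from @$items in one pass (same logic as the one-liner).
sub reservoir_sample {
    my ($k, $items) = @_;
    my @sample;
    my $seen = 0;                       # becomes the one-based index after ++
    for my $item (@$items) {
        if ($seen++ < $k) {
            push @sample, $item;        # fill the array with the first $k items
        }
        elsif ((my $j = rand($seen)) < $k) {
            $sample[$j] = $item;        # fractional index is truncated to an integer
        }
    }
    return @sample;
}

my @records = qw(one two three four five six);
my $trials  = 20_000;                   # arbitrary choice for this sketch
my %count;
for (1 .. $trials) {
    my @picked = reservoir_sample(3, \@records);
    my $key = join ' ', sort @picked;   # order within a sample does not matter
    $count{$key}++;
}

# Print each combination with its frequency, most frequent first.
printf "%7d %s\n", $count{$_}, $_
    for sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
```

With 20 possible combinations, each count should come out near one twentieth of the trial count if the choice is uniform.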


Re^2: Random sampling a variable length file. by BrowserUk (Patriarch) on Dec 26, 2009 at 17:24 UTC
That's a neat method of picking a random selection. But for statistical purposes, if you have to visit every record in order to generate your sample, you might as well just apply the statistical process to the entire set and forget random sampling.

If you seek to a random position within the file and then read 2 records, discarding the first, the second will always be a complete record. That way, you can pick a 100 or 1000 element sample without visiting millions of records. But as the records are variable length, they won't all have the same chance of being picked.

So then the question becomes: How much effect does the variability of length have upon the statistical validity of the sample? Can this be bounded such that the statistics remain (or become) valid?

How many records should you pick? Given a huge file with variable length records, you don't know how many it contains. But as you gather the sample, you can more and more accurately estimate that number statistically. Can that be used to determine the size of the sample required?

Can the effects of the variable length be compensated for? Can you use statistics applied to the sample gathered so far to adjust the sampling process to correct its validity?

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "I'd rather go naked than blow up my ass"
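A minimal sketch of the seek-and-discard idea described above, assuming newline-terminated records. The subroutine name random_record_by_seek is made up for this example, and wrapping around to the first record when the random offset lands in the last one is an assumption of the sketch, not something the post specifies.

```perl
use strict;
use warnings;

# Hypothetical helper: return one complete record from $path without
# reading the whole file.
sub random_record_by_seek {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $size = -s $fh;
    return undef unless $size;                 # empty file: nothing to pick

    # Jump to a uniformly random byte offset.
    seek($fh, int(rand($size)), 0) or die "seek failed: $!";

    my $discarded = <$fh>;                     # rest of the record we landed in
    my $record    = <$fh>;                     # the next record is complete

    # Assumption: if we landed in the last record, wrap to the first one.
    unless (defined $record) {
        seek($fh, 0, 0) or die "seek failed: $!";
        $record = <$fh>;
    }
    close $fh;
    return $record;
}

# Usage: perl script.pl filename   (prints three picks, possibly repeated)
if (@ARGV) {
    print random_record_by_seek($ARGV[0]) for 1 .. 3;
}
```

Repeated calls can return the same record more than once, so a caller wanting samples without repetition would need to track what has already been picked.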
Re^3: Random sampling a variable length file. by bcrowell2 (Friar) on Dec 26, 2009 at 18:19 UTC
Method #1: If there is no correlation between one record and the next, then reading from a random position and taking the next record after that should be fine.

Method #2: If there are important correlations between one record and the next, then one way of dealing with that would be to reorder the entire file in random order. For instance, read the file once in order to count the number of records, N, and while you're at it, generate an array that has the offset to each record. Generate a random permutation of the integers from 1 to N. Read back through the file and pull out the records in that order, writing them to a new copy of the file. Now just use method #1 on the randomized version of the file.

Is the file static, or is it changing a lot? If it's static, then method #2 should be fine. If it's changing all the time, and there are also correlations between successive records, then this becomes a more difficult problem. I think there are probably various ways to do it, but I suspect they all involve reinventing the wheel. Either you're going to reinvent filesystem-level support for random access to a file with varying record lengths, or you're going to reinvent a relational database. My suggestion would be to switch to a relational database. If that's not an option, and you really need to roll your own solution, then the optimal solution may depend on other details, e.g., do the changes to the file just involve steadily appending to it?
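Method #2 could be sketched roughly as follows. The subroutine name shuffle_records is invented for this example, the sketch assumes every record (including the last) ends with a newline, and it shuffles the array of offsets directly with List::Util's shuffle rather than generating a separate permutation of 1 to N.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper: copy $in to $out with its records in random order.
# Returns the number of records copied.
sub shuffle_records {
    my ($in, $out) = @_;
    open my $src, '<', $in  or die "Cannot open $in: $!";
    open my $dst, '>', $out or die "Cannot open $out: $!";

    # Pass 1: remember the byte offset at which each record starts.
    my @offsets;
    my $pos = tell $src;
    while (defined(my $line = <$src>)) {
        push @offsets, $pos;
        $pos = tell $src;
    }

    # Pass 2: visit the offsets in random order and copy each record.
    for my $offset (shuffle @offsets) {
        seek($src, $offset, 0) or die "seek failed: $!";
        my $record = <$src>;
        print {$dst} $record;
    }

    close $src;
    close $dst or die "Cannot close $out: $!";
    return scalar @offsets;
}

# Usage: perl script.pl input output
if (@ARGV == 2) {
    my $n = shuffle_records(@ARGV);
    print "copied $n records\n";
}
```

The offset array costs one number per record, so this only fits in memory when the record count is manageable, and both passes still touch the whole file.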
Re^4: Random sampling a variable length file. by BrowserUk (Patriarch) on Dec 26, 2009 at 18:39 UTC
There is no meaningful correlation in the ordering of the records.

The problem with picking random (byte) positions is that with variable length records, longer records have a greater chance of being picked than shorter ones. But maybe that is negated to some extent because you would be using the next record--which might be longer or shorter--rather than the one picked?

The file is static. It is only processed once. It is often huge. Time is of the essence. Reading the whole file to pick a sample negates the purpose of picking a sample.
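One way to look at the question of bias is to compute, for a given list of record lengths, how likely each record is to be returned by "seek to a random byte, discard that record, take the next one". This is a sketch under stated assumptions: byte offsets are uniform, lengths include the newline, landing in the last record wraps around to the first, and the name seek_pick_probabilities and the example lengths are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: given record lengths in bytes, return the chance
# of each record being the one returned by seek-and-discard.
sub seek_pick_probabilities {
    my @len = @_;
    my $total = 0;
    $total += $_ for @len;
    # Record $i is returned when the offset lands anywhere in record $i - 1,
    # so its chance is the length of the PRECEDING record over the file size.
    # For $i == 0, index -1 is the last record (the wrap-around assumption).
    return map { $len[$_ - 1] / $total } 0 .. $#len;
}

my @len = (2, 50, 3, 3);                # made-up lengths for illustration
my @p   = seek_pick_probabilities(@len);
printf "record %d (length %2d): %.3f\n", $_, $len[$_], $p[$_] for 0 .. $#len;
```

Under these assumptions a record's chance depends on the length of the record before it, not on its own length. So if the lengths of neighbouring records really are uncorrelated, the pick is not skewed toward long or short records, although individual records still have unequal chances.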
Re^5: Random sampling a variable length file. by bobf (Monsignor) on Dec 26, 2009 at 22:05 UTC
Re^6: Random sampling a variable length file. by BrowserUk (Patriarch) on Dec 27, 2009 at 00:54 UTC
Re^6: Random sampling a variable length file. by bcrowell2 (Friar) on Dec 26, 2009 at 22:14 UTC
Re^5: Random sampling a variable length file. by bcrowell2 (Friar) on Dec 26, 2009 at 18:43 UTC
Re^5: Random sampling a variable length file. by eye (Chaplain) on Dec 27, 2009 at 07:02 UTC
Re^6: Random sampling a variable length file. by BrowserUk (Patriarch) on Dec 27, 2009 at 09:37 UTC


	PerlMonks