http://www.perlmonks.org?node_id=814486


in reply to Re^4: Random sampling a variable length file.
in thread Random sampling a variable record-length file.

1. There is no meaningful correlation in the ordering of the records.

How certain are you that this is true? If there is no correlation between any characteristic of interest in a record and the record's position within the file, then taking a sequential sample from an arbitrary location in the file (like the beginning) is entirely unbiased by record size. It's also a very efficient way (computationally, not statistically) to sample the file.
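To make the sequential approach concrete, here is a minimal sketch in Python, assuming newline-terminated records. The function name `sequential_sample` and its parameters are hypothetical, chosen only for illustration. It is unbiased only under the assumption stated above, that nothing of interest correlates with a record's position in the file.

```python
import io


def sequential_sample(stream, n, start_offset=0):
    """Return up to n consecutive records from a binary stream.

    Assumes newline-terminated records (an assumption, not something
    stated in the thread). start_offset is an arbitrary byte position.
    """
    stream.seek(start_offset)
    if start_offset > 0:
        # We most likely landed mid-record, so skip to the next boundary.
        # If the offset happens to sit exactly on a boundary, this drops
        # one whole record, which is harmless for sampling purposes.
        stream.readline()
    records = []
    for _ in range(n):
        line = stream.readline()
        if not line:  # end of file reached before n records were read
            break
        records.append(line.rstrip(b"\r\n"))
    return records


if __name__ == "__main__":
    demo = io.BytesIO(b"a\nbbb\ncc\ndddd\ne\n")
    print(sequential_sample(demo, 2))      # sample from the beginning
    print(sequential_sample(demo, 2, 3))   # sample from an arbitrary offset
```

The cost is one seek plus n sequential reads, regardless of file size, which is where the computational efficiency comes from.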

You ask a number of highly technical questions, like "[h]ow many records should you pick?" Answers to that question typically range from rules of thumb to equations that compute a sample size meeting some specification. Which to use depends heavily on what you are trying to do: meeting regulatory requirements is very different from monitoring operations. Can you say more about what you are trying to do?
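As one example of the "equation" end of that range, here is a sketch of the common textbook formula (often attributed to Cochran) for the sample size needed to estimate a proportion within a given margin of error, with an optional finite population correction. It may not be the right tool for your purpose; the function name and defaults are my own assumptions for illustration.

```python
import math


def sample_size_for_proportion(margin, z=1.96, p=0.5, population=None):
    """Sample size to estimate a proportion to within +/- margin.

    z is the normal quantile for the desired confidence (1.96 for about
    95%), and p is the anticipated proportion (0.5 is the most
    conservative choice). These defaults are illustrative assumptions.
    """
    # Sample size for an effectively infinite population.
    n0 = (z * z * p * (1.0 - p)) / (margin * margin)
    if population is not None:
        # Finite population correction when the file size is known.
        n0 = n0 / (1.0 + (n0 - 1.0) / population)
    return math.ceil(n0)


if __name__ == "__main__":
    print(sample_size_for_proportion(0.05))
    print(sample_size_for_proportion(0.05, population=1000))
```

Note that the result depends on the tolerated margin and confidence level far more than on the number of records in the file, which is why the purpose of the sample matters so much.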