Re^7: Random sampling a variable length file.
by bobf (Monsignor) on Dec 27, 2009 at 03:43 UTC
"creating an offset index requires reading the entire file and negates the purpose of taking a random sample"
Yes, creating the index would require reading the whole file (although I would imagine this could be done very quickly in C). If the point of the random sample was to avoid reading the whole file, then I would also agree that creating the index would negate the purpose of the random sample. However, without trying to split hairs or begin a thread that in the grand scheme of things is moot, I would contend that creating an index may be important in other use cases (not necessarily yours)*. For example, an index would be worthwhile if a completely unbiased random sample were needed, or if performing the statistical calculation(s) on the whole file would be prohibitive (so that the cost of creating the index is small in comparison).
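For what it's worth, here is a minimal sketch of the index idea in Python (the thread has no code of its own, so the language and the names `build_offset_index` and `sample_records` are my own, hypothetical choices). It assumes newline-terminated records and makes one full pass over the file to record where each record starts, after which every record is equally likely to be drawn.

```python
import random


def build_offset_index(path):
    """Return the byte offset at which each record (line) starts.

    This is the one full pass over the file that the index costs.
    """
    offsets = []
    with open(path, "rb") as fh:
        pos = 0
        for line in fh:
            offsets.append(pos)
            pos += len(line)
    return offsets


def sample_records(path, offsets, k, rng=random):
    """Pick k distinct records uniformly at random using the index."""
    chosen = rng.sample(offsets, k)
    records = []
    with open(path, "rb") as fh:
        # Visit the offsets in ascending order so the seeks move forward.
        for off in sorted(chosen):
            fh.seek(off)
            records.append(fh.readline().rstrip(b"\n"))
    return records
```

The index holds one integer per record, so for a file of many GBs with short records it may itself be large; whether that is acceptable depends on the use case.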
"for huge files (many GBs), with relatively short records (10s to 100s of bytes), and record length variations of a (say) maximum of 50% of the maximum"
Ah, this helps. In this case I would expect the bias introduced by the variation in the length of the record to be relatively small.
"Is that difference enough to invalidate using (say) 1000 random (byte positions), to choose a representative sample of the entire file?"
The answer to that question depends on the use case. If you are not bothered by it, I certainly won't argue otherwise. :-)
Given this information, I would try the "seek to a random position" approach. I would also suggest that under these conditions the bias due to record length is probably so small that taking the next record (instead of the one selected by the seek) may not be necessary (although programmatically it may be more convenient to do so).
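A rough sketch of the "seek to a random position" approach, in Python, might look like the following. The function name `sample_by_seek` is hypothetical, records are assumed to be newline-terminated, and this version takes the record after the one the seek lands in. Wrapping around to the first record when the seek lands in the last one is my own assumption, made so that every record can be selected.

```python
import os
import random


def sample_by_seek(path, k, rng=random):
    """Return k records, each found by seeking to a random byte offset.

    The record *following* the one that contains the offset is returned,
    so a record is chosen with probability proportional to the length of
    the record before it (sampling is with replacement).
    """
    size = os.path.getsize(path)
    records = []
    with open(path, "rb") as fh:
        for _ in range(k):
            fh.seek(rng.randrange(size))
            fh.readline()          # discard the rest of the current record
            line = fh.readline()   # the next complete record
            if not line:           # landed in the last record: wrap around
                fh.seek(0)
                line = fh.readline()
            records.append(line.rstrip(b"\n"))
    return records
```

Only k seeks and a couple of short reads per sample are needed, which is the whole attraction for a file of many GBs.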
In my very non-statistical opinion, sampling about 1/10,000th of the file might be enough to infer the average record length. I would want to know something about the distribution of record lengths before calculating the predicted min and max length, though, because the shape of the curve will affect the number of samples required to obtain an estimate with an error below a given threshold (set by you, of course). Therefore, depending on how critical the estimates are, you might want to employ an empirical approach that keeps taking samples until the error in the calculated values is below your given threshold.
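One way the "keep sampling until the error is small enough" idea could be sketched in Python is shown below. Using the standard error of the mean as the error measure is my assumption, since no particular measure is named above, and the names `estimate_mean_length`, `draw_length`, `tolerance`, `batch` and `max_samples` are all hypothetical. It only addresses the average length; estimating the min and max would need a different stopping rule.

```python
import math
import random
import statistics


def estimate_mean_length(draw_length, tolerance, batch=100, max_samples=100_000):
    """Sample record lengths in batches until the estimate is tight enough.

    draw_length: callable returning the length of one randomly chosen record.
    tolerance:   stop once the standard error of the mean is <= this value
                 (assumed error measure, not taken from the thread).
    Returns (mean, standard_error, number_of_samples).
    """
    if batch < 2 or max_samples < batch:
        raise ValueError("need batch >= 2 and max_samples >= batch")
    lengths = []
    while len(lengths) < max_samples:
        lengths.extend(draw_length() for _ in range(batch))
        mean = statistics.fmean(lengths)
        sem = statistics.stdev(lengths) / math.sqrt(len(lengths))
        if sem <= tolerance:
            break
    return mean, sem, len(lengths)
```

In practice `draw_length` would wrap whatever sampling method is in use, for example the length of a record obtained by seeking to a random byte position.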
*Rationale provided only to give context for my line of thought.