Re^6: Random sampling a variable length file.
by BrowserUk (Pope) on Dec 26, 2009 at 19:54 UTC
Would it be possible to generate an index in parallel with the creation of the file?
No. The application is a generalised file utility aimed at text-based records. Think csv, tsv, log files etc.
If not, would it be possible to scan the file for record delimiters as a pre-processing step to generate the index?
No, because creating an offset index requires reading the entire file, which negates the purpose of taking a random sample.
I do not see how the bias would be negated by reading the next record, ... For example, if one record was 90% of the entire file, then seeking to a random position in the file would result in landing in that record about 90% of the time and whatever record followed it would be chosen each time.
Agreed. In extremis, it doesn't.
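For reference, the technique under discussion can be sketched roughly as follows. This is only an illustration in Python, not the utility's actual code: the function name `pick_next_record`, the newline record delimiter, and the wrap-around to the first record when the seek lands in the last one are all assumptions.

```python
import os
import random

def pick_next_record(path, rng=random):
    """Seek to a random byte, discard the record we landed in,
    and return the record that follows it.

    Assumes a non-empty file of newline-terminated records.
    (Hypothetical helper; wrap-around at EOF is an assumption.)
    """
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        fh.seek(rng.randrange(size))   # uniformly random byte offset
        fh.readline()                  # skip the rest of the landed-in record
        record = fh.readline()         # the next complete record
        if not record:                 # landed in the last record: wrap around
            fh.seek(0)
            record = fh.readline()
    return record
```

With a file in which one record dominates, repeated calls return the record immediately after it almost every time, which is the extreme case conceded above.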
But in the general case, the application is for huge files (many GBs) with relatively short records (10s to 100s of bytes), whose lengths vary by (say) at most 50% of the maximum, typically less.
The (unsupported) notion is that, as the length of the next (picked) record is uncorrelated with the length of the record containing the picked position, the bias is reduced if not eliminated. I.e. the probability of a given record being picked is not correlated with its length.
I realise that it is correlated with the length of its positional predecessor, but does that matter?
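To make the dependence on the predecessor concrete: under the seek-then-read-next scheme, a record is picked exactly when the random byte falls anywhere in the record before it, so its pick probability is its predecessor's length divided by the file size. A small sketch, assuming lengths include the delimiter and that the first record is treated as following the last (the name `pick_probabilities` is made up for illustration):

```python
def pick_probabilities(lengths):
    """Exact pick probability of each record under seek-then-read-next.

    lengths[i] is the byte length of record i, delimiter included.
    Record i is picked when the random byte lands in record i-1;
    index -1 wraps to the last record (an assumption about EOF handling).
    """
    total = sum(lengths)
    return [lengths[i - 1] / total for i in range(len(lengths))]

# A record's own length does not appear in its probability at all;
# only its predecessor's length does.
for length, p in zip([10, 90, 5], pick_probabilities([10, 90, 5])):
    print(f"record of {length:3d} bytes -> picked with p = {p:.4f}")
```

So the picked record's length is unbiased only to the extent that neighbouring record lengths really are uncorrelated; whether that holds for a given file is an empirical question.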
If you have a 10GB file containing 100e6 records that average 100 bytes ± 50, then the maximum probability of a record being picked is 0.0000015%, and the minimum 0.0000005%. Is that difference enough to invalidate using (say) 1000 random byte positions to choose a representative sample of the entire file?
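The figures above follow directly from predecessor length divided by file size. A quick check, taking 10GB as 10e9 bytes (an assumption; the helper name is illustrative only):

```python
def pick_probability_bounds(file_size, mean_len, spread):
    """Return (p_min, p_max) for a single random byte position.

    A record is picked when the byte lands in its predecessor, so the
    bounds come from the shortest and longest possible predecessor.
    """
    p_min = (mean_len - spread) / file_size
    p_max = (mean_len + spread) / file_size
    return p_min, p_max

p_min, p_max = pick_probability_bounds(10 * 10**9, 100, 50)
print(f"max: {p_max * 100:.7f}%")
print(f"min: {p_min * 100:.7f}%")
print(f"ratio max/min: {p_max / p_min:.1f}")
```

The absolute probabilities are tiny, but the ratio between the most and least likely records is 3 to 1, and it is that ratio, not the absolute size, that determines how much weight a record carries in the sample.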
The type of information being inferred (estimated) from the sample:
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.