Re: Random sampling a variable record-length file.

by wfsp (Abbot)
on Dec 27, 2009 at 13:13 UTC

in reply to Random sampling a variable record-length file.

Perhaps a batch of records would be a valid sample of already random records? Wouldn't the first hundred be as valid a sample as any chosen by any other method?

Rather than count the first hundred you could do as they allegedly do for the Labour vote in parts of South Wales - weigh them. In this case read 4KB worth.

If you wanted more than one batch you could take a batch from the middle and the end too. You could do your stats on each, compare them and if there is a close enough correlation you're done.

You could change the size and number of batches to suit the time available/accuracy required (start expensive and reduce as confidence is established).
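A rough sketch of the batch idea, in Python for illustration only. The function names, the 4KB default and the choice of mean record length as the statistic are all assumptions made for the example; nothing here comes from the original question. It reads a batch from the start, middle and end of the file, throws away the partial records at the edges of each batch, and reports a simple statistic per batch so they can be compared.

```python
import os
import statistics

def read_batch(path, offset, size=4096):
    """Return the complete newline-terminated records in `size` bytes at `offset`."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        chunk = fh.read(size)
    pieces = chunk.split(b"\n")
    # Unless we started at byte 0, the first piece is probably a partial record.
    if offset > 0:
        pieces = pieces[1:]
    # The last piece is a partial record (or empty, after a final newline).
    return pieces[:-1]

def batch_stats(path, size=4096):
    """Return (offset, record_count, mean_record_length) for three batches."""
    total = os.path.getsize(path)
    # Hypothetical choice of positions: start, middle and end of the file.
    offsets = [0, max(0, (total - size) // 2), max(0, total - size)]
    results = []
    for off in offsets:
        lengths = [len(rec) for rec in read_batch(path, off, size)]
        mean = statistics.mean(lengths) if lengths else 0.0
        results.append((off, len(lengths), mean))
    return results
```

If the three means (or whatever statistic matters to you) agree closely, that is some evidence the batches are representative; if they differ, increase the batch size or the number of batches.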

Likely not the answer you're looking for but my background in this sort of thing revolved around buckets of rivets rather than CSV files. :-)


Re^2: Random sampling a variable record-length file.
by BrowserUk (Patriarch) on Dec 27, 2009 at 16:04 UTC

    I follow your meaning, but I don't think it gels with the theory of Normal distributions & Sampling.

    buckets of rivets rather than CSV files. :-)

    I think that the distinction is that grabbing a handful from a bucket of rivets does not imply any positional correlation between the elements of the sample--they tend to mix random(ish)ly as they fall into the bucket.

    Machine tools tend to wear with use, so it's pretty standard practice to set up the machine tool to operate at one end of the tolerance, so that as the tool wears, it slowly drifts towards the other end. If you took a sample entirely from the beginning of the run--or the end--then the sample would not be representative--in terms of average/mean/mode/variance--of the entire run.

    But grabbing a handful from the collection hopper where they will have tended to randomly mix should be representative.

    Similarly, a contiguous sequence from the beginning, end or middle of a file is probabilistically less likely to be a representative sample than one picked at random from the entire file.
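    For contrast, one common way to pick records from random positions across the whole file is sketched below, in Python for illustration only. The function name and the wrap-around rule are assumptions made for this example, not a method taken from this thread. It assumes a non-empty file of newline-terminated records.

```python
import os
import random

def sample_records(path, n, rng=random):
    """Pick n records (with replacement) by seeking to random byte offsets."""
    size = os.path.getsize(path)      # assumed to be greater than zero
    records = []
    with open(path, "rb") as fh:
        for _ in range(n):
            fh.seek(rng.randrange(size))
            fh.readline()             # discard the record we landed in
            line = fh.readline()      # take the next complete record
            if not line:              # landed in the last record: wrap around
                fh.seek(0)
                line = fh.readline()
            records.append(line.rstrip(b"\n"))
    return records
```

    Note that this is not perfectly uniform: a record is chosen with probability proportional to the length of the record before it, so records that follow long records are over-represented, and the same record can be picked more than once. Whether that bias matters depends on whether record length correlates with the values being measured.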

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

