Keep It Simple, Stupid PerlMonks

### Re^2: Random sampling a variable length file.

by BrowserUk (Patriarch)
 on Dec 26, 2009 at 17:24 UTC Need Help??

That's a neat method of picking a random selection. But for statistical purposes, if you have to visit every record in order to generate your sample, you might as well just apply the statistical process to the entire set and forget random sampling.

If you seek to a random position within the file and then read 2 records discarding the first. the second will always be a complete record. That way, you can pick a 100 or 1000 element sample without visiting millions of records. But as the records are variable length, they won't all have the same chance of being picked.

So then the question becomes:

1. How much affect does the variability of length have upon the statistical validity of the sample?

Can this be bounded such that the statistics remain (or become) valid?

2. How many records should you pick?

Given a huge file with variable length records, you don't know how many it contains. But as you gather the sample, you can more and more accurately estimate that number statistically.

Can that be used to determine the size of the sample required?

3. Can the affects of the variable length be compensated for?

Can you use statistics applied to the sample gathered so far, to adjust the sampling process to correct its validity?

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
• Comment on Re^2: Random sampling a variable length file.

Replies are listed 'Best First'.
Re^3: Random sampling a variable length file.
by bcrowell2 (Friar) on Dec 26, 2009 at 18:19 UTC

Method #1: If there is no correlation between one record and the next, then reading from a random position and taking the next record after that should be fine.

Method #2: If there are important correlations between one record and the next, then one way of dealing with that would be to reorder the entire file in random order. For instance, read the file once in order to count the number of records, N, and while you're at it, generate an array that has the offset to each record. Generate a random permutation of the integers from 1 to N. Read back through the file and pull out the records in that order, writing them to a new copy of the file. Now just use method #1 on the randomized version of the file.

Is the file static, or is it changing a lot? If it's static, then method #2 should be fine. If it's changing all the time, and there are also correlations between successive records, then this becomes a more difficult problem. I think there are probably various ways to do it, but I suspect they all involve reinventing the wheel. Either you're going to reinvent filesystem-level support for random access to a file with varying record lengths, or you're going to reinvent a relational database. My suggestion would be to switch to a relational database. If that's not an option, and you really need to roll your own solution, then the optimal solution may depend on other details, e.g., do the changes to the file just involve steadily appending to it?

1. There is no meaningful correlation in the ordering of the records.

The problem with the picking random (byte) positions, is that with varible length records, longer records have a greater chance of being picked than shorter ones.

But maybe that is negated to some extent because you would be using the next record--which might be longer or shorter--rather than the one picked?

2. The file is static. It is only processed once.
3. It is often huge. Time is of the essence.
4. Reading the whole file to pick a sample negates the purpose of picking a sample.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Disclaimer: I am not a statistician. I don't even play one on TV.

The problem with the picking random (byte) positions, is that with varible length records, longer records have a greater chance of being picked than shorter ones.
But maybe that is negated to some extent because you would be using the next record--which might be longer or shorter--rather than the one picked?

I had the same concern. Intuitively, I do not see how the bias would be negated by reading the next record, since there is a positional dependence between the two. For example, if one record was 90% of the entire file, then seeking to a random position in the file would result in landing in that record about 90% of the time and whatever record followed it would be chosen each time.

If you want a random sample from the count of records, it may be difficult to use a selection method that is based on length.

The file is static. It is only processed once.

Would it be possible to generate an index in parallel with the creation of the file? If not, would it be possible to scan the file for record delimiters as a pre-processing step to generate the index? A list of offsets would be sufficient to accomplish this task and the approach would be very straightforward (think maintenance).

Yeah, if they're uncorrelated, then there's no bias introduced. You may need to special-case the situation where you randomly land on the final record, in which case you have to pick the first record.
1. There is no meaningful correlation in the ordering of the records.

How certain are you that this is true? If there is no correlation between any characteristic of interest in a record and the record's position within the file, then taking a sequential sample from an arbitrary location in the file (like the beginning) is entirely unbiased by record size. It's also a very efficient way (computationally, not statistically) to sample the file.

You ask a number of highly technical questions, like "[h]ow many records should you pick?" Answers to this typically range from rules of thumb to equations for computing a sample size that meets some specification. What to use is highly dependent on what you are trying to do. Meeting regulatory requirements is very different from monitoring operations. Can you say more about what you are trying to do?

Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://814431]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others musing on the Monastery: (2)
As of 2024-05-25 05:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?

No recent polls found