Re^6: Random sampling a variable length file.

by BrowserUk (Pope)
on Dec 26, 2009 at 19:54 UTC


in reply to Re^5: Random sampling a variable length file.
in thread Random sampling a variable record-length file.

Would it be possible to generate an index in parallel with the creation of the file?

No. The application is a generalised file utility aimed at text-based records. Think csv, tsv, log files etc.

If not, would it be possible to scan the file for record delimiters as a pre-processing step to generate the index?

No, because creating an offset index requires reading the entire file and negates the purpose of taking a random sample.

I do not see how the bias would be negated by reading the next record, ... For example, if one record was 90% of the entire file, then seeking to a random position in the file would result in landing in that record about 90% of the time and whatever record followed it would be chosen each time.

Agreed. In extremis, it doesn't.

But in the general case, the application is for huge files (many GBs), with relatively short records (10s to 100s of bytes), and record-length variations of (say) at most 50% of the maximum, typically less.

The (unsupported) notion is that, because the length of the next (picked) record is uncorrelated with the length of the record containing the picked position, the bias is reduced if not eliminated. I.e., the probability of a given record being picked is not correlated with its own length.

I realise that it is correlated with the length of its positional predecessor, but does that matter?

If you have a 10GB file containing 100e6 records that average 100 bytes ±50, then the maximum probability of a record being picked (proportional to the length of the record that precedes it) is 0.0000015%, and the minimum 0.0000005%. Is that difference enough to invalidate using (say) 1000 random (byte positions), to choose a representative sample of the entire file?
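
For concreteness, this is roughly the shape of the sampler I have in mind. A minimal sketch only: it assumes newline-terminated records, and the filename and sample count are placeholders.

    ## Rough sketch of the seek-to-a-random-byte, take-the-next-record idea.
    ## Assumes newline-terminated records; the filename and sample count are placeholders.
    use strict;
    use warnings;

    my( $file, $nSamples ) = ( 'huge.log', 1000 );

    open my $fh, '<', $file or die "$file: $!";
    my $size = -s $fh;

    my @lengths;
    for ( 1 .. $nSamples ) {
        seek $fh, int( rand $size ), 0;     ## jump to a random byte offset
        <$fh>;                              ## throw away the (partial) record we landed in
        my $rec = <$fh>;                    ## take the next complete record
        redo unless defined $rec;           ## landed inside the last record: draw again
        chomp $rec;
        push @lengths, length $rec;
    }

    my $total = 0;
    $total += $_ for @lengths;
    printf "%d records sampled; average length %.1f bytes\n",
        scalar @lengths, $total / @lengths;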

The type of information being inferred (estimated) from the sample (a rough sketch of the arithmetic follows the list):

  • min/ave/max record length.
  • The number of records within the file.
  • The ascii-betical (or numerical) distribution of some field (or fields) within the file.
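
E.g., the arithmetic for those estimates is nothing more than the following sketch; the @lengths array and the 10GB figure are stand-ins for the output of a real sampling run.

    ## Sketch: deriving the listed estimates from a sample of record lengths.
    ## @lengths and $filesize here are stand-ins for a real sampling run.
    use strict;
    use warnings;
    use List::Util qw( min max sum );

    my $filesize = 10e9;
    my @lengths  = map { 50 + int rand 101 } 1 .. 1000;

    my $ave = sum( @lengths ) / @lengths;
    printf "min/ave/max record length : %d / %.1f / %d\n",
        min( @lengths ), $ave, max( @lengths );
    printf "estimated record count    : %.0f\n", $filesize / $ave;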

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^7: Random sampling a variable length file.
by bobf (Monsignor) on Dec 26, 2009 at 22:43 UTC

    creating an offset index requires reading the entire file and negates the purpose of taking a random sample
    Yes, creating the index would require reading the whole file (although I would imagine this could be done very quickly in C). If the point of the random sample was to avoid reading the whole file, then I would also agree that creating the index would negate the purpose of the random sample. However, without trying to split hairs or begin a thread that in the grand scheme of things is moot, I would contend that creating an index may be important in other use cases (not necessarily yours)*. For example, if a completely unbiased random sample were needed, or if performing the statistical calculation(s) on the whole file would be prohibitive (so the cost of creating the index was small in comparison), etc.
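
    A sketch of the kind of index I mean, assuming newline-delimited records (in Perl rather than C, purely for illustration; the filename is a placeholder):

        ## Sketch: build an index of record start offsets in one sequential pass.
        ## Assumes newline-delimited records; the filename is a placeholder.
        use strict;
        use warnings;

        my $file = 'huge.log';
        open my $fh, '<', $file or die "$file: $!";

        my @offset = ( 0 );                     ## the first record starts at byte 0
        push @offset, tell $fh while <$fh>;     ## after each record, note where the next begins
        pop @offset;                            ## the final entry is EOF, not a record start

        ## An unbiased pick is then simply:
        ##   seek $fh, $offset[ int rand @offset ], 0;  my $rec = <$fh>;
        printf "%d records indexed\n", scalar @offset;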

    for huge files (many GBs), with relatively short records (10s to 100s of bytes), and record-length variations of (say) at most 50% of the maximum
    Ah, this helps. In this case I would expect the bias introduced by the variation in the length of the record to be relatively small.

    Is that difference enough to invalidate using (say) 1000 random (byte positions), to choose a representative sample of the entire file?
    The answer to that question is dependent on the use case. If you are not bothered by it, I certainly won't argue otherwise. :-)

    Given this information, I would try the "seek to a random position" approach. I would also suggest that under these conditions the bias due to record length is probably so small that taking the next record (instead of the one selected by the seek) may not be necessary (although programmatically it may be more convenient to do so).

    In my very non-statistical opinion, sampling about 1/10,000th of the file might be enough to infer the average record length, but I would want to know something about the distribution of record lengths before calculating the predicted min and max length, because the shape of the curve will affect the number of samples required to obtain an estimate whose error is below a given threshold (set by you, of course). Therefore, depending on how critical the estimates need to be, you might want to employ an empirical approach that keeps taking samples until the error in the calculated values is below your given threshold.
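
    Something like the following loop is what I have in mind by "keeps taking samples"; the stand-in length generator, the standard-error-of-the-mean stopping rule and the 1% target are only placeholders for whatever statistic and threshold matter to you.

        ## Sketch: keep sampling until the standard error of the mean record length
        ## drops below a chosen relative threshold. sample_one_length() stands in
        ## for the real seek-and-read step; the 1% target is arbitrary.
        use strict;
        use warnings;

        sub sample_one_length { 50 + int rand 101 }     ## stand-in for a real sample

        my ( $n, $sum, $sumsq ) = ( 0, 0, 0 );
        my $target = 0.01;                              ## 1% relative standard error

        while ( 1 ) {
            my $len = sample_one_length();
            $n++; $sum += $len; $sumsq += $len * $len;
            next if $n < 30;                            ## too few points to judge the error

            my $mean   = $sum / $n;
            my $var    = ( $sumsq - $sum * $sum / $n ) / ( $n - 1 );
            my $stderr = sqrt( $var / $n );
            last if $stderr / $mean < $target;
        }
        printf "stopped after %d samples; estimated mean length %.1f\n", $n, $sum / $n;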

    *Rationale given only to provide context for my line of thought.

Re^7: Random sampling a variable length file.
by bellaire (Hermit) on Dec 26, 2009 at 22:33 UTC
    You could try a generative approach to answering this question. Generate random data, use your method to sample the data. Then scrutinize your samples to see whether they are, in fact, randomly selected, or if there appears to be a bias.

    My intuition wants to say that if there is no correlation between the lengths of adjacent records, then it doesn't matter that you are selecting records that follow long records preferentially, because following long records doesn't correlate with anything. Put another way, if all of your records have an equal chance of following a long record (or more generally, any other particular record), then the sampling method is as valid as any other.
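
    Roughly along these lines; the record sizes, trial counts and the crude min/max comparison are all arbitrary choices, so treat it as a sketch of the experiment rather than a finished test.

        ## Sketch of the generative test: write a synthetic file of known records,
        ## sample it by seeking to random byte offsets and taking the next record,
        ## then compare how often each record was picked against the uniform
        ## expectation. Sizes, counts and the final check are arbitrary choices.
        use strict;
        use warnings;

        my $nRecs   = 10_000;
        my $nTrials = 1_000_000;
        my $file    = 'synthetic.dat';

        open my $out, '>', $file or die $!;
        print {$out} 'x' x ( 50 + int rand 101 ), "\n" for 1 .. $nRecs;
        close $out;

        open my $fh, '<', $file or die $!;
        my $size = -s $fh;

        ## Map each record's start offset to its index, purely so we can score picks.
        my %index_of;
        my $i = 0;
        $index_of{ 0 } = $i;
        $index_of{ tell $fh } = ++$i while <$fh>;
        delete $index_of{ $size };              ## EOF is not the start of a record

        my @hits = ( 0 ) x $nRecs;
        for ( 1 .. $nTrials ) {
            seek $fh, int( rand $size ), 0;
            <$fh>;                              ## discard the partial record we landed in
            my $pos = tell $fh;
            redo if $pos >= $size;              ## landed in the last record: draw again
                                                ## (note this leaves the very first record
                                                ## unpickable -- exactly the sort of bias
                                                ## this test should surface)
            $hits[ $index_of{ $pos } ]++;
        }

        my $expect = $nTrials / $nRecs;
        my ( $least, $most ) = ( sort { $a <=> $b } @hits )[ 0, -1 ];
        printf "expected %.1f picks/record; observed min %d, max %d\n",
            $expect, $least, $most;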

      My intuition wants to say that if there is no correlation between the lengths of adjacent records, then it doesn't matter that you are selecting records that follow long records preferentially, because following long records doesn't correlate with anything. Put another way, if all of your records have an equal chance of following a long record (or more generally, any other particular record), then the sampling method is as valid as any other.

      Thank you! That's what my intuition is telling me. I was hoping one of the math guys around these parts (the set of whom you may or may not be a member, I have no way of knowing :) would be able to put some semi-formal buttressing behind that intuition.

      But in the absence of that, the fact that at least one other person has a similar intuition (and has defined the logic for it in their own words), and that no strong counter-argument has been stated, gives me a good enough feeling to make it worthwhile pursuing to the next level. I.e., coding up something crude and attempting to define a test scenario to substantiate it.

      Any thoughts on a test scenario that might avoid the mistake of inherently confirming what I'm looking for?


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Not sure. You'd probably need some way to estimate whether your sampling distribution is uniform with respect to the index of the sample. Also, you could see whether the average length of the records in your sample jibes with the average length of records in the entire population.

        My other thoughts overnight had to do with the pathological case presented by bobf:

        • To avoid the scenario where you pick the same record 90% of the time if one record is 90% of the file, you need to avoid already-selected records.
        • To give the large record itself a fair chance of being selected, you need to perform the wrapping suggested by bcrowell2, that is, selecting the first record if you land inside the last.

        Taken together, these make even the extreme case just as amenable to this method as any other. If you remember which records you've hit and do not re-sample them, you're simply omitting a segment of the number line from a uniform distribution. The distributions on either side are still uniform, i.e., random.

        So even if you are hitting the big record 90% of the time, you ignore it after the first time, and then the other 10% of the hits select records as normal. Since any record at all can follow the 90%-length record, that's fair. And since the length of the last record has nothing to do with the length of the first, it has the same likelihood of being selected as any record.
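
        In code, the two refinements together might look something like this sketch; $fh, $size and $wanted are assumed to come from the surrounding program, and $wanted must not exceed the number of records in the file.

            ## Sketch: combine the wrap (the last record's follower is the first record)
            ## with remembering offsets already picked, so no record is re-sampled.
            ## $fh, $size and $wanted are assumed to be supplied by the caller, and
            ## $wanted must be no larger than the number of records in the file.
            use strict;
            use warnings;

            sub sample_distinct_starts {
                my ( $fh, $size, $wanted ) = @_;
                my %seen;                              ## record start offsets already chosen
                while ( keys( %seen ) < $wanted ) {
                    seek $fh, int( rand $size ), 0;
                    <$fh>;                             ## discard the partial record we landed in
                    my $pos = tell $fh;
                    $pos = 0 if $pos >= $size;         ## wrap: inside the last record -> first record
                    $seen{ $pos } = 1;                 ## silently re-draw if already picked
                }
                return keys %seen;                     ## seek to each offset and read to get the records
            }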
