PerlMonks  

Re^5: Random sampling a variable length file.

by bobf (Monsignor)
on Dec 26, 2009 at 22:05 UTC (#814456)


in reply to Re^4: Random sampling a variable length file.
in thread Random sampling a variable record-length file.

Disclaimer: I am not a statistician. I don't even play one on TV.

The problem with picking random (byte) positions is that, with variable-length records, longer records have a greater chance of being picked than shorter ones.
But maybe that is negated to some extent because you would be using the next record--which might be longer or shorter--rather than the one picked?

I had the same concern. Intuitively, I do not see how the bias would be negated by reading the next record, since there is a positional dependence between the two. For example, if one record was 90% of the entire file, then seeking to a random position in the file would result in landing in that record about 90% of the time and whatever record followed it would be chosen each time.

If you want a random sample from the count of records, it may be difficult to use a selection method that is based on length.

The file is static. It is only processed once.

Would it be possible to generate an index in parallel with the creation of the file? If not, would it be possible to scan the file for record delimiters as a pre-processing step to generate the index? A list of offsets would be sufficient to accomplish this task and the approach would be very straightforward (think maintenance).
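Something like this (untested sketch; assumes newline-delimited records, and the sub name is mine) is the sort of pre-processing step I had in mind:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: build a list of record start offsets for a
# newline-delimited file, so a record can later be picked
# uniformly by index regardless of its length.
sub build_offset_index {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    my @offsets = (0);            # first record starts at byte 0
    while (<$fh>) {
        push @offsets, tell $fh;  # start of the *next* record
    }
    pop @offsets;                 # last entry is EOF, not a record
    close $fh;
    return \@offsets;
}

# Uniform pick: every record is equally likely.
# my $offsets = build_offset_index('big.log');
# my $start   = $offsets->[ int rand @$offsets ];
```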


Re^6: Random sampling a variable length file.
by bcrowell2 (Friar) on Dec 26, 2009 at 22:14 UTC
    That's a good point, bobf. I think there are two possibilities.

    (1) He only needs to get a random sample from this file once.

    (2) He needs to get random samples from this file more than once, and needs each one to be random not only in and of itself but also in the sense of being uncorrelated with the other samples.

    If it's #1, then I think it works to take a random byte position and then read the next record. If it's #2, then he can't use that method, and I think he clearly would be better off creating an index (or using the facilities of a database or filesystem).

Re^6: Random sampling a variable length file.
by BrowserUk (Pope) on Dec 27, 2009 at 00:54 UTC
    Would it be possible to generate an index in parallel with the creation of the file?

    No. The application is a generalised file utility aimed at text-based records. Think csv, tsv, log files etc.

    If not, would it be possible to scan the file for record delimiters as a pre-processing step to generate the index?

    No. Creating an offset index requires reading the entire file, which negates the purpose of taking a random sample.

    I do not see how the bias would be negated by reading the next record, ... For example, if one record was 90% of the entire file, then seeking to a random position in the file would result in landing in that record about 90% of the time and whatever record followed it would be chosen each time.

    Agreed. In extremis, it doesn't.

    But in the general case, the application is for huge files (many GBs), with relatively short records (10s to 100s of bytes), and record-length variations of at most (say) 50% of the maximum, typically less.

    The (unsupported) notion is that, as the length of the next (picked) record is uncorrelated with the length of the record containing the picked position, the bias is reduced if not eliminated. I.e. the probability of a given record being picked is not correlated with its length.

    I realise that it is correlated with the length of its positional predecessor, but does that matter?

    If you have a 10GB file containing 100e6 records that average 100 bytes +- 50, then the maximum probability of a record being picked is 0.0000015%; and the minimum 0.0000005%. Is that difference enough to invalidate using (say) 1000 random (byte positions), to choose a representative sample of the entire file?

    The type of information being inferred (estimated) from the sample:

    • min/ave/max record length.
    • The number of records within the file.
    • The ascii-betical (or numerical) distribution of some field (or fields) within the file.
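    For concreteness, a crude sketch of how the first two of those might be derived from a bag of sampled record lengths (the sub name and the idea of estimating the count as file size over mean length are mine, not a spec):

```perl
use strict;
use warnings;

# Sketch: given record lengths from a sample, estimate the stats
# listed above. The caller supplies the total file size in bytes.
sub estimate_stats {
    my ( $file_size, @lengths ) = @_;
    my ( $min, $max, $sum ) = ( $lengths[0], $lengths[0], 0 );
    for (@lengths) {
        $min = $_ if $_ < $min;
        $max = $_ if $_ > $max;
        $sum += $_;
    }
    my $avg = $sum / @lengths;
    return {
        min   => $min,
        avg   => $avg,
        max   => $max,
        count => int( $file_size / $avg ),  # estimated record count
    };
}
```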

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      You could try a generative approach to answering this question. Generate random data, use your method to sample the data. Then scrutinize your samples to see whether they are, in fact, randomly selected, or if there appears to be a bias.
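      A crude sketch of what I mean (record sizes and trial count pulled out of thin air; the whole "file" is simulated in memory):

```perl
use strict;
use warnings;

# Generative check: build a fake "file" of variable-length records,
# pick random byte positions, take the record *after* the one landed
# in, and tally how often each record gets chosen.
my @lengths = map { 50 + int rand 101 } 1 .. 1000;    # 50..150 bytes
my @starts;
my $pos = 0;
for my $len (@lengths) {
    push @starts, $pos;
    $pos += $len;
}
my $file_size = $pos;

my $trials = 20_000;
my %picked;
for ( 1 .. $trials ) {
    my $byte = int rand $file_size;

    # binary search: largest $i with $starts[$i] <= $byte
    my ( $lo, $hi ) = ( 0, $#starts );
    while ( $lo < $hi ) {
        my $mid = int( ( $lo + $hi + 1 ) / 2 );
        if ( $starts[$mid] <= $byte ) { $lo = $mid }
        else                          { $hi = $mid - 1 }
    }

    my $next = ( $lo + 1 ) % @lengths;   # wrap: after the last, take the first
    $picked{$next}++;
}

# Unbiased sampling would pick each record ~ $trials / @lengths times;
# a large spread in these counts would reveal a bias.
```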

      My intuition wants to say that if there is no correlation between the lengths of adjacent records, then it doesn't matter that you are selecting records that follow long records preferentially, because following long records doesn't correlate with anything. Put another way, if all of your records have an equal chance of following a long record (or more generally, any other particular record), then the sampling method is as valid as any other.

        My intuition wants to say that if there is no correlation between the lengths of adjacent records, then it doesn't matter that you are selecting records that follow long records preferentially, because following long records doesn't correlate with anything. Put another way, if all of your records have an equal chance of following a long record (or more generally, any other particular record), then the sampling method is as valid as any other.

        Thank you! That's what my intuition is telling me. I was hoping one of the math guys around these parts (the set of whom you may or may not be a member, I have no way of knowing:), would be able to put some semi-formal buttressing behind that intuition.

        But in the absence of that, the fact that at least one other person has a similar intuition--and has defined the logic for it in their own words--and that no strong counter-argument has been stated, gives me a good enough feeling to make it worthwhile pursuing to the next level. I.e. coding up something crude and attempting to define a test scenario to substantiate it.

        Any thoughts on a test scenario that might avoid the mistake of inherently confirming what I'm looking for?


      creating an offset index requires reading the entire file and negates the purpose of taking a random sample
      Yes, creating the index would require reading the whole file (although I would imagine this could be done very quickly in C). If the point of the random sample was to avoid reading the whole file, then I would also agree that creating the index would negate the purpose of the random sample. However, without trying to split hairs or begin a thread that in the grand scheme of things is moot, I would contend that creating an index may be important in other use cases (not necessarily yours)*. For example, if a completely unbiased random sample were needed, or if performing the statistical calculation(s) on the whole file would be prohibitive (so the cost of creating the index was small in comparison), etc.

      for huge files (many GBs), with relatively short records (10s to 100s of bytes), and record-length variations of at most (say) 50% of the maximum
      Ah, this helps. In this case I would expect the bias introduced by the variation in the length of the record to be relatively small.

      Is that difference enough to invalidate using (say) 1000 random (byte positions), to choose a representative sample of the entire file?
      The answer to that question is dependent on the use case. If you are not bothered by it, I certainly won't argue otherwise. :-)

      Given this information, I would try the "seek to a random position" approach. I would also suggest that under these conditions the bias due to record length is probably so small that taking the next record (instead of the one selected by the seek) may not be necessary (although programmatically it may be more convenient to do so).
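      In Perl, the seek-and-read-next approach might look something like this (untested sketch; assumes newline-delimited records, and the sub name is mine):

```perl
use strict;
use warnings;

# Sketch: seek to a random byte, discard the (partial) record landed
# in, and keep the next one. If the seek lands in the final record,
# wrap around and take the first record instead.
sub sample_records {
    my ( $path, $n ) = @_;
    my $size = -s $path;
    open my $fh, '<:raw', $path or die "open $path: $!";
    my @sample;
    for ( 1 .. $n ) {
        seek $fh, int rand $size, 0;
        <$fh>;                      # discard the record we landed in
        my $rec = <$fh>;
        if ( !defined $rec ) {      # fell off the end of the file
            seek $fh, 0, 0;
            $rec = <$fh>;
        }
        push @sample, $rec;
    }
    close $fh;
    return \@sample;
}
```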

      In my very non-statistical opinion, sampling about 1/10,000th of the file might be enough to infer the average record length. However, I would want to know something about the distribution of record lengths before calculating the predicted min and max, because the shape of the curve will affect the number of samples required to obtain an estimate with an accuracy below a given threshold (set by you, of course). Therefore, depending on how critical the estimates need to be, you might want to employ an empirical approach that keeps taking samples until the error in the calculated values falls below your threshold.
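      A crude sketch of that empirical approach (the stopping rule here is the standard error of the mean; the sub name, thresholds, and callback interface are all mine):

```perl
use strict;
use warnings;

# Sketch: keep drawing sampled record lengths from the callback until
# the standard error of the mean falls below $threshold, bounded by a
# minimum and maximum sample count.
sub sample_until_stable {
    my ( $draw, $threshold, $min_n, $max_n ) = @_;
    my ( $n, $sum, $sumsq ) = ( 0, 0, 0 );
    while ( $n < $max_n ) {
        my $x = $draw->();
        $n++;
        $sum   += $x;
        $sumsq += $x * $x;
        next if $n < $min_n;
        my $var    = ( $sumsq - $sum * $sum / $n ) / ( $n - 1 );
        my $stderr = sqrt( $var / $n );
        last if $stderr < $threshold;
    }
    return ( $sum / $n, $n );    # estimated mean, samples used
}
```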

      *Rationale provided only to provide context of my line of thought.
