 
PerlMonks  

Re^3: Random sampling a variable length file.

by bcrowell2 (Friar)
on Dec 26, 2009 at 18:19 UTC ( [id://814436] )


in reply to Re^2: Random sampling a variable length file.
in thread Random sampling a variable record-length file.

Method #1: If there is no correlation between one record and the next, then reading from a random position and taking the next record after that should be fine.
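Method #1 can be sketched in a few lines. This is an illustrative sketch in Python (the same `seek`/`readline` logic ports directly to Perl); the function name and signature are my own, not from the post. It also handles the edge case noted later in the thread: landing inside the final record, in which case you wrap around and take the first one.

```python
import random

def sample_record(path, size):
    """Method #1: seek to a random byte offset, discard the
    (probably partial) record we land in, return the next one."""
    with open(path, "rb") as f:
        f.seek(random.randrange(size))
        f.readline()            # skip to the end of the record we landed in
        rec = f.readline()      # the next complete record
        if not rec:             # landed inside the final record:
            f.seek(0)           # wrap around and take the first record
            rec = f.readline()
        return rec
```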

Method #2: If there are important correlations between one record and the next, then one way of dealing with that would be to reorder the entire file in random order. For instance, read the file once in order to count the number of records, N, and while you're at it, generate an array that has the offset to each record. Generate a random permutation of the integers from 1 to N. Read back through the file and pull out the records in that order, writing them to a new copy of the file. Now just use method #1 on the randomized version of the file.
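The steps of method #2 can be sketched as follows (again Python for illustration; the function name is hypothetical). One pass records each record's byte offset, the offsets are shuffled into a random permutation, and a second pass writes the records out in that order.

```python
import random

def randomize_records(src, dst):
    """Method #2: rewrite a record-per-line file in random order."""
    # Pass 1: note the byte offset of every record.
    offsets = []
    with open(src, "rb") as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            offsets.append(pos)
    # Equivalent to a random permutation of the integers 1..N.
    random.shuffle(offsets)
    # Pass 2: pull the records out in the shuffled order.
    with open(src, "rb") as f, open(dst, "wb") as out:
        for off in offsets:
            f.seek(off)
            out.write(f.readline())
```

Method #1 can then be applied to the randomized copy.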

Is the file static, or is it changing a lot? If it's static, then method #2 should be fine. If it's changing all the time, and there are also correlations between successive records, then this becomes a more difficult problem. I think there are probably various ways to do it, but I suspect they all involve reinventing the wheel. Either you're going to reinvent filesystem-level support for random access to a file with varying record lengths, or you're going to reinvent a relational database. My suggestion would be to switch to a relational database. If that's not an option, and you really need to roll your own solution, then the optimal solution may depend on other details, e.g., do the changes to the file just involve steadily appending to it?


Replies are listed 'Best First'.
Re^4: Random sampling a variable length file.
by BrowserUk (Patriarch) on Dec 26, 2009 at 18:39 UTC
    1. There is no meaningful correlation in the ordering of the records.

      The problem with picking random (byte) positions is that, with variable-length records, longer records have a greater chance of being picked than shorter ones.

      But maybe that is negated to some extent because you would be using the next record--which might be longer or shorter--rather than the one picked?

    2. The file is static. It is only processed once.
    3. It is often huge. Time is of the essence.
    4. Reading the whole file to pick a sample negates the purpose of picking a sample.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Disclaimer: I am not a statistician. I don't even play one on TV.

      The problem with picking random (byte) positions is that, with variable-length records, longer records have a greater chance of being picked than shorter ones.
      But maybe that is negated to some extent because you would be using the next record--which might be longer or shorter--rather than the one picked?

      I had the same concern. Intuitively, I do not see how the bias would be negated by reading the next record, since there is a positional dependence between the two. For example, if one record was 90% of the entire file, then seeking to a random position in the file would result in landing in that record about 90% of the time and whatever record followed it would be chosen each time.

      If you want a random sample from the count of records, it may be difficult to use a selection method that is based on length.

      The file is static. It is only processed once.

      Would it be possible to generate an index in parallel with the creation of the file? If not, would it be possible to scan the file for record delimiters as a pre-processing step to generate the index? A list of offsets would be sufficient to accomplish this task and the approach would be very straightforward (think maintenance).

        Would it be possible to generate an index in parallel with the creation of the file?

        No. The application is a generalised file utility aimed at text-based records. Think CSV, TSV, log files, etc.

        If not, would it be possible to scan the file for record delimiters as a pre-processing step to generate the index?

        No. Because creating an offset index requires reading the entire file and negates the purpose of taking a random sample.

        I do not see how the bias would be negated by reading the next record, ... For example, if one record was 90% of the entire file, then seeking to a random position in the file would result in landing in that record about 90% of the time and whatever record followed it would be chosen each time.

        Agreed. In extremis, it doesn't.

        But in the general case, the application is for huge files (many GBs), with relatively short records (10s to 100s of bytes), and record-length variations of (say) at most 50% of the maximum, typically less.

        The (unsupported) notion is that, as the length of the next (picked) record is uncorrelated with the length of the record containing the picked position, the bias is reduced if not eliminated. I.e. the probability of a given record being picked is not correlated with its length.

        I realise that it is correlated with the length of its positional predecessor, but does that matter?

        If you have a 10GB file containing 100e6 records that average 100 bytes +/- 50, then the maximum probability of a record being picked is 0.0000015%, and the minimum 0.0000005%. Is that difference enough to invalidate using (say) 1000 random byte positions to choose a representative sample of the entire file?
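A quick sanity check of those figures (my own arithmetic, not from the post): with a 10 GB file sampled by random byte position, a record's chance of being landed in is its length divided by the file size.

```python
# 10 GB file; records of 100 bytes +/- 50, i.e. 50 to 150 bytes.
file_size = 10e9
p_max = 150 / file_size * 100   # chance (%) of landing in a 150-byte record
p_min =  50 / file_size * 100   # chance (%) of landing in a  50-byte record
# p_max is 0.0000015% and p_min is 0.0000005%, matching the post's figures.
```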

        The type of information being inferred (estimated) from the sample:

        • min/ave/max record length.
        • The number of records within the file.
        • The ascii-betical (or numerical) distribution of some field (or fields) within the file.

        That's a good point, bobf. I think there are two possibilities.

        (1) He only needs to get a random sample from this file once.

        (2) He needs to get random samples from this file more than once, and needs each one to be random not only in and of itself but also in the sense of being uncorrelated with the other samples.

        If it's #1, then I think it works to take a random byte position and then read the next record. If it's #2, then he can't use that method, and I think he clearly would be better off creating an index (or using the facilities of a database or filesystem).

      Yeah, if they're uncorrelated, then there's no bias introduced. You may need to special-case the situation where you randomly land on the final record, in which case you have to pick the first record.
      1. There is no meaningful correlation in the ordering of the records.

      How certain are you that this is true? If there is no correlation between any characteristic of interest in a record and the record's position within the file, then taking a sequential sample from an arbitrary location in the file (like the beginning) is entirely unbiased by record size. It's also a very efficient way (computationally, not statistically) to sample the file.

      You ask a number of highly technical questions, like "[h]ow many records should you pick?" Answers to this typically range from rules of thumb to equations for computing a sample size that meets some specification. What to use is highly dependent on what you are trying to do. Meeting regulatory requirements is very different from monitoring operations. Can you say more about what you are trying to do?

        How certain are you that this is true? If there is no correlation between any characteristic of interest in a record

        Simple. I cannot know what will be inside the file! Because the user may apply the process to any file of their choosing.

        So, just as the pollster might discover that the "random selection" they draw from the populace happens, coincidentally, to consist of the entire membership of some extremist political organisation, they cannot know that until they take the sample.

        Put another way, there may be all manner of correlations, but none of them are known, and so cannot be utilised.

        So, at this point, the problem is how to take a statistically valid, random sample of records of any file, without resorting to reading the entire file. I've described the inferences to be drawn elsewhere.


