Beefy Boxes and Bandwidth Generously Provided by pair Networks
Come for the quick hacks, stay for the epiphanies.
 
PerlMonks  

Re: Random sampling a variable length file.

by ambrus (Abbot)
on Dec 26, 2009 at 14:32 UTC ( [id://814424]=note: print w/replies, xml ) Need Help??


in reply to Random sampling a variable record-length file.

Yes, it's called "random sample" indeed, you've got the right keyword so you just have had to search and you'd have found this excellent past thread: improving the efficiency of a script.

(Short answer, because my reply there isn't clear: if you need a sample of k records, take an array holding k records, initialize it with the first k records of your file; then reading the rest of the file sequentially, and for each record, if its (zero-based one-based) index in the file is n, roll a dice of n sides, and if it lands on one of the firs k sides, replace the element of that index in your array with that record.)

Update: sorry, above procedure is wrong, you've got to take a dice whose number of sides is the one-based index of the record in the file.

To make this clearer, here's some code. Records are one per line, first command line argument is number of samples you need. I assumed throughout this node that you want samples without repetition and that the order of samples don't matter. (Before you ask, yes, I do know about $. and even use it sometimes.)

perl -we 'my $k = int(shift); my @a; my $l = 0; while (<>) { if ($l++ +< $k) { push @a, $_ } else { if ((my $j = rand($l)) < $k) { $a[$j] = +$_; } } } print @a;' 3 filename

Update: It's easy to make an error in these kinds of things, so you have to test them. Below shows that you get all 20 possible choices of 3 out of 6 with approximately equal frequency, so we can hope it's a truly uniform random choice.

$ cat a one two three four five six $ (for x in {1..33333}; do perl -we 'my $k = int(shift); my @a; my $l += 0; while (<>) { if ($l++ < $k) { push @a, $_ } else { if ((my $j = +rand($l)) < $k) { $a[$j] = $_; } } } print @a;' 3 a | sort | tr \\n \ + ; echo; done) | sort | uniq -c | sort -rn 1747 five four three 1736 five one six 1735 five three two 1725 four three two 1707 one six three 1695 five four two 1685 five six three 1684 five six two 1678 one three two 1666 five four six 1663 four six two 1663 four one six 1663 five four one 1645 four one two 1640 four six three 1637 five one three 1616 five one two 1592 six three two 1578 one six two 1578 four one three $

Replies are listed 'Best First'.
Re^2: Random sampling a variable length file.
by BrowserUk (Patriarch) on Dec 26, 2009 at 17:24 UTC

    That's a neat method of picking a random selection. But for statistical purposes, if you have to visit every record in order to generate your sample, you might as well just apply the statistical process to the entire set and forget random sampling.

    If you seek to a random position within the file and then read 2 records discarding the first. the second will always be a complete record. That way, you can pick a 100 or 1000 element sample without visiting millions of records. But as the records are variable length, they won't all have the same chance of being picked.

    So then the question becomes:

    1. How much affect does the variability of length have upon the statistical validity of the sample?

      Can this be bounded such that the statistics remain (or become) valid?

    2. How many records should you pick?

      Given a huge file with variable length records, you don't know how many it contains. But as you gather the sample, you can more and more accurately estimate that number statistically.

      Can that be used to determine the size of the sample required?

    3. Can the affects of the variable length be compensated for?

      Can you use statistics applied to the sample gathered so far, to adjust the sampling process to correct its validity?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Method #1: If there is no correlation between one record and the next, then reading from a random position and taking the next record after that should be fine.

      Method #2: If there are important correlations between one record and the next, then one way of dealing with that would be to reorder the entire file in random order. For instance, read the file once in order to count the number of records, N, and while you're at it, generate an array that has the offset to each record. Generate a random permutation of the integers from 1 to N. Read back through the file and pull out the records in that order, writing them to a new copy of the file. Now just use method #1 on the randomized version of the file.

      Is the file static, or is it changing a lot? If it's static, then method #2 should be fine. If it's changing all the time, and there are also correlations between successive records, then this becomes a more difficult problem. I think there are probably various ways to do it, but I suspect they all involve reinventing the wheel. Either you're going to reinvent filesystem-level support for random access to a file with varying record lengths, or you're going to reinvent a relational database. My suggestion would be to switch to a relational database. If that's not an option, and you really need to roll your own solution, then the optimal solution may depend on other details, e.g., do the changes to the file just involve steadily appending to it?

        1. There is no meaningful correlation in the ordering of the records.

          The problem with the picking random (byte) positions, is that with varible length records, longer records have a greater chance of being picked than shorter ones.

          But maybe that is negated to some extent because you would be using the next record--which might be longer or shorter--rather than the one picked?

        2. The file is static. It is only processed once.
        3. It is often huge. Time is of the essence.
        4. Reading the whole file to pick a sample negates the purpose of picking a sample.

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://814424]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others meditating upon the Monastery: (3)
As of 2024-05-25 05:49 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found