PerlMonks  

Re^9: Random sampling a variable length file.

by bellaire (Hermit)
on Dec 27, 2009 at 13:35 UTC (#814511)


in reply to Re^8: Random sampling a variable length file.
in thread Random sampling a variable record-length file.

Not sure. You'd probably need some way to estimate whether your sampling distribution is uniform with respect to the index of the sample. Also, you could see whether the average length of the records in your sample jibes with the average length of records in the entire population.
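One way such a check might look, as a rough sketch only: it assumes you are validating against a test file small enough to scan fully, so that every record's length and the index of each sampled record are known. The names `uniformity_report`, `lengths`, `sampled_indices` and `buckets` are made up for illustration, and Python is used here purely for brevity.

```python
from collections import Counter
from statistics import mean

def uniformity_report(lengths, sampled_indices, buckets=10):
    """Compare a sample against the population it was drawn from.

    lengths         -- length of every record in the file, in file order
    sampled_indices -- record indices picked by the sampler under test
    buckets         -- number of equal-width index ranges to tally into
    """
    n = len(lengths)
    # How many records fall into each index bucket in the whole file.
    bucket_sizes = Counter(i * buckets // n for i in range(n))
    # How many sampled records fall into each index bucket.
    hits = Counter(i * buckets // n for i in sampled_indices)
    # Pearson chi-square statistic against a uniform-by-index expectation.
    chi2 = 0.0
    for b, members in bucket_sizes.items():
        expected = len(sampled_indices) * members / n
        chi2 += (hits.get(b, 0) - expected) ** 2 / expected
    return {
        "chi2": chi2,
        "sample_mean_len": mean(lengths[i] for i in sampled_indices),
        "population_mean_len": mean(lengths),
    }
```

A chi-square value far above the number of buckets, or a sample mean length well away from the population mean, would be a hint (not proof) that the sampler favours some part of the file or some record sizes.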

My other thoughts overnight had to do with the pathological case presented by bobf:

  • To avoid the scenario where you pick the same record 90% of the time if one record is 90% of the file, you need to avoid already-selected records.
  • To give the large record itself a fair chance of being selected, you need to perform the wrapping suggested by bcrowell2, that is, selecting the first record if you land inside the last.

Taken together, these make even the extreme case just as amenable to this method as any other. If you remember which records you've hit and do not re-sample them, you're simply omitting a segment of the number line from a uniform distribution. The distributions on either side are still uniform, i.e., random.

So even if you are hitting the big record 90% of the time, you ignore it after the first time, and then the other 10% of the hits select records as normal. Since any record at all can follow the 90% length record, that's fair. And since the length of the last record has nothing to do with the length of the first, the first record has the same likelihood of being selected as any other record.
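The two points above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thread's actual code: records are newline-terminated, the file is opened in binary mode, a record is identified by the byte offset where it starts, and `k` does not exceed the number of records (otherwise the loop never finishes). The names `sample_records`, `fh`, `size` and `k` are hypothetical, and Python is used purely for brevity.

```python
import io
import random

def sample_records(fh, size, k, rng=random):
    """Pick k distinct records from a seekable binary file of `size` bytes."""
    seen = set()    # start offsets of records already selected
    picks = []
    while len(picks) < k:
        fh.seek(rng.randrange(size))   # land on a random byte
        fh.readline()                  # discard the rest of the record we landed in
        start = fh.tell()              # start offset of the following record
        if start >= size:              # we landed inside the last record:
            start = 0                  # wrap around and take the first one
            fh.seek(0)
        if start in seen:              # already selected: ignore this hit
            continue
        seen.add(start)
        picks.append((start, fh.readline().rstrip(b"\n")))
    return picks

# Small demonstration on an in-memory file where one record dominates.
data = b"a\n" + b"b" * 90 + b"\n" + b"cc\n" + b"ddd\n"
demo = sample_records(io.BytesIO(data), len(data), 2)
```

Hits that land in the big record after the record following it has been chosen are simply thrown away and redrawn, which is the "omit a segment of the number line" idea; the wrap branch is the only way the first record can ever be chosen.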

Replies are listed 'Best First'.
Re^10: Random sampling a variable length file.
by BrowserUk (Pope) on Dec 27, 2009 at 15:53 UTC
    Taken together, these make even the extreme case just as amenable to this method as any other. If you remember which records you've hit and do not re-sample them, you're simply omitting a segment of the number line from a uniform distribution. The distributions on either side are still uniform, i.e., random.

    Thank you again! That makes a great deal of sense.

    My first reaction was that remembering whether I had already picked a record was an awkward prospect, given that I only have the byte position and no knowledge of how long the record is. Then it dawned on me that querying the file offset once I've read the partial record makes for a perfect signature.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
