Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
Not sure. You'd probably need some way to estimate whether your sampling distribution is uniform with respect to the index of the sample. Also, you could see whether the average length of the records in your sample jibes with the average length of records in the entire population.

My other thoughts overnight had to do with the pathological case presented by bobf:

  • To avoid the scenario where you pick the same record 90% of the time if one record is 90% of the file, you need to avoid already-selected records.
  • To give the large record itself a fair chance of being selected, you need to perform the wrapping suggested by bcrowell2, that is, selecting the first record if you land inside the last.

Taken together, these make even the extreme case just as amenable to this method as any other. If you remember which records you've hit and do not re-sample them, you're simply omitting a segment of the number line from a uniform distribution. The distributions on either side are still uniform, i.e., random.

So even if you are hitting the big record 90% of the time, you ignore it after the first time, and then other 10% of the hits select records as normal. Since any record at all can follow the 90% length record, that's fair. And since the length of the last record has nothing to do with the length of the first, it has same same likelihood of being selected as any record.

In reply to Re^9: Random sampling a variable length file. by bellaire
in thread Random sampling a variable record-length file. by BrowserUk

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others chilling in the Monastery: (4)
    As of 2019-11-13 07:13 GMT
    Find Nodes?
      Voting Booth?
      Strict and warnings: which comes first?

      Results (69 votes). Check out past polls.