http://www.perlmonks.org?node_id=814474


in reply to Re^6: Random sampling a variable length file.
in thread Random sampling a variable record-length file.

You could try a generative approach to answering this question: generate random data, sample it with your method, and then scrutinize the samples to see whether they really are randomly selected or whether there appears to be a bias.
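Here is a rough sketch of what such an experiment might look like. It is in Python rather than Perl, purely for illustration. I am assuming the sampling method under discussion is "pick a uniformly random byte offset, skip the rest of the record it lands in, and take the next record", wrapping around at the end of the file. The names (make_lengths, sample_next_record, run_trial), the length range, and the choice of record length as the property to check are all made up for the example. Records are represented only by their lengths, which are drawn independently.

```python
import bisect
import random
from itertools import accumulate


def make_lengths(n_records, rng):
    # Hypothetical test data: independent record lengths of 1..200 bytes.
    return [rng.randint(1, 200) for _ in range(n_records)]


def sample_next_record(starts, total_bytes, rng):
    # Assumed method: land on a uniform byte offset, then return the index
    # of the record FOLLOWING the one we landed in (wrapping at the end).
    offset = rng.randrange(total_bytes)
    hit = bisect.bisect_right(starts, offset) - 1  # record containing offset
    return (hit + 1) % len(starts)


def run_trial(n_records=2000, n_draws=200_000, seed=1):
    rng = random.Random(seed)
    lengths = make_lengths(n_records, rng)
    total_bytes = sum(lengths)
    # Byte offset at which each record starts.
    starts = [0] + list(accumulate(lengths))[:-1]

    picked = [sample_next_record(starts, total_bytes, rng)
              for _ in range(n_draws)]

    mean_all = total_bytes / n_records
    # Mean length of the records the method actually returned.
    mean_sampled = sum(lengths[i] for i in picked) / n_draws
    # Mean length of the records the offsets landed in (the predecessors).
    mean_preceding = sum(lengths[i - 1] for i in picked) / n_draws
    return mean_all, mean_sampled, mean_preceding


if __name__ == "__main__":
    mean_all, mean_sampled, mean_preceding = run_trial()
    print(f"mean length, all records:       {mean_all:.2f}")
    print(f"mean length, sampled records:   {mean_sampled:.2f}")
    print(f"mean length, preceding records: {mean_preceding:.2f}")
```

If the method is unbiased for this data, the sampled mean should sit close to the overall mean, while the mean for the preceding records should come out clearly larger, since those are the ones being hit in proportion to their length. To probe the other case, replace make_lengths with a generator whose adjacent lengths are correlated and see whether the sampled mean drifts.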

My intuition says that if there is no correlation between the lengths of adjacent records, then it doesn't matter that you preferentially select records that follow long records, because following a long record doesn't correlate with anything. Put another way, if every record has an equal chance of following a long record (or, more generally, any other particular record), then the sampling method is as valid as any other.
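That intuition can be written down, under the same assumption about the method (a uniformly random byte offset, then take the following record, wrapping from the last record to the first). The symbols are mine: the file holds records r_1..r_n with byte lengths \ell_1..\ell_n, L is the total size, f is whatever per-record property you want to estimate, and bars denote averages over the file.

```latex
\[
  P(\text{pick } r_{i+1}) = \frac{\ell_i}{L},
  \qquad L = \sum_{i=1}^{n} \ell_i = n\,\bar{\ell}
\]
\[
  \mathbb{E}\bigl[f(\text{sample})\bigr]
  = \sum_{i=1}^{n} \frac{\ell_i}{L}\, f(r_{i+1})
  = \bar{f} + \frac{\operatorname{Cov}\bigl(\ell_i,\, f(r_{i+1})\bigr)}{\bar{\ell}}
\]
```

Here Cov is the covariance, taken over the file, between a record's length and the property of the record after it. If that covariance is zero, the expected value of the sample matches the file-wide average of f. Individual records are still not equally likely to be picked, but that inequality is not tied to anything you are measuring.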