http://www.perlmonks.org?node_id=1030734


in reply to Bag uniform distribution algorithms

Is there a generalized solution that minimizes error if the closed list is converted to an infinite list?

Given the nature of the input, how are you seeking to convert that to a specification of an infinite list?

What I mean to say is that there is a fundamental conflict between "uniform distribution" and a variable length list.

Using your example input, until the list reaches a length of 10, adding an 'e' means that 'e's are over-represented; but waiting until the 10th take to add the 'e' means that if the list stops there, the 'e's aren't "uniformly distributed". At least, not in as much as your post implies a uniform distribution in which, intuitively, a single letter should appear somewhere close to the middle of the list. There is no way to maintain that definition of "uniform distribution" whilst generating a list one element at a time. (Not even if you knew the final target length up front.) At best, you could achieve that definition of uniform distribution only once every M elements, where M == sum(f0 .. fn), the sum of the input frequencies.

If that is acceptable, you might generate a single natural length, uniformly distributed list internally, and then return that one element at a time, cyclically. The distribution will only be perfect every M takes, but it will never be grossly wrong, which meets the "minimizes error" requirement.
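
A minimal sketch of that idea, under stated assumptions: the %freq bag below is hypothetical (it is not the original poster's data), and natural_list and make_taker are illustrative names. Each item's occurrences are placed at the centres of equal segments of the M-element list, so a single-occurrence item lands near the middle; this is one plausible way to build the internal list, not the only one.

```perl
use strict;
use warnings;

# Hypothetical bag: item => frequency (an assumption for illustration).
my %freq = ( a => 4, b => 3, c => 2, e => 1 );

# Build one "natural length" list of M = sum of frequencies elements.
sub natural_list {
    my (%f) = @_;
    my $m = 0;
    $m += $_ for values %f;
    my @slots;
    for my $item ( keys %f ) {
        my $n = $f{$item};
        # Ideal positions: the centres of $n equal segments of [0, M).
        push @slots, [ ( $_ + 0.5 ) * $m / $n, $item ] for 0 .. $n - 1;
    }
    # Order by ideal position; break ties by name so the result is stable.
    return map { $_->[1] }
           sort { $a->[0] <=> $b->[0] or $a->[1] cmp $b->[1] } @slots;
}

# Return an iterator that hands the list out cyclically, one element per call.
sub make_taker {
    my @cycle = @_;
    my $i = 0;
    return sub { $cycle[ $i++ % @cycle ] };
}

my @list = natural_list(%freq);
my $take = make_taker(@list);

# Two full cycles: the counts are exact after each block of M takes.
print join( ' ', map { $take->() } 1 .. 2 * @list ), "\n";
```

Between multiples of M the counts drift from the ideal, but never by much, because every prefix of the cycle is already roughly balanced.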


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: Bag uniform distribution algorithms
by Laurent_R (Canon) on Apr 25, 2013 at 22:05 UTC

    Given the nature of the input, how are you seeking to convert that to a specification of an infinite list?

    What I mean to say is that there is a fundamental conflict between "uniform distribution" and a variable length list.

    This is also what came to my mind when I read the specification.

    The problem is somewhat similar to data compression algorithms: some work on complete files and can therefore make a full statistical analysis of the data before they start encoding, while others have to work on the fly, with data arriving over a network, for example.

    I guess one way to do that is to use a sliding window mechanism, i.e. you reorganize data within a sliding window of a certain size; whatever has left the window can no longer be optimized against the new data coming in. The final result is usually not as good as if the full data had been there from the outset, but you can still manage a heuristic that gets relatively close to optimal (i.e. relatively similar to what a perfect algorithm would have done with prior knowledge of the full data set).

    This can work in most usual cases, but it is probably also possible to manufacture a deviant data set for which the heuristic fails to produce good results (just as, given a compression algorithm, it is almost always possible to produce data for which the compressed result takes more space than the original, unless of course the algorithm has an "oops, back to the original data" clause). The size of the window might also have a considerable effect on how successful the heuristic is. I guess that only actual tests with real data can tell; it does not look as if a formal analysis can answer this question, unless possibly we have in-depth knowledge of the data coming in.

Re^2: Bag uniform distribution algorithms
by davido (Cardinal) on Apr 25, 2013 at 22:44 UTC

    In short, yes. It's incumbent on the "user" either to know that if there are 20 items being distributed, a fair distribution can only occur when ( $n % 20 ) == 0, or to be OK with modulo bias. Likewise, in the case of an infinite stream, the user should either draw multiples of the size of the input lists, or be OK with the fact that as $n approaches infinity, modulo bias fades into irrelevancy.
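
A rough way to see the modulo bias in numbers, assuming a hypothetical 10-element cycle (the @list below and the max_deviation helper are illustrative, not from the thread). It measures the largest gap between the observed count of any item and its ideal share after $n cyclic takes. The `//` operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical natural-length list (M = 10), assumed already evenly arranged.
my @list = qw(a b c a b e a c b a);

# Largest gap between observed and ideal counts after $n cyclic takes.
sub max_deviation {
    my ( $n, @cycle ) = @_;
    my $m = @cycle;
    my ( %per_cycle, %seen );
    $per_cycle{$_}++ for @cycle;
    $seen{ $cycle[ $_ % $m ] }++ for 0 .. $n - 1;
    my $worst = 0;
    for my $item ( keys %per_cycle ) {
        my $ideal = $n * $per_cycle{$item} / $m;
        my $diff  = abs( ( $seen{$item} // 0 ) - $ideal );
        $worst = $diff if $diff > $worst;
    }
    return $worst;
}

# Compare takes that are and are not multiples of M.
for my $n ( 5, 10, 15, 20, 1005 ) {
    my $dev = max_deviation( $n, @list );
    printf "n=%4d  max deviation=%.2f  relative=%.4f\n", $n, $dev, $dev / $n;
}
```

The absolute deviation drops to zero whenever ( $n % 10 ) == 0, and the relative deviation shrinks as $n grows, which is the "fades into irrelevancy" part.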

    I'm also assuming that the input lists are finite in size, so the frequency can be known.


    Dave