Bag uniform distribution algorithms

by davido (Archbishop)
on Apr 25, 2013 at 15:01 UTC
davido has asked for the wisdom of the Perl Monks concerning the following question:

Consider an arbitrary number of lists of arbitrary size:

qw( A A A A ) qw( B B ) qw( C C C ) qw( D )

As a "bag" this could be expressed as { A => 4, B => 2, C => 3, D => 1 }.

The total number of elements from all lists (or all bag elements) is ten (but the general problem could have any result-set size, based on the inputs). Zip (or distribute) all bags into a single list where each bag is as uniformly distributed as possible. Many problems will have more than one solution, but any single "as optimal as possible" solution is adequate. The above bags could distribute like this:

qw( A C B A D C A B C A ) # I think...

One solution I considered was to convert each list into an iterator that emits its contents at the list's own frequency, separated by undef padding, and then zip the iterators. Each iterator would need to know its priority (its relative list size) and use that priority to pad its starting point with undef. Something like this:

# Sublists generated by an iterator.
my @p1 = ( 'A',   undef, undef, 'A',   undef, undef, 'A',   undef, undef, 'A'   );
my @p2 = ( undef, 'C',   undef, undef, undef, 'C',   undef, undef, undef, 'C'   );
my @p3 = ( undef, undef, 'B',   undef, undef, undef, undef, 'B',   undef, undef );
my @p4 = ( undef, undef, undef, 'D',   undef, undef, undef, undef, undef, undef );

# Now zip them up.
my @solution = grep { defined } zip @p1, @p2, @p3, @p4;
print "@solution\n";    # Produces A C B A D C A B A C
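For concreteness, here is a minimal sketch of a generator that could produce those padded sub-lists. The spacing rule (centring each copy in its ideal interval via int(($i + 0.5) * $total / $count)) and the padded_stream helper are assumptions for illustration, not a worked-out design, and the zip step is done with a plain loop rather than List::MoreUtils' prototyped zip:

use strict;
use warnings;

# Hypothetical helper: build one undef-padded stream per bag entry.
# Each of the $count copies of $sym lands in the slot nearest the
# centre of its ideal interval (assumed spacing rule).
sub padded_stream {
    my ( $sym, $count, $total ) = @_;
    my @s = (undef) x $total;
    $s[ int( ( $_ + 0.5 ) * $total / $count ) ] = $sym
        for 0 .. $count - 1;
    return \@s;
}

my %bag   = ( A => 4, B => 2, C => 3, D => 1 );
my $total = 0;
$total += $_ for values %bag;

# Largest bag gets first claim on each contested slot.
my @streams = map { padded_stream( $_, $bag{$_}, $total ) }
              sort { $bag{$b} <=> $bag{$a} } keys %bag;

# "Zip" by walking the slots and keeping whatever is defined.
my @solution;
for my $i ( 0 .. $total - 1 ) {
    for my $s (@streams) {
        push @solution, $s->[$i] if defined $s->[$i];
    }
}
print "@solution\n";

With the example bag this prints A C B A C D A B A C, another near-optimal ordering.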

This isn't exactly the same as the sample output I sought, but I think it's an equally optimal output, so it should be fine. One problem, though, is that we would need to sort the bags by size so that the largest always picks first, and so on. Why am I hooked on the idea of an iterator generating each sub-list as a frequency interspersed with undef? Because it seems to me a more general approach, allowing for the possibility of infinite streams, which could be useful for task scheduling by priority.

So what is my question? Several:

  • What class of problem is this? I thought maybe bin packing, but the bins are size one, and bin packing doesn't concern itself with uniformity of frequency.
  • What algorithm would produce one of several optimal solutions with the best time and space complexity?
  • Is there a generalized solution that minimizes error if the closed list is converted to an infinite list? (i.e., instead of producing a result set of ten elements, produce a stream of uniform distribution based on the frequencies)
  • What would a Perlish solution actually look like?

Nothing here is mission critical; I was recently reading a discussion and started thinking... we're just satisfying curiosity. Last night I spent some time reading the docs for List::Gen, which may actually be useful here since it facilitates lazy generation, but I couldn't quite get my head around a List::Gen approach.


Dave

Re: Bag uniform distribution algorithms
by BrowserUk (Pope) on Apr 25, 2013 at 16:45 UTC

    You could try something like this:

    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];

    sub gen {
        my $href = shift;
        my @kByV = sort { $href->{ $b } <=> $href->{ $a } } keys %$href;
        my @dist = ( $kByV[0] ) x $href->{ $kByV[0] };
        shift @kByV;
        while( @kByV ) {
            my $k = shift @kByV;
            my $v = $href->{ $k };
            my $n = int( @dist / ( $v + 1 ) );
            my $p = $n * $v;
            splice( @dist, $p, 0, $k ), $p -= $n for reverse 1 .. $v;
        }
        return \@dist;
    }

    pp gen( { A => 4, B => 2, C => 3, D => 1 } );
    pp gen( { A => 5, B => 4, C => 3, D => 2, E => 1 } );

    Outputs:

    C:\test>1030688.pl
    ["A", "C", "B", "A", "D", "C", "B", "A", "C", "A"]
    ["A", "B", "C", "A", "D", "B", "C", "E", "A", "B", "D", "C", "A", "B", "A"]

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Bag uniform distribution algorithms
by kennethk (Monsignor) on Apr 25, 2013 at 20:41 UTC
    After mulling on this one for a while, in particular thinking about how to track when a letter was due to be output again and how to maintain a distribution, I think I have something good. Compute a rolling score for each element: each score accumulates in proportion to its element's probability of output and gets dinged when its element is selected; the maximum score is always chosen next.
    use strict;
    use warnings;
    use List::Util 'sum';
    use Data::Dump qw[ pp ];

    sub gen {
        my $href  = shift;
        my %score = %$href;
        my $norm  = sum values %score;
        my @series;
        for (1 .. $norm) {
            my ($max, $elem) = 0;
            for (keys %score) {
                ($max, $elem) = ($score{$_}, $_) if $score{$_} >= $max;
                $score{$_} += $href->{$_};
            }
            push @series, $elem;
            $score{$elem} -= $norm;
        }
        return \@series;
    }

    pp gen( { A => 4, B => 2, C => 3, D => 1 } );
    pp gen( { A => 5, B => 4, C => 3, D => 2, E => 1 } );
    outputs
    ["A", "C", "B", "A", "D", "C", "A", "B", "C", "A"] ["A" .. "D", "A", "B", "E", "A", "C", "B", "A", "D", "C", "B", "A"]

    It also naturally extends to infinite series and non-integer element counts. The infinite-series case is why the >= is needed in the score comparison; alternatively, you could initialize $max to something negative, since the sum of all scores is necessarily zero at all times. O(N*M) complexity, where N is the length of the desired series and M is the number of distinct elements.
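    To illustrate the infinite-series case, here is a closure-based sketch of the same rolling-score rule (my adaptation: make_stream is a hypothetical name, and the negative $max floor stands in for the >= trick described above):

    use strict;
    use warnings;
    use List::Util 'sum';

    # Hypothetical wrapper: returns an iterator that emits the
    # rolling-score series indefinitely instead of stopping at $norm.
    sub make_stream {
        my %freq  = %{ shift() };
        my %score = %freq;
        my $norm  = sum values %freq;
        return sub {
            my ( $max, $elem ) = ( -$norm );   # safe floor: some score is >= 0
            for ( keys %score ) {
                ( $max, $elem ) = ( $score{$_}, $_ ) if $score{$_} >= $max;
                $score{$_} += $freq{$_};
            }
            $score{$elem} -= $norm;
            return $elem;
        };
    }

    my $next = make_stream( { A => 4, B => 2, C => 3, D => 1 } );
    print $next->() for 1 .. 20;    # e.g. ACBADCABCAACBADCABCA
    print "\n";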

    This approach also lends itself to designing a metric for how smooth a series is:

    use strict;
    use warnings;
    use List::Util 'sum';

    sub measure {
        my @series = @_;
        my %count;
        $count{$_}++ for @series;
        my $norm = sum values %count;
        $_ /= $norm for values %count;
        my %score  = %count;
        my $metric = 0;
        for my $elem (@series) {
            $score{$elem} -= 1;
            for (keys %score) {
                $metric += $score{$_}**2;
                $score{$_} += $count{$_};
            }
        }
        return $metric / @series;
    }

    print measure("A", "C", "B", "A", "D", "C", "A", "B", "C", "A"), "\n";
    print measure("A", "C", "B", "A", "C", "D", "A", "B", "C", "A"), "\n";
    print measure("A", "A", "A", "A", "B", "B", "C", "C", "C", "D"), "\n";
    outputs
    0.37
    0.37
    3.45

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: Bag uniform distribution algorithms
by BrowserUk (Pope) on Apr 25, 2013 at 21:17 UTC
    Is there a generalized solution that minimizes error if the closed list is converted to an infinite list?

    Given the nature of the input, how are you seeking to convert that to a specification of an infinite list?

    What I mean to say is that there is a fundamental conflict between "uniform distribution" and a variable length list.

    Using your example input: until the list reaches a length of 10, adding an 'e' will mean that 'e's are over-represented; but waiting until the 10th take to add the 'e' means that, if the list stops there, the 'e's aren't "uniformly distributed". At least in as much as your post implies a uniform distribution whereby, intuitively, a single letter should appear somewhere close to the middle of the list. There is no way to maintain that definition of "uniform distribution" whilst generating a list one element at a time. (Not even if you knew the final target length up front.) You would, and at best could only, achieve that definition of uniform distribution every M elements, where M is the sum of the frequencies.

    If that is acceptable, you might generate a single natural-length, uniformly distributed list internally, and then return it one element at a time, cyclically. The distribution will only be perfect every M takes, but it will never be grossly wrong, which meets the "minimizes error" requirement.
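    A minimal sketch of that cyclic scheme, assuming the gen() sub from the earlier reply is in scope (make_cyclic is just an illustrative name):

    # Reuse gen() from above to build one natural-length distribution,
    # then hand it out one element at a time, cyclically.
    sub make_cyclic {
        my $dist = gen( shift );
        my $i    = 0;
        return sub { $dist->[ $i++ % @$dist ] };
    }

    my $next = make_cyclic( { A => 4, B => 2, C => 3, D => 1 } );
    print $next->() for 1 .. 25;    # ACBADCBACA repeated two and a half times
    print "\n";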


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Given the nature of the input, how are you seeking to convert that to a specification of an infinite list?

      What I mean to say is that there is a fundamental conflict between "uniform distribution" and a variable length list.

      This is also what came to my mind when I read the specification.

      The problem is somewhat similar to data compression algorithms: some work on complete files and thus can do a full statistical analysis of the data before they start encoding, while others have to work on the fly, with data arriving over a network, for example.

      I guess one way to do that is to use a sliding-window mechanism: you reorganize data within a sliding window of a certain size, but whatever has left the window can no longer be rearranged against newly arriving data. The final result is usually not as good as if the full data had been available from the onset, but a heuristic can still get relatively close to optimal (i.e., relatively similar to what a perfect algorithm would have done with prior knowledge of the full data set). This works for most usual cases, but it is probably also possible to manufacture a deviant data set where the heuristic fails to produce good results, just as, for any given compression algorithm, it is almost always possible to produce data whose compressed form takes more space than the original, unless the algorithm has an "oops, back to the original data" clause. And of course the size of the window may have a considerable effect on how well the heuristic does. I suspect only actual tests with real data can answer that; it does not look like a formal analysis could, unless perhaps we had in-depth knowledge of the incoming data. A rough sketch of the idea follows.
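      This is my own toy heuristic for illustration, not a known algorithm: buffer up to $w incoming items and always emit the buffered symbol that was output least recently.

      use strict;
      use warnings;

      # Toy sliding-window smoother: only the $w buffered items can be
      # reordered; everything already emitted is fixed.
      sub smooth_window {
          my ( $w, @in ) = @_;
          my ( @buf, @out, %last );
          my $t = 0;
          while ( @in or @buf ) {
              push @buf, shift @in while @in and @buf < $w;
              # emit the buffered symbol whose last emission is oldest
              # (never-emitted symbols count as oldest of all)
              my ($best) = sort {
                  ( $last{$a} // -1 ) <=> ( $last{$b} // -1 )
              } @buf;
              my ($i) = grep { $buf[$_] eq $best } 0 .. $#buf;
              splice @buf, $i, 1;
              push @out, $best;
              $last{$best} = $t++;
          }
          return @out;
      }

      print join( " ", smooth_window( 4, qw( A A A A B B C C C D ) ) ), "\n";
      # A B A C B A D C A C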

      In short, yes. It's incumbent on the "user" either to know that if there are 20 items being distributed, a fair distribution can only occur at ( $n % 20 ) == 0, or to be OK with modulo bias. Likewise, in the case of an infinite stream, the user should either draw multiples of the size of the input lists, or be OK with the fact that as $n approaches infinity, modulo bias fades into irrelevancy.

      I'm also assuming that the input lists are finite in size, so the frequency can be known.


      Dave

Re: Bag uniform distribution algorithms
by Laurent_R (Vicar) on Apr 25, 2013 at 21:23 UTC

    What class of problem is this? I thought maybe bin packing, but the bins are size one, and bin packing doesn't concern itself with uniformity of frequency.

    Maybe it could be classified as a special case of the "partition problem" or "treasure partitioning problem", although there are some significant differences from the classical description of the partition problem.

Re: Bag uniform distribution algorithms
by kcott (Abbot) on Apr 26, 2013 at 09:02 UTC

    G'day Dave,

    By alternately splicing elements from either end of the distribution, I get your original 'qw( A C B A D C A B C A )'.

    $ perl -Mstrict -Mwarnings -E '
        my %bag = ( A => 4, B => 2, C => 3, D => 1 );
        my @distribution;
        for my $key (sort { $bag{$b} <=> $bag{$a} } keys %bag) {
            my $base_offset = int(@distribution / ($bag{$key} + 1));
            my $offset = $base_offset;
            for (1 .. $bag{$key}) {
                next unless $_ % 2;
                splice @distribution, $offset, 0, $key;
                if ($_ < $bag{$key}) {
                    splice @distribution, -$offset, 0, $key;
                }
                $offset += $base_offset + 1;
            }
        }
        say "@distribution";
    '
    A C B A D C A B C A

    If I round up the $base_offset value, the middle two elements are reversed but everything else remains the same. I don't see this as being more or less uniform but maybe it's more correct.

    $ perl -Mstrict -Mwarnings -E '
        my %bag = ( A => 4, B => 2, C => 3, D => 1 );
        my @distribution;
        for my $key (sort { $bag{$b} <=> $bag{$a} } keys %bag) {
            my $base_offset = int(@distribution / ($bag{$key} + 1) + 0.5);
            my $offset = $base_offset;
            for (1 .. $bag{$key}) {
                next unless $_ % 2;
                splice @distribution, $offset, 0, $key;
                if ($_ < $bag{$key}) {
                    splice @distribution, -$offset, 0, $key;
                }
                $offset += $base_offset + 1;
            }
        }
        say "@distribution";
    '
    A C B A C D A B C A

    I haven't done extensive testing on this solution. Beyond your sample input, I tried:

    my %bag = ( A => 4, B => 2, C => 3, D => 1, Z => 1 );

    gives: A C B A C D Z A B C A; and

    my %bag = ( A => 4, B => 2, C => 3, D => 1, Y => 5, Z => 1 );

    gives: Y A C Y B A C Z D Y A B Y C A Y; and

    my %bag = ( A => 4, B => 2, C => 3, D => 1, X => 6, Y => 5, Z => 1 );

    gives: X Y A X C Y B A X Y Z D C X A B Y C X A Y X; and

    my %bag = ( A => 4, B => 2, C => 3, D => 1, W => 4, X => 6, Y => 5, Z => 1 );

    gives: X Y A W X C Y B A W X Y Z D C X W A B Y C X W A Y X

    The above 4 tests all included rounding; without rounding, the results are:

    A C B A Z D C A B C A
    Y A C Y B A C D Z Y A B Y C A Y
    X Y A C X Y B A C X D Z Y X A B Y X C A Y X
    X Y A W C X Y B A W C X D Z Y X W A B Y X C W A Y X

    -- Ken

Re: Bag uniform distribution algorithms
by hdb (Parson) on Apr 26, 2013 at 09:16 UTC

    Evaluation is key for this question. In addition to kennethk's score function, I think the cumulative variance of positions is an appropriate measure. For the variance, larger is better.

    use strict;
    use warnings;
    use Math::Round;
    use Statistics::Basic qw(:all);
    use List::Util 'sum';

    sub dispersion {
        my %p;
        my $pos = 0;
        push @{ $p{$_} }, $pos++ while $_ = shift;
        my $var = 0;
        $var += variance( $p{$_} ) for ( keys %p );
        return $var;
    }

    sub measure {
        my @series = @_;
        my %count;
        $count{$_}++ for @series;
        my $norm = sum values %count;
        $_ /= $norm for values %count;
        my %score  = %count;
        my $metric = 0;
        for my $elem (@series) {
            $score{$elem} -= 1;
            for (keys %score) {
                $metric += $score{$_}**2;
                $score{$_} += $count{$_};
            }
        }
        return $metric / @series;
    }
    sub evaluate {
        my $cand = shift;
        my $bags = shift;
        printf "\n%20s %15s %15s %s\n",
            "Monk", "Better:higher", "Better:lower", "Result";
        print "------------------------------------------------------------\n\n";
        for my $bag (@$bags) {
            print "bag: ";
            print join ", ", map { +"$_ => " . $bag->{$_} } sort keys %$bag;
            print "\n\n";
            for my $monk ( sort keys %$cand ) {
                my $result = $cand->{$monk}->($bag);
                my $disp   = dispersion( @$result );
                my $score  = measure( @$result );
                printf "%20s %15.2f %15.2f %s\n",
                    $monk, $disp, $score, join "", @$result;
            }
            print "\n";
        }
    }

    my $bags = [
        { A => 4, B => 2, C => 3, D => 1 },
        { A => 4, B => 2, C => 3, D => 1, F => 2, G => 4, H => 2, I => 30 },
        { A => 10, B => 10, C => 10 },
    ];

    my $candidates = {    # each sub expects a hash ref and returns an array ref
        kennethk  => \&gen,
        BrowserUK => \&genUK,
        kcott     => \&kcott,
        hdb       => \&mix,
    };

    evaluate( $candidates, $bags );

    which gives the following output

Re: Bag uniform distribution algorithms
by LanX (Canon) on Oct 27, 2013 at 21:38 UTC
    I'm not sure how "as uniformly distributed as possible" can be quantified ...

    Do you have a test-code to check the "quality" of a solution?

    But sorting according to a weighting function gives results similar to those you showed.

    DB<171> %h = ( A => 4, B => 2, C => 3, D => 1 )
    => ("A", 4, "B", 2, "C", 3, "D", 1)

    DB<172> $sum = 0; $sum += $_ for values %h;
    => ""

    DB<173> @list = map { my ($k,$v) = ($_,$h{$_}); my $int = $sum/$v; map { [ $k => $int*($_-.5) ] } 1..$v } keys %h
    => (
         ["A", "1.25"],
         ["A", "3.75"],
         ["A", "6.25"],
         ["A", "8.75"],
         ["D", 5],
         ["C", "1.66666666666667"],
         ["C", 5],
         ["C", "8.33333333333333"],
         ["B", "2.5"],
         ["B", "7.5"],
       )

    DB<174> map { $_->[0] } sort { $a->[1] <=> $b->[1] or $a->[0] cmp $b->[0] } @list
    => ("A", "C", "B", "A", "C", "D", "A", "B", "C", "A")

    Changing the weighting function would also make it possible to repeat the pattern in such a way that concatenated sequences remain evenly distributed (that is, an A never neighbors another A).

    An iterator version shouldn't be too difficult.
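    For what it's worth, a sketch of such an iterator, closing over the sorted weight list from the debugger session above (make_iter is an illustrative name, and the cycling past the end of the list is my own addition):

    use strict;
    use warnings;
    use List::Util qw( sum );

    # Precompute the (symbol, weight) pairs once, sort them as above,
    # and close over an index that cycles through the result.
    sub make_iter {
        my %h    = %{ shift() };
        my $sum  = sum values %h;
        my @list = sort { $a->[1] <=> $b->[1] or $a->[0] cmp $b->[0] }
                   map {
                       my ( $k, $v ) = ( $_, $h{$_} );
                       map { [ $k => $sum / $v * ( $_ - .5 ) ] } 1 .. $v;
                   } keys %h;
        my $i = 0;
        return sub { $list[ $i++ % @list ][0] };
    }

    my $next = make_iter( { A => 4, B => 2, C => 3, D => 1 } );
    print $next->() for 1 .. 10;    # ACBACDABCA
    print "\n";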

    Cheers Rolf

    ( addicted to the Perl Programming Language)

    update

    code simplified

    update

    well, on second thought it's quite easy to find input where this approach fails ... never mind! :(

Re: Bag uniform distribution algorithms
by LanX (Canon) on Oct 28, 2013 at 17:09 UTC
    I'd say this is a (discrete) optimization problem, whose kind depends on the distance function to be minimized (i.e. the distance from an idealized optimal solution).

    One possibility is to measure the integer distances between occurrences of each bag's elements and sum the deltas from the idealized real-number distance. You can also sum the squared deltas and take the square root (Norm_(mathematics)#Euclidean_norm). Which "norm" to take depends on your intuitive understanding of "uniformity".
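    A sketch of such a distance function under my reading of that description (the wrap-around gaps are my assumption, so a lone element is never penalized for its absolute position):

    use strict;
    use warnings;

    # Euclidean-norm distance: for each symbol, compare the gaps between
    # consecutive occurrences (wrapping around the end) with the ideal
    # real-valued gap N/count, and accumulate the squared deltas.
    sub distance {
        my @series = @_;
        my %pos;
        push @{ $pos{ $series[$_] } }, $_ for 0 .. $#series;
        my $sq = 0;
        for my $sym ( keys %pos ) {
            my @p     = @{ $pos{$sym} };
            my $ideal = @series / @p;
            for my $i ( 0 .. $#p ) {
                my $gap = $i < $#p
                    ? $p[ $i + 1 ] - $p[$i]
                    : $p[0] + @series - $p[-1];    # wrap-around gap
                $sq += ( $gap - $ideal )**2;
            }
        }
        return sqrt $sq;
    }

    # Smaller means more uniform:
    printf "%.3f\n", distance( qw( A C B A D C A B C A ) );
    printf "%.3f\n", distance( qw( A A A A B B C C C D ) );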

    I think you can get very good results with heuristic approaches involving some random elements, but without a guarantee of optimality.

    Maybe of interest: the following algorithm will calculate all >12000 arrangements of your bags; you can use the output to test different distance functions (or norms) to refine your understanding of "uniformly distributed".

    Please note the flag '$MODULO_ROTATION', which limits output to the subset of solutions that can generate all the others by rotation; this might facilitate calculating the distance.

    use v5.10.0;
    use warnings;
    use strict;
    use Data::Dump qw/pp/;

    #my @sets = ( ["A".."C"], [("-") x 3] );
    my ( @path, @results );
    my %bag  = ( A => 4, B => 2, C => 3, D => 1 );
    my @sets = ();

    my $MODULO_ROTATION = 1;
    if ($MODULO_ROTATION) {
        delete $bag{D};
        push @path, "D";
    }

    push @sets, [ ($_) x $bag{$_} ] for keys %bag;

    sub branch {
        my $done = 1;
        for my $set (@sets) {
            if (@$set) {
                $done = 0;
                push @path, shift @$set;
                branch();
                unshift @$set, pop @path;
            }
        }
        if ($done) {
            push @results, join "", @path;
        }
    }

    branch();
    pp \@results;

    Of course, you could already combine this slow branching approach with a distance function that avoids walking through inefficient sub-trees, giving a branch-and-bound solution (i.e. bound whenever the distance so far already exceeds the best minimum found so far), as sketched below.
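    A sketch of that bound grafted onto the same idea. The names (bnb) and the cost function (squared deviation of each realized gap from the ideal gap, with first placements free) are my own choices for illustration, not a worked-out metric:

    use strict;
    use warnings;

    my %bag = ( A => 4, B => 2, C => 3, D => 1 );
    my $N   = 0;
    $N += $_ for values %bag;

    my %left = %bag;    # copies still to place
    my %last;           # position of each symbol's latest placement
    my @path;
    my ( $best_cost, @best ) = ( 9**9**9 );    # effectively +infinity

    sub bnb {
        my $cost = shift;
        return if $cost >= $best_cost;    # bound: this subtree cannot improve
        if ( @path == $N ) {
            ( $best_cost, @best ) = ( $cost, @path );
            return;
        }
        for my $sym ( sort keys %left ) {
            next unless $left{$sym};
            my $ideal = $N / $bag{$sym};
            my $step  = defined $last{$sym}
                ? ( @path - $last{$sym} - $ideal )**2
                : 0;    # first placement is free (assumed)
            my $save = $last{$sym};
            $left{$sym}--;
            push @path, $sym;
            $last{$sym} = $#path;
            bnb( $cost + $step );
            $last{$sym} = $save;
            pop @path;
            $left{$sym}++;
        }
    }

    bnb( 0 );
    print "@best\n";    # one minimal-cost ordering under this cost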

    But I doubt you would want to use this in practice...

    Cheers Rolf

    ( addicted to the Perl Programming Language)

      This is a nice path to follow. Thanks! :)


      Dave
