davis has asked for the wisdom of the Perl Monks concerning the following question:

My.... it's been a long time

I have a problem that I suspect has been well answered by algorithms, but the trouble is I don't have a name to put to it to search for the "best" algorithm. I suspect someone with a computer science background might Just Know(tm).

Simply put, I have a list of ~100 objects that differ wildly in sizes (they're actually subdirectories). I want to sort them into different bins (which represent the tape drives I intend to use to back them up). I've come up with the following na´ve code which works, but it breaks down a bit when the directory sizes vary more.

So, is this an NP-hard problem? Does it have a name? Is there a "better" way to solve it? I've tried the rather obvious sorting by size first.

#!/usr/bin/perl use warnings; use strict; use Data::Dumper; my %bins = ( A => 0, B => 0, C => 0, D => 0, ); foreach(<DATA>) { my $line = $_; chomp $line; my ($size, $name) = split /\s+/, $line; my $least_full_bin = "A"; foreach my $bin (keys %bins) { $least_full_bin = $bin if($bins{$bin} < $bins{$least_full_bin} +) } $bins{$least_full_bin} += $size; } print Dumper(\%bins); __DATA__ 2548780 foobarbaz 16913085 foobarbaz 1898507 foobarbaz 67964007 foobarbaz 50239619 foobarbaz 90904439 foobarbaz 91859405 foobarbaz 443250331 foobarbaz 271426911 foobarbaz 466223129 foobarbaz 1020605272 foobarbaz 576517351 foobarbaz 362218227 foobarbaz 597801541 foobarbaz 396406322 foobarbaz 493933490 foobarbaz 695851994 foobarbaz 1093525924 foobarbaz 3860084060 foobarbaz 1815001219 foobarbaz 1242364290 foobarbaz 167867012 foobarbaz 5201589904 foobarbaz 4229781128 foobarbaz 1805056667 foobarbaz 5139655485 foobarbaz 5202726574 foobarbaz 947237511 foobarbaz 210571932 foobarbaz 51518 foobarbaz 2737944 foobarbaz 2686687 foobarbaz 0 foobarbaz 3 foobarbaz 1 foobarbaz 1 foobarbaz 3 foobarbaz 3 foobarbaz 0 foobarbaz 0 foobarbaz 3848 foobarbaz 1 foobarbaz 6584 foobarbaz 8879 foobarbaz 1 foobarbaz 1 foobarbaz 2031703999 foobarbaz 25614 foobarbaz 1 foobarbaz 686994 foobarbaz 2 foobarbaz 32703 foobarbaz 135867 foobarbaz 7 foobarbaz 9 foobarbaz 3148830 foobarbaz 6803655 foobarbaz 147343006 foobarbaz 57013 foobarbaz 22 foobarbaz 655127653 foobarbaz 2034016 foobarbaz 1323038 foobarbaz 26 foobarbaz 9309654 foobarbaz 11 foobarbaz 472 foobarbaz 70134007 foobarbaz 1 foobarbaz 669792 foobarbaz 296399 foobarbaz 0 foobarbaz 37 foobarbaz 1775825 foobarbaz 6499972 foobarbaz 11766 foobarbaz 1392 foobarbaz 6344 foobarbaz 10290402 foobarbaz 8529 foobarbaz 0 foobarbaz 28232823 foobarbaz 796 foobarbaz


Replies are listed 'Best First'.
Re: Computer science Problem - single dimension bin packing
by ikegami (Pope) on Aug 14, 2014 at 16:14 UTC
    You actually used the name of the problem in the subject: Bin packing problem. Yes, it is NP-hard.
Re: Computer science Problem - single dimension bin packing
by AppleFritter (Vicar) on Aug 14, 2014 at 16:11 UTC

    This sounds like the knapsack problem problem to me, which is indeed NP-hard. If you search CPAN for "knapsack", there's a few modules that may be of help.

    That said, for many of these problems, it's only finding an optimal solution that's NP-hard (or worse). Finding a good solution may be much easier (speaking in terms of computational complexity), depending of course on the precise constraints on what constitutes a "good" solution.

      The knapsack problem is different. You have a single bin, you are trying to maximize the total value of the items you place in the bin, each item having a size and value.
      Thanks, I forgot to mention (sorry) I'd already looked at the knapsack problem... but I didn't see any mention of "multiple knapsacks". Frankly, though, that page is rather impenetrable to me because of phrases like "Dominance relations"...


        The Wikipedia article has a section on that:

        Multiple knapsack problem

        This variation is similar to the Bin Packing Problem. It differs from the Bin Packing Problem in that a subset of items can be selected, whereas, in the Bin Packing Problem, all items have to be packed to certain bins. The concept is that there are multiple knapsacks. This may seem like a trivial change, but it is not equivalent to adding to the capacity of the initial knapsack. This variation is used in many loading and scheduling problems in Operations Research and has a PTAS.

        I'm afraid I'm not familiar with this variation, but having a name to search for should help, I reckon.

        D'oh Although I still don't know if that's actually an answer!.


Re: Computer science Problem - single dimension bin packing
by kennethk (Abbot) on Aug 14, 2014 at 16:25 UTC

    What are you trying to optimize? Are you trying to get an even distribution of data across a fixed number of drives? Efficiently fill drives of fixed size? The latter is (as AppleFritter points out) the knapsack problem.

    For case 1), a simple and quite effective solution is to go from largest object to smallest, putting it in the least filled bin.

    #!/usr/bin/perl use warnings; use strict; use Data::Dumper; my %bins = ( A => 0, B => 0, C => 0, D => 0, ); my %dir; my $int; while(<DATA>) { chomp; my ($size, $name) = split /\s+/; $name .= ++$int if $dir{$name}; # Disabiguation; all your names ar +e identical $dir{$name} = $size; } for my $name (sort {$dir{$b} <=> $dir{$a}} keys %dir) { my ($smallest) = sort {$bins{$a} <=> $bins{$b}} keys %bins; $bins{$smallest} += $dir{$name}; } print Dumper(\%bins);
    $VAR1 = { 'A' => '9885811019', 'D' => '9885704251', 'C' => '9885960464', 'B' => '9885704533' };

    For case 2), you would put it in the smallest available space in which it fits.

    #!/usr/bin/perl use warnings; use strict; use Data::Dumper; my $max = 10**10; my $new_bin = 'A'; my %bins; my %dir; my $int; while(<DATA>) { chomp; my ($size, $name) = split /\s+/; $name .= ++$int if $dir{$name}; # Disabiguation; all your names ar +e identical $dir{$name} = $size; } DIR_LOOP: for my $name (sort {$dir{$b} <=> $dir{$a}} keys %dir) { for my $bin_name (sort {$bins{$b} <=> $bins{$a}} keys %bins) { if ($bins{$bin_name} + $dir{$name} < $max) { $bins{$bin_name} += $dir{$name}; next DIR_LOOP; } } $bins{$new_bin++} = $dir{$name}; } print Dumper(\%bins);
    $VAR1 = { 'A' => '9999831421', 'D' => '9543689031', 'C' => '9999660101', 'B' => '9999999714' };

    Update: Added code

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      What are you trying to optimize?

      Neither. It's one of

      • If he has an unlimited number of tapes: Minimize the number of tapes needed (bin packing problem).
      • If he has an limited number of tapes: Maximize the value of what will fit on the tapes he has (multiple knapsack problem).

      Efficiently fill drives of fixed size? The latter is (as AppleFritter points out) the knapsack problem.

      No, it's not, as I already pointed out. In the knapsack problem, you have one bin (tape), and you're trying the maximize what you can put on it. Stuff is left behind.

        Stuff is left behind.

        Then add another bin (tape) and start filling it with leftovers. Repeat until no more bins (tapes) or leftovers.

        The first few bins would probably be most optimally filled for most datasets but if the goal is to use the least amount of tapes this should work... or am I missing something?

        -- FloydATC

        Time flies when you don't know what you're doing

        Except that the original problem description does not include the size of the tapes, which means it doesn't clearly map to either of your suggestions. Thus my attempt to gather more data.

        #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      Hi kennethk. First of all, sorry for not stating my intention concisely. I thought I had, but on re-reading I realise it's not clear.
      However, as you correctly inferred, I'd like to (approximately) evenly distribute the data across a fixed number of drives (4, in this case). It looks like your first chunk of code does this, and produces far better results than my first attempt at placing the largest items first.

      Embarrassingly, I've just looked at my code (which I didn't post) and realised that I in fact sorted in the wrong order, and it placed the smallest item first. The second lesson I should learn is that I should at least include the code that was faulty, as someone would've spotted it instantly. In any case I think I prefer your code slightly to my (now working) code, so I'll use that.

      Thanks everyone (and sorry), as always some interesting responses, although I'm so far behind on the CS terms!


        Coincidentally, I got the same result when I first tested my code. Since I knew you'd already done better with random ordering, I found my bug quickly.

        #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: Computer science Problem - single dimension bin packing ("First Fit Decreasing")
by LanX (Cardinal) on Aug 14, 2014 at 17:33 UTC
    Hi Davis

    You didn't tell us what exactly needs to be optimized.

    NP hardness depends largely on this question.

    FWIW have a look at "First Fit Decreasing" algorithm, it has a complexity of n * log n and guarantees a solution not worse than 3/2 of the optimum of the "Bin Packing Problem" (even 11/9 asymptotically)

    This should be good enough in most cases.

    BTW it's not "Knapsack" since you are not trying to maximize any second weight function of the packed items. They have a size but no additional cost value.

    HTH :)

    Cheers Rolf

    (addicted to the Perl Programming Language and ☆☆☆☆ :)

Re: Computer science Problem - single dimension bin packing
by Bethany (Scribe) on Aug 15, 2014 at 01:40 UTC

    Might help, might not: The last time I had to do a closely related chore I ended up using a technique called "simulated annealing" to find good-but-not-absolutely-optimal solutions. The problem is it was decades ago using an all-but-extinct programming language,* so I can't provide more than the name. Googling "simulated annealing" turns up articles that look relevant but wouldn't provide anything like a ready-to-use solution.

    * Since someone's bound to ask, it was APL.

      Congratulations!   You’re the first person I’ve heard of who did serious-work in APL!   :-D

        Thankew. *curtsey* Last interesting thing I ever did in APL, but it was a doozy.

Re: Computer science Problem - single dimension bin packing
by duelafn (Vicar) on Aug 15, 2014 at 13:24 UTC

    Yes, this is bin packing. There is a module that will do it on CPAN: Algorithm::BinPack. I've also written some code which does this back when I was teaching a class on the topic and needed to be able to compare the various strategies. Generally, if you want the best packing, you just have to try the various strategies and choose the one which makes the most efficient use of space. For any of the strategies, one can construct a scenario where another of the strategies performs better. But, as LanX said, First Fit Decreasing is a usually strong performer.

    Good Day,

Re: Computer science Problem - single dimension bin packing
by sundialsvc4 (Abbot) on Aug 14, 2014 at 17:33 UTC

    Quite honestly, I would duck-and-dodge the problem since there is probably no measurable advantage to an “optimal solution.”   I’d say, just choose at random any bin that is big-enough to hold the next object.   Any garden-variety random number generator is designed to provide a fairly-even distribution of generated values, so the odds are quite good that the actual distribution obtained for any particular backup will be favorable.

    You could also, of course, add a second “back-fill” pass to the algorithm, which examined the initial random result to see if it is obviously-screwy, as in a particular bucket being (say ...) “more than some-n standard deviations away from the mean,” in which case you might, for example, make a random number of attempts to successfully “steal” a randomly-selected item (that will fit ...) from a randomly-chosen bucket in order to drop it into the bucket(s) that are coming-up “short.”

    A little bit of simulation should help you determine whether your chosen compromise-algorithm is “good enough for peace work.”