davis has asked for the
wisdom of the Perl Monks concerning the following question:
My.... it's been a long time
I have a problem that I suspect has been well answered by algorithms, but the trouble is I don't have a name to put to it to search for the "best" algorithm. I suspect someone with a computer science background might Just Know(tm).
Simply put, I have a list of ~100 objects that differ wildly in sizes (they're actually subdirectories). I want to sort them into different bins (which represent the tape drives I intend to use to back them up). I've come up with the following naïve code which works, but it breaks down a bit when the directory sizes vary more.
So, is this an NPhard problem? Does it have a name? Is there a "better" way to solve it? I've tried the rather obvious sorting by size first.
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
my %bins = (
A => 0,
B => 0,
C => 0,
D => 0,
);
foreach(<DATA>) {
my $line = $_;
chomp $line;
my ($size, $name) = split /\s+/, $line;
my $least_full_bin = "A";
foreach my $bin (keys %bins) {
$least_full_bin = $bin if($bins{$bin} < $bins{$least_full_bin}
+)
}
$bins{$least_full_bin} += $size;
}
print Dumper(\%bins);
__DATA__
2548780 foobarbaz
16913085 foobarbaz
1898507 foobarbaz
67964007 foobarbaz
50239619 foobarbaz
90904439 foobarbaz
91859405 foobarbaz
443250331 foobarbaz
271426911 foobarbaz
466223129 foobarbaz
1020605272 foobarbaz
576517351 foobarbaz
362218227 foobarbaz
597801541 foobarbaz
396406322 foobarbaz
493933490 foobarbaz
695851994 foobarbaz
1093525924 foobarbaz
3860084060 foobarbaz
1815001219 foobarbaz
1242364290 foobarbaz
167867012 foobarbaz
5201589904 foobarbaz
4229781128 foobarbaz
1805056667 foobarbaz
5139655485 foobarbaz
5202726574 foobarbaz
947237511 foobarbaz
210571932 foobarbaz
51518 foobarbaz
2737944 foobarbaz
2686687 foobarbaz
0 foobarbaz
3 foobarbaz
1 foobarbaz
1 foobarbaz
3 foobarbaz
3 foobarbaz
0 foobarbaz
0 foobarbaz
3848 foobarbaz
1 foobarbaz
6584 foobarbaz
8879 foobarbaz
1 foobarbaz
1 foobarbaz
2031703999 foobarbaz
25614 foobarbaz
1 foobarbaz
686994 foobarbaz
2 foobarbaz
32703 foobarbaz
135867 foobarbaz
7 foobarbaz
9 foobarbaz
3148830 foobarbaz
6803655 foobarbaz
147343006 foobarbaz
57013 foobarbaz
22 foobarbaz
655127653 foobarbaz
2034016 foobarbaz
1323038 foobarbaz
26 foobarbaz
9309654 foobarbaz
11 foobarbaz
472 foobarbaz
70134007 foobarbaz
1 foobarbaz
669792 foobarbaz
296399 foobarbaz
0 foobarbaz
37 foobarbaz
1775825 foobarbaz
6499972 foobarbaz
11766 foobarbaz
1392 foobarbaz
6344 foobarbaz
10290402 foobarbaz
8529 foobarbaz
0 foobarbaz
28232823 foobarbaz
796 foobarbaz
Re: Computer science Problem  single dimension bin packing by ikegami (Pope) on Aug 14, 2014 at 16:14 UTC 
You actually used the name of the problem in the subject: Bin packing problem. Yes, it is NPhard.  [reply] 
Re: Computer science Problem  single dimension bin packing by AppleFritter (Priest) on Aug 14, 2014 at 16:11 UTC 
This sounds like the knapsack problem problem to me, which is indeed NPhard. If you search CPAN for "knapsack", there's a few modules that may be of help.
That said, for many of these problems, it's only finding an optimal solution that's NPhard (or worse). Finding a good solution may be much easier (speaking in terms of computational complexity), depending of course on the precise constraints on what constitutes a "good" solution.
 [reply] 

The knapsack problem is different. You have a single bin, you are trying to maximize the total value of the items you place in the bin, each item having a size and value.
 [reply] 

Thanks, I forgot to mention (sorry) I'd already looked at the knapsack problem... but I didn't see any mention of "multiple knapsacks". Frankly, though, that page is rather impenetrable to me because of phrases like "Dominance relations"...
 [reply] 

The Wikipedia article has a section on that:
Multiple knapsack problem
This variation is similar to the Bin Packing Problem. It differs from the Bin Packing Problem in that a subset of items can be selected, whereas, in the Bin Packing Problem, all items have to be packed to certain bins. The concept is that there are multiple knapsacks. This may seem like a trivial change, but it is not equivalent to adding to the capacity of the initial knapsack. This variation is used in many loading and scheduling problems in Operations Research and has a PTAS.
I'm afraid I'm not familiar with this variation, but having a name to search for should help, I reckon.
 [reply] 



D'oh
Although I still don't know if that's actually an answer!.
 [reply] 
Re: Computer science Problem  single dimension bin packing by kennethk (Abbot) on Aug 14, 2014 at 16:25 UTC 
What are you trying to optimize? Are you trying to get an even distribution of data across a fixed number of drives? Efficiently fill drives of fixed size? The latter is (as AppleFritter points out) the knapsack problem. For case 1), a simple and quite effective solution is to go from largest object to smallest, putting it in the least filled bin.
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
my %bins = (
A => 0,
B => 0,
C => 0,
D => 0,
);
my %dir;
my $int;
while(<DATA>) {
chomp;
my ($size, $name) = split /\s+/;
$name .= ++$int if $dir{$name}; # Disabiguation; all your names ar
+e identical
$dir{$name} = $size;
}
for my $name (sort {$dir{$b} <=> $dir{$a}} keys %dir) {
my ($smallest) = sort {$bins{$a} <=> $bins{$b}} keys %bins;
$bins{$smallest} += $dir{$name};
}
print Dumper(\%bins);
output:
$VAR1 = {
'A' => '9885811019',
'D' => '9885704251',
'C' => '9885960464',
'B' => '9885704533'
};
For case 2), you would put it in the smallest available space in which it fits.
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
my $max = 10**10;
my $new_bin = 'A';
my %bins;
my %dir;
my $int;
while(<DATA>) {
chomp;
my ($size, $name) = split /\s+/;
$name .= ++$int if $dir{$name}; # Disabiguation; all your names ar
+e identical
$dir{$name} = $size;
}
DIR_LOOP: for my $name (sort {$dir{$b} <=> $dir{$a}} keys %dir) {
for my $bin_name (sort {$bins{$b} <=> $bins{$a}} keys %bins) {
if ($bins{$bin_name} + $dir{$name} < $max) {
$bins{$bin_name} += $dir{$name};
next DIR_LOOP;
}
}
$bins{$new_bin++} = $dir{$name};
}
print Dumper(\%bins);
outputs
$VAR1 = {
'A' => '9999831421',
'D' => '9543689031',
'C' => '9999660101',
'B' => '9999999714'
};
Update: Added code
#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
 [reply] [d/l] [select] 

What are you trying to optimize?
Neither. It's one of
 If he has an unlimited number of tapes: Minimize the number of tapes needed (bin packing problem).
 If he has an limited number of tapes: Maximize the value of what will fit on the tapes he has (multiple knapsack problem).
Efficiently fill drives of fixed size? The latter is (as AppleFritter points out) the knapsack problem.
No, it's not, as I already pointed out. In the knapsack problem, you have one bin (tape), and you're trying the maximize what you can put on it. Stuff is left behind.
 [reply] 

Stuff is left behind.
Then add another bin (tape) and start filling it with leftovers. Repeat until no more bins (tapes) or leftovers.
The first few bins would probably be most optimally filled for most datasets but if the goal is to use the least amount of tapes this should work... or am I missing something?
 FloydATC
Time flies when you don't know what you're doing
 [reply] 



 [reply] 

Hi kennethk. First of all, sorry for not stating my intention concisely. I thought I had, but on rereading I realise it's not clear.
However, as you correctly inferred, I'd like to (approximately) evenly distribute the data across a fixed number of drives (4, in this case). It looks like your first chunk of code does this, and produces far better results than my first attempt at placing the largest items first.
Embarrassingly, I've just looked at my code (which I didn't post) and realised that I in fact sorted in the wrong order, and it placed the smallest item first. The second lesson I should learn is that I should at least include the code that was faulty, as someone would've spotted it instantly. In any case I think I prefer your code slightly to my (now working) code, so I'll use that.
Thanks everyone (and sorry), as always some interesting responses, although I'm so far behind on the CS terms!
 [reply] 

 [reply] 
Re: Computer science Problem  single dimension bin packing ("First Fit Decreasing") by LanX (Canon) on Aug 14, 2014 at 17:33 UTC 
Hi Davis
You didn't tell us what exactly needs to be optimized.
NP hardness depends largely on this question.
FWIW have a look at "First Fit Decreasing" algorithm, it has a complexity of
n * log n and guarantees a solution not worse than 3/2 of the optimum of the "Bin Packing Problem" (even 11/9 asymptotically)
This should be good enough in most cases.
BTW it's not "Knapsack" since you are not trying to maximize any second weight function of the packed items. They have a size but no additional cost value.
HTH :)
Cheers Rolf
_{(addicted to the Perl Programming Language and ☆☆☆☆ :)}
 [reply] 
Re: Computer science Problem  single dimension bin packing by Bethany (Scribe) on Aug 15, 2014 at 01:40 UTC 
Might help, might not: The last time I had to do a closely related chore I ended up using a technique called "simulated annealing" to find goodbutnotabsolutelyoptimal solutions. The problem is it was decades ago using an allbutextinct programming language,* so I can't provide more than the name. Googling "simulated annealing" turns up articles that look relevant but wouldn't provide anything like a readytouse solution.
* Since someone's bound to ask, it was APL.
 [reply] 

 [reply] 

 [reply] 
Re: Computer science Problem  single dimension bin packing by duelafn (Priest) on Aug 15, 2014 at 13:24 UTC 
Yes, this is bin packing. There is a module that will do it on CPAN: Algorithm::BinPack. I've also written some code which does this back when I was teaching a class on the topic and needed to be able to compare the various strategies. Generally, if you want the best packing, you just have to try the various strategies and choose the one which makes the most efficient use of space. For any of the strategies, one can construct a scenario where another of the strategies performs better. But, as LanX said, First Fit Decreasing is a usually strong performer.
 [reply] 
Re: Computer science Problem  single dimension bin packing by sundialsvc4 (Abbot) on Aug 14, 2014 at 17:33 UTC 
Quite honestly, I would duckanddodge the problem since there is probably no measurable advantage to an “optimal solution.” I’d say, just choose at random any bin that is bigenough to hold the next object. Any gardenvariety random number generator is designed to provide a fairlyeven distribution of generated values, so the odds are quite good that the actual distribution obtained for any particular backup will be favorable.
You could also, of course, add a second “backfill” pass to the algorithm, which examined the initial random result to see if it is obviouslyscrewy, as in a particular bucket being (say ...) “more than somen standard deviations away from the mean,” in which case you might, for example, make a random number of attempts to successfully “steal” a randomlyselected item (that will fit ...) from a randomlychosen bucket in order to drop it into the bucket(s) that are comingup “short.”
A little bit of simulation should help you determine whether your chosen compromisealgorithm is “good enough for peace work.”
 [reply] 

