Recently Pepe asked an interesting question at Divide array of integers into most similar value halves. Given an array of integers, he wanted to split it into two sub-arrays whose sums were as close together as possible. Many people informed him that this was a very hard problem, and references were provided showing that solving it solves the Partition Problem, which is known to be NP-complete. Luckily Pepe did not need an exact answer, and there is a greedy algorithm that was good enough for his needs.

What makes this problem interesting is that, the many confident answers and external references notwithstanding, the actual problem the OP has is almost certainly not NP-complete! As proof I have an implementation on my computer that reliably partitions an array of 100 random integers in the range 0-999 in under 7 seconds. That is, the following code reliably runs in under 7 seconds and is guaranteed to find the best answer:

my ($x, $y)
  = find_best_partition(map int(rand(1000)), 1..100);

How is this possible?

The trick is that when it comes to NP-complete problems, the devil is in the details. If you change even one detail of the problem, what was NP-complete can suddenly become much more tractable. Sometimes the detail is so easy to miss as to be virtually impossible to see. Such is the case here.

The relevant detail in this case is that we are dealing with an array of small integers. If we have 100 integers in the range 0-1000 in size and look at the partitions, the difference between the sums of the two partitions has to be in the range -100,000 to 100,000. Of course there are 2¹⁰⁰ possible partitions, but the same difference in sums will show up from many, many possible partitions. And we don't care about enumerating the possible partitions, only the possible differences. While a range of 200,000 possible differences is not exactly small, it is still tractable.
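To make the counting argument concrete, here is a minimal sketch (not part of the original solution) that counts the distinct differences reachable from a list without ever enumerating the partitions. The name count_differences is hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;

# Count the distinct partition-sum differences reachable from a list,
# without enumerating the partitions themselves.
sub count_differences {
  my %seen = (0 => 1);
  for my $n (@_) {
    my %next;
    for my $d (keys %seen) {
      $next{$d + $n} = 1;   # $n goes into the first half
      $next{$d - $n} = 1;   # $n goes into the second half
    }
    %seen = %next;
  }
  return scalar keys %seen;
}

# 20 random small integers: about a million partitions, but the number
# of distinct differences can never exceed 2 * sum + 1.
my @numbers = map { int(rand(1000)) } 1 .. 20;
printf "%d partitions, %d distinct differences\n",
  2 ** scalar(@numbers), count_differences(@numbers);
```

The hash keys play the same role as in the full implementation: two different ways of reaching the same difference collapse into a single entry.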

With arbitrary integers this idea would fail hard, because nothing stops you from having integers of size 2¹⁰⁰. But in the real world when we say "integer", we usually don't actually mean "arbitrary integer". We mean "small integer" and that difference can be important.

In this case it looks like we want to create a hash of possible differences in the sum of the partitions. The point is that a hash will find and eliminate unneeded duplicate ways of getting the same difference, keeping the problem down to a reasonable size. A back of the envelope estimate says that if we do it right, performance should be O(n²) in the size of the initial dataset, since we make n passes over a hash whose size grows at most linearly with n when the integers are bounded. (A darned sight better than the naive 2^O(n)!) After handling a few technical details, that leads to the following naive implementation that reliably runs on my laptop in about 20 seconds:

sub find_best_partition {
  # We're going to try to find partitions that add up to each possible
  # number that can be added up to.
  my $old;
  my $new = {0 => [[], []]};

  for my $n (sort {$a <=> $b} @_) {
    $old = $new;
    $new = {};
    while (my ($key, $value) = each %$old) {
      my ($p1, $p2) = @$value;
      $new->{$key + $n} ||= [[$n, $p1], $p2];
      $new->{$key - $n} ||= [$p1, [$n, $p2]];
    }
  }

  my $best = each %$new;
  while (my $difference = each %$new) {
    if (abs($difference) < abs($best)) {
      $best = $difference;
    }
  }

  # We need to flatten our nested arrays.
  my ($p1, $p2) = @{ $new->{$best} };

  my @part_1;
  while (@$p1) {
    push @part_1, $p1->[0];
    $p1 = $p1->[1];
  }

  my @part_2;
  while (@$p2) {
    push @part_2, $p2->[0];
    $p2 = $p2->[1];
  }

  return (\@part_1, \@part_2);
}

Of course 20 seconds is a little slower than I'd like. I reasoned that most of my time is spent looking at partitions that I should know are not going to lead to the best answer. So I figured that I should first try to find a greedy solution, and then skip any partial partition which could not possibly match the greedy one when it is filled out. This filtering is most effective if I put the biggest numbers first, because the sum of a few big numbers at the start could be as big as the sum of many little numbers at the end.
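For reference, one common greedy heuristic looks like the sketch below. This is an assumption about what "greedy" means here; the version used in the original thread may differ in detail, and greedy_partition is a hypothetical name.

```perl
use strict;
use warnings;
use List::Util qw(sum0);

# Greedy partition: take numbers largest first and always add the
# next one to whichever half currently has the smaller sum.
sub greedy_partition {
  my (@part_1, @part_2);
  my ($sum_1, $sum_2) = (0, 0);
  for my $n (sort { $b <=> $a } @_) {
    if ($sum_1 <= $sum_2) {
      push @part_1, $n;
      $sum_1 += $n;
    }
    else {
      push @part_2, $n;
      $sum_2 += $n;
    }
  }
  return (\@part_1, \@part_2);
}

my ($p1, $p2) = greedy_partition(8, 7, 6, 5, 4);
printf "sums: %d and %d\n", sum0(@$p1), sum0(@$p2);   # sums: 17 and 13
```

On this input the greedy answer is off by 4, while 8+7 against 6+5+4 is a perfect split. That gap is why the greedy result is only useful as a bound for the exact search, not as the final answer.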

When I implemented this I also noticed that you often do find a perfect partition. So I added a check for whether the starting partition we have plus the rest of the initial greedy partition is a perfect partition. If it is, then stop immediately. In that case you get the right answer virtually instantaneously.

This version often finishes in a couple of hundredths of a second, and the rest of the time finishes on my laptop in under 7 seconds. The code is kind of long, so I'll hide it.
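The following is not the hidden code, only a rough sketch of the pruning idea under simplifying assumptions: it tracks just the differences rather than the partitions themselves, and it leaves out the early exit on a perfect partition. The name best_difference and its $bound parameter are hypothetical; $bound must be at least as large as the true best difference, which holds for the difference achieved by any known partition such as a greedy one.

```perl
use strict;
use warnings;
use List::Util qw(sum0 min);

# Smallest achievable |difference| between the two halves, skipping
# partial partitions that cannot end up within $bound.
sub best_difference {
  my ($bound, @numbers) = @_;
  @numbers = sort { $b <=> $a } @numbers;   # biggest numbers first
  my $remaining = sum0(@numbers);           # sum of numbers not yet placed
  my %diffs = (0 => 1);
  for my $n (@numbers) {
    $remaining -= $n;
    my %next;
    for my $d (keys %diffs) {
      for my $new ($d + $n, $d - $n) {
        # The numbers still to come can shrink |$new| by at most
        # $remaining, so this branch cannot get within $bound.
        next if abs($new) - $remaining > $bound;
        $next{$new} = 1;
      }
    }
    %diffs = %next;
  }
  return min(map { abs($_) } keys %diffs);
}

# A greedy split of these numbers gives sums 17 and 13, so 4 is a valid bound.
print best_difference(4, 8, 7, 6, 5, 4), "\n";   # 0
```

Sorting biggest first matters here because $remaining shrinks fastest that way, so hopeless branches are dropped early.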

So the next time you think you have an NP-complete problem, before throwing your hands up in despair, think about whether there is anything, anything at all, that you can possibly use to turn it into a much simpler problem. Usually you will fail, but every so often you'll get lucky. As in this case.

For another example of a case where a problem looked a lot like an NP-complete one, see Puzzle: need a more general algorithm. See Re: Balance columns for the surprisingly quick solution.

In reply to NP-complete sometimes isn't by tilly
