Re^3: NP-complete sometimes isn't (A benchmark)

in reply to Re^2: NP-complete sometimes isn't (A benchmark)
in thread NP-complete sometimes isn't

Sadly, I got interested in this problem. Although it is nominally O(2**(N-1)), it does seem possible to do quite well for a lot less work -- and, the larger the set, the easier it appears to get !!

So the trick with any decision problem is to prune the search space. tilly observed that although the search space is 2**(N-1), the number of unique nodes may be a lot smaller.

The other trick in the bag is to see if some heuristic can help us prune. As others have observed, the set can be partitioned quite effectively by:

sorting the set into descending order
assigning values in turn to the partition with the smaller sum at the time

This 'hack' works wonderfully for sets containing numbers drawn from a range 1..n and pretty well for m..n -- and, the bigger the set, the better it works !

(I tested this against putting values into one partition until it just exceeds the perfect split -- starting with the set sorted in descending order and with random selection from the set. I even tried putting values into one partition until it just doesn't exceed the perfect split, and then finding the best remaining value. No better.)

Looking at the result of the initial 'hack', one can see numbers that could be swapped between partitions to improve things. This 'refinement' is a linear scan down the two partitions -- a bit worse than linear if numbers can be exchanged.

I tried these heuristics on 2000 sets of numbers drawn randomly from 1..99, 2000 sets drawn from 1..9999 and 100 from 9000..9999; for sets of length 31, 100 and 500:

  :            : 2000 x 1..99      : 2000 x 1..9999    :  100 x 9000..9999 :
  :            : perf.%  av. delta : perf.%  av. delta : perf.%  av. delta :
  :------------:-------------------:-------------------:--------------------
  :  31: hack  :  44.4%     +1.117 :   3.0%   +131.786 :   0.0%  +4500.720 :
  :      refine:  97.7%     +0.024 :   3.5%    +25.639 :   0.0%  +1435.640 :
  :      best  : 100.0%     +0.000 : 100.0%     +0.000 :   0.0%  +1015.820 :
  :------------:-------------------:-------------------:--------------------
  : 100: hack  :  84.5%     +0.174 :   3.0%    +38.562 :  13.0%     +4.640 :
  :      refine: 100.0%     +0.000 :  30.8%     +2.292 :  95.0%     +0.050 :
  :      best  : 100.0%     +0.000 : 100.0%     +0.000 : 100.0%     +0.000 :
  :------------:-------------------:-------------------:--------------------
  : 500: hack  : 100.0%     +0.000 :   9.7%     +7.560 :  50.0%     +0.750 :
  :      refine: 100.0%     +0.000 :  99.9%     +0.001 : 100.0%     +0.000 :
  :      best  : 100.0%     +0.000 : 100.0%     +0.000 : 100.0%     +0.000 :

where the perf% is the percentage of perfectly partitioned sets after the given operation, and the av. delta is the average distance from perfect, over all sets. The 'best' line shows the best possible outcome.

Note that the longer the set:

the more likely there is to be a perfect partition,
and the more likely the 'hack' and 'refine' operations are to finding one !!

Having got that far, how does one search for the best possible partition ? Looking at the problem as a tree walk, what I tried was:

start at the node identified by the initial 'hack' & 'refine', and work back up the tree (making sure that all nodes are considered exactly once).
discard parts of the tree that whose minimum result is greater than the best so far, or whose maximum result is less than the best so far -- minimax.
discard parts of the tree which have already been visited with the same partition total -- similar to the trick used in 708384
build a table for all combinations of the last 'n' numbers in the set, which short circuits the search of the lowest levels of the tree. Building the table is not free, so it's built progressively, when there appears to be a need.
limit the search, so that doesn't disappear for ever !

This turned out to quite tricky for a bear-of-little-brain, and, with all the debug and test stuff, runs to 2800 lines.

Anyway, I ran this on the same test data, and collected the average running time per partition operation:

  :     : 2000 x 1..99      : 2000 x 1..9999    :  100 x 9000..9999 :
  :-----:-------------------:-------------------:--------------------
  :  31 :  GMCH 127.20 uSec :  GMCH   4.26 mSec :  GMCH 615.67 mSec :
  :     : tilly  32.18 mSec : tilly   2.55 Secs : tilly   6.34 Secs :
  :-----:-------------------:-------------------:-------------------:
  : 100 :  GMCH 298.18 uSec :  GMCH   9.59 mSec :  GMCH 733.82 uSec :
  :     : tilly 606.84 mSec : tilly  65.82 Secs : tilly 109.46 mSec :
  :-----:-------------------:-------------------:-------------------:
  : 500 :  GMCH   1.76 mSec :  GMCH   1.87 mSec :  GMCH   1.84 mSec :
  :     : tilly  19.89 Secs : tilly  *****      : tilly  *****      :

where 'GMCH' is my code, 'tilly' is the code given in 708384, uSec is micro-Secs and mSec is milli-secs. The '*****' is where I terminated the test, because it had run out of memory and started to thrash.

The code is available here http://www.highwayman.com/perls/part_GMCH_v1.14.pm I can post it here if y'all like.

What has surprised me is that this works as well as it does, particularly as the set gets bigger ! I wonder if I just haven't found the right test data :-( I'd be pleased to hear of stuff that this won't cope with.

For completeness, I'm only handling +ve values. To cope with -ve values the minimax and other pruning has to be modified to do different things depending on which side of the perfect partition one is on at any time.

In Section Meditations