Re: Powerset short-circuit optimization
by jimt (Chaplain) on Oct 25, 2006 at 15:15 UTC
I'm coming in a bit late to the party, but I think I've come up with an elegant and fast solution. I was tinkering around with the w() function from Weird number generator to see about different approaches. (That function, in case you're curious, takes a set of natural numbers N and a sum S and determines whether any subset of N adds up to S.) It's highly recursive, extremely golfed, and really powerful, so be careful if you try to read it.
Anyway, my original approach to it was to generate powersets and sum them, but the memory requirements get out of control ridiculously fast. Yesterday, Limbic~Region msg'ed me and said he might monkey around with optimizing it, and I, hypercompetitive asshole that I am, started investigating it as well. In the process, I came up with a nifty powerset generator (that I'll probably put over in snippets), and modified it to solve this problem here.
Constraints on my version: it's all bitwise and binary, so it only handles sets of up to 32 elements (or 64 on a build with 64-bit integers), and it would require modification to handle larger sets. Also, it doesn't try to guess in advance which subsets it should and should not produce. It's based on an iterator that generates sets as it goes along. If a set matches the condition, you flag it so the iterator knows not to bother generating any subsets of that set.
It uses the simple concept of counting in binary to generate powersets, and this one starts at the high set (all elements in) and counts its way down. This way it should hit most (all?) "big" sets before hitting smaller subsets. I don't know if I can actually prove that's the case, but I think it is.
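On the "can I prove it?" question, part of it does hold: clearing bits in a mask can only make the number smaller, so a countdown reaches every set before any of its own proper subsets. What does not hold is "bigger sets first" in terms of element count, since 1000 (one element, index 8) comes before 0111 (three elements, index 7). Here is a small brute-force check of the first claim, as a Python sketch; the function name is made up for illustration.

```python
def supersets_come_first(n):
    """Return True if every proper subset has a smaller index than its superset.

    Sets are bitmasks over n elements. A countdown from 2**n - 1 visits
    larger indices first, so this is the property the countdown relies on.
    """
    for sup in range(2 ** n):
        for sub in range(2 ** n):
            is_proper_subset = sub != sup and (sub & sup) == sub
            if is_proper_subset and not sub < sup:
                return False
    return True


# Cardinality is a different matter: index 8 (one bit set) is visited
# before index 7 (three bits set) when counting down.
one_element, three_elements = 0b1000, 0b0111
```

So skipping by flagged supersets is safe in a countdown, even though the sets do not come out sorted by size.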
Since we're just counting, each set has a numeric index between 0 and 2^n - 1, where n is the number of elements. The assumption is that, as you go along, you flag a set as matching the condition. Then no subsets of that set will be generated.
First, the code.
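As a rough illustration of the approach described so far, here is a minimal Python sketch. It is not the original Perl. The names `maximal_matches`, `matches` and `skip` are hypothetical, and it assumes the first item maps to the highest bit, following the ABC = 1110 notation used in this post.

```python
def maximal_matches(items, matches):
    """Count down through all subset indices, skipping subsets of flagged sets.

    items   -- list of elements (the first item is the highest bit)
    matches -- caller-supplied predicate taking a list of elements
    Returns the matching subsets that were actually generated.
    """
    n = len(items)
    skip = []   # one integer per flagged set
    found = []
    for idx in range(2 ** n - 1, -1, -1):
        # idx is a subset of a flagged set s exactly when (idx & s) == idx
        if any((idx & s) == idx for s in skip):
            continue
        subset = [item for bit, item in enumerate(items)
                  if (idx >> (n - 1 - bit)) & 1]
        if matches(subset):
            skip.append(idx)
            found.append(subset)
    return found


# Example condition (an assumption for the demo): sets of at most two elements.
pairs = maximal_matches(["A", "B", "C", "D"], lambda s: len(s) <= 2)
```

With that condition, the single-element sets and the empty set are never built, because each is a subset of some pair that was already flagged.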
The slick binary logic deserves explanation. Let's assume that set ABC (1110) is a valid set that meets the condition. Set BC (0110) may meet it too. To see whether BC is a subset of ABC, just bitwise AND them together. If you end up with the set you're testing (1110 & 0110 = 0110), it's a subset and you can skip it. If you get anything else, it's not a subset, so carry on with it.
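The subset test above, spelled out as a tiny Python sketch (the function name `is_subset` is just for illustration):

```python
def is_subset(candidate, flagged):
    """True if every bit set in candidate is also set in flagged."""
    return (candidate & flagged) == candidate


ABC = 0b1110
BC = 0b0110
AD = 0b1001

bc_in_abc = is_subset(BC, ABC)   # 1110 & 0110 = 0110, equals BC
ad_in_abc = is_subset(AD, ABC)   # 1110 & 1001 = 1000, not equal to AD
```

AD fails the test because D's bit is lost in the AND, so AD still has to be generated and checked.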
To try and help illustrate, here's a graphical representation of the order in which the powersets get generated. Each row is the order in which the sets are generated (ABCD first, ABC second, ABD third, etc.). Each column represents a subset (except that everything should be under ABCD). So you can see that ABC is generated (idx 14), and if it matches the condition, then everything else in that column (AB, AC, A, BC, B, C) is skipped.
Note that the sets are repeated under each bucket in which they would match. Technically, there would be additional columns off to the side for (AB), with A and B underneath it, but the diagram was getting busy as is. Note that each subset is only generated once: (A) is not created 4x, it's just repeated in each column that has it as a subset.
Whoops: these subsets are actually backwards relative to how they're really generated (it's ABCD, BCD; not ABCD, ABC) because of the reversal of the binary digits. I didn't realize that until after I'd spent the time building up the spiffy diagram, and I didn't want to re-create it with the proper order. The concept is the same, just the sets are produced in a slightly different order.
And there you have it. It's lightning fast and memory efficient. For each set that matches, you only need to store a single integer to skip generation of all its subsets. I always goof up big-O, so here's the count instead: with n elements there are 2^n set indices to walk, and at each one you do at most m bitwise ANDs, one for each set you've already flagged to skip. That's on the order of m * 2^n cheap operations, which sounds scarier than it is. So say you know you're skipping 5 sets, that's at most 5 bitwise ANDs for each possible set index. Should be pretty cheap.