http://www.perlmonks.org?node_id=580599


in reply to Powerset short-circuit optimization

I'm coming in a bit late to the party, but I think I've come up with an elegant and fast solution. I was tinkering around with the w() function from Weird number generator to see about different approaches. (that function, in case you all are curious will determine for a set of natural numbers N and a sum S, is there any subset of numbers in N that will add up to S). It's highly recursive and extremely golfed and really powerful, so be careful if you try to read it.

Anyway, my original approach to it was to generate powersets and sum them, but that gets out of control ridiculously fast with memory requirements. Yesterday, [id://Limbic~Region] msg'ed me and said he may monkey around with optimizing it, and I, hypercompetitive asshole that I am, started investigating it as well. While in process, I came up with a nifty powerset generator (that I'll probably put over in snippets), and modified it to solve this problem here.

Constraints on my version - it's all bitwise and binary, so it only handles sets up to 32 elements (or 64 if you're lucky), so it would require modification to handle larger sets. Also, it doesn't try to guess in advance which powersets it should and should not produce. It's based upon an iterator that generates sets as it goes along. If a set matches the condition, you should flag it so it knows not to bother generating any of the powersets of that subset.

It uses the simple concept of counting in binary to generate powersets, and this one starts at the high set (all elements in) and counts its way down. This way it should hit most (all?) "big" sets before hitting smaller subsets. I don't know if I can actually prove that's the case, but I think it is.

Since we're just counting, each set has a numeric index between 0 and 2n - 1. The assumption is, as we're going along, you can flag a set's as matching the condition. Then, any subsets of that set will not be generated.

First, the code.

# returns _2_ closures to generate certain powersets sub limbic_power_generator { my $set = shift; #we start with the original set and count down to the null set my $set_idx = 2 ** @$set; #these are the set indexes we should skip my %skippers = (); # our first closure generates subsets my $generator = sub { #bow out if we're out of sets return () unless $set_idx; # Start decrementing the set_idx. We always do this at least once, + so # we get to the next set. Our original number is 2 ** @set, so we +start # at 2 ** @set - 1 after the decrement while (1) { $set_idx--; #boolean to see if we should break out of this loop my $should_skip = 0; # here's the slick binary logic. Iterate over each superset we # should skip. if our current set_idx & (binary and) the skip se +t is equal # to the set_idx we're at, then we know that we have a subset of + that # skip set. So we skip this set. $should_skip is set to 1, which + means # we'll stay in our while loop and decrement down to the next se +t. foreach my $skip (keys %skippers) { if (($skip & $set_idx) == $set_idx) { $should_skip = 1; last; } } #bow out if this set is NOT a subset of any set we're skipping last unless $should_skip; #bow out of the function completely with the null set if we've h +it 0. return () unless $set_idx; } # Now we convert our set_idx to binary. Each bit stores whether th +e element # is in this subset. For example, set_idx 11 would be 1011, so we +keep # elements 0, 2, and 3. my @in_set = reverse split //, unpack("B*", pack("N", $set_idx)); # now we return a list. The first element is an arrayref which is th +e actual # subset we generated, the second is our set_idx. # we reverse it to make our lives easier, so don't be surprised when + the # counting gets out of order. Order is irrelevant in this case, anyw +ay. return ([map { $set->[$_] } grep { $in_set[$_] } (0..$#$set)], $se +t_idx); }; # our second closure allows you to add sets to skip my $skipper = sub { $skippers{$_[0]}++ }; # return both of our closures. return ($generator, $skipper) } #we'll use Limbic~Region's example set. my $limbic_set = [qw(A B C D)]; #create our iterator and skipper my ($limbic_iterator, $limbic_skipper) = limbic_power_generator($limbi +c_set); #and start cruising over our powersets. while ( my ($set, $idx) = $limbic_iterator->() ) { #fancy crap to get it to print out properly. my $display = {map {$_ => 1} @$set}; printf("%2s%2s%2s%2s (%d)\n", (map {$display->{$_} ? $_ : ' '} @$limbic_set), $idx); # now here's the trick, something here will determine a condition wh +ether or # not this subset matches the search parameters. Let's say, for sake + of # argument, that set_idx 7 (ABC) matches. We'll just set it here. $limbic_skipper->(7); # that will prevent sets (AB, AC, A, BC, B, C) from printing out. }

The slick binary logic deserves explanation. let's assume that set ABC (1110) is a valid set that meets the condition. Set BC (0110) may meet it. To see if BC is a subset of ABC, just binary and them together. You should end up with the set you're testing (1110 & 0110 = 0110). If you do, it's a subset and you can skip. If not, it's not a subset, so continue with it.

To try and help illustrate, here's a graphical representation of the order in which the powersets get generated. Each row is the order in which the sets are generated (ABCD first, ABC second, ABD third,etc). Each column represents a subset (excepting that everything shoul be under ABCD). So you can see that ABC is generated (idx 14), and if it matches the condition, then it will skip over everything else in that column (AB, AC, A, BC, B, C).

Note that the sets are repeated under each bucket in which they would match. Technically, there would be additional columns off to set side for (AB), which A and B underneath it, but the diagram was getting busy as is. Note that each subset is only generated once (A) is not created 4x, it's just repeated in each column that has it as a subset.

Whoops- These subsets are actually backwards relative to how they're actually generated (its ABCD, BCD; not ABC, ABC) because of the reversal of the binary digits. I didn't realize that until after I'd spent the time building up the spiffy diagram and didn't want to re-create it with the proper order. The concept is the same, just the sets are produced in a slightly different order.

ABCD| ABC |ABC | AB D| |AB D AB |AB |AB A CD| | |A CD A C |A C | |A C A D| |A D| A |A |A |A BCD| | | | BCD BC | BC | | | BC B D| | B D| | B D B | B | B | | B CD| | | CD| CD C | C | | C | C D| | D| D| D ()

And there you have it. It's lightning fast, and memory efficient. For each subset that matches, you only need to store a single integer to skip over generation of its powerset. I guess the algorithm is O(n2) (or is it O(n log n)? I always goof those up), but that makes it sound scarier than it is - you need to iterate over each set index to see if you should skip it, but at each spot you're doing m bitwise ands for each set you've already determined you should skip. So say you know you're skipping 5 sets, that's at most 5 bitwise ands for each possible set index. Should be pretty cheap.

Replies are listed 'Best First'.
Re^2: Powerset short-circuit optimization
by Limbic~Region (Chancellor) on Oct 25, 2006 at 17:00 UTC
    jimt,
    I admittedly did a poor job of explaining the nuances of this problem that make it difficult. I am going to attempt to clarify since you went through all the trouble of working on it though you may have already nailed it.

    You have N files that have to be processed in a specific order. Each line in each file is a set which needs to have the powerset (minus the empty set) generated for it. Any set that has been previously seen anywhere in any file can be skipped. While the sets inside a given file can be pre-ordered in any way that would assist in optimizations, the files themselves have to be processed in a specific order.

    File 1 is a list of our base sets. File 2 is a list of some base sets combined "two at a time". File 3 is a list of some base sets combined "three at a time". This process is repeated for N files. Ultimately, the powerset of the longest string in the last file will encompass every single set previously seen and should only produce a single set (itself).

    File1 Output ABC A, B, C, AB, AC, BC, ABC ABD D, AD, BD, ABD AB BC E E File2 ABCE AE, BE, CE, ABE, ACE, BCE, ABCE File3 ABCDE CD, DE, ACD, ADE, BCD, BDE, CDE, ABCD, ABDE, ACDE, BCDE, +ABCDE

    I hope this complete example makes it clear what I am trying to accomplish. If your code can do what I describe above, please adapt your example accordingly.

    Cheers - L~R

      This is a friggin' awesome problem. Lots of fun working on it. :-) But I think a solution is at hand.

      My previous solution assumed that there was some sort of external criteria that would determine whether to skip generation of the powerset. Extensive private conversation with Limbic~Region plus his re-statement of the problem made it clear that that's not the case. Existence is the condition AND we need to worry about multiple sets and god knows what else. This will take time to explain.

      Update: Fixed a minor logic glitch that could cause dupes with specially crafted sets. This code should now be complete.

      Update 2: Changed to pack as a little endian long (V) instead of just a long (L) to make cross platform compatible.

      Update 3: Okay, I'm man enough to admit that all of my slick binary logic actually introduced a horrrible inefficiency (looping to see if a set should be skipped). Fixed with a hash lookup instead. Duh.

      Update4: I think I've squeezed as much raw power out of this as I'm going to get. The newest version here requires the set elements be known in advance and it doen't print out the null set. Otherwise? It does work. Comparing against Limbic~Region's code below, mine is indeed still slower. Dang.

        jimt,
        I too couldn't leave good enough alone. After realizing an embarrasing oversight (pointed out by ambrus), I decided to try my hand at a perl + Inline::C version. It finishes the 3_477 records in 1 second flat with constant memory consumption (less than 65MB). Furthermore, here is how it stacks up against the 67_108_863 lines of output from 9_448_847 lines of input (see How many words does it take? for more details):
        • Pure perl ver 1 = 343 min
        • Pure perl ver 2 = 288 min
        • Pure perl ver 3 = 251 min
        • Java = 66 min
        • Perl + Inline::C = 3 min

        Update: I had solicited feedback on my poor C in the CB and was told that, despite it being a translation of the Perl above, that documentation would go along way in getting the desired comments.

        Cheers - L~R

        jimt,
        I am sad to tell you that despite providing an iterative version that need not be called more than necessary, it is terribly slow. I timed (not Benchmarked) 4 versions and unfortunately this wasn't even a contender:
        • java recursive 1 = 12 seconds
        • perl recursive 1 = 13 seconds
        • perl recursive 2 = 28 seconds
        • perl iterative 1 = 6_190 seconds (only 25% done, getting slower, and producing duplicates)

        The code to generate the 3_477 line data file and the recursive java version can be found at How many words does it take?. The two recursive perl versions are below:

        I made minor modifications to your code to handle my dataset as well as produce comparable output:

        Update 1: After your 3rd update, your code finished in a respectable 78 seconds. Unfortunately it is still producing about 40% duplicates. Additionally, it doesn't produce the correct output (missing missing 92_835 strings out of 508_062). For instance 'cdglnst' does not appear at all in your output.

        Update 2: After your 4th update, your code narrowly makes 3rd place with 26 seconds and correct output! I included the entire perl script I am using above to ensure we are comparing apples to apples. Admittedly, yours does scale much better with both speed and memory. Unfortunately, it still isn't quite up to the task I needed. I will have to put this in my back pocket for later though.

        Cheers - L~R