Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Re^3: Powerset short-circuit optimization

by jimt (Chaplain)
on Oct 25, 2006 at 21:27 UTC ( [id://580656]=note: print w/replies, xml ) Need Help??


in reply to Re^2: Powerset short-circuit optimization
in thread Powerset short-circuit optimization

This is a friggin' awesome problem. Lots of fun working on it. :-) But I think a solution is at hand.

My previous solution assumed that there was some sort of external criteria that would determine whether to skip generation of the powerset. Extensive private conversation with Limbic~Region plus his re-statement of the problem made it clear that that's not the case. Existence is the condition AND we need to worry about multiple sets and god knows what else. This will take time to explain.

I will run through some theory an explanation, then post the modified code, then further explain that.

The issue that other people were having was placement of the characters in the set. So if you start off with set ABC, you'll know to ignore AB. However, you need to keep track of "AB" somewhere in memory and then do a string comparison or the like. Expensive.

My original approach simply used set enumeration to flag sets as being valid or dupes or whatnot. But that wont work here. With the example set ABC, you'd flag BC (idx 6 by how my function counts). BCD comes along, but its idx 6 is CD. So you'd improperly skip over CD. Further, its BC is idx 3, which you'd duplicate. Not good.

We're going to fix this by using placeholders in the set by means of dead bits. This is far cheaper than you would initially think.

For set A B C, you can enumerate the powerset in binary: 111 ABC 110 AB 101 A C 100 A 011 BC 010 B 001 C 000 For BCD, it would be thus: 111 BCD 110 BC 101 B C 100 B 011 CD 010 C 001 D 000 Instead, we want to offset the values in BCD and leave an empty slot f +or A. We'll end up with this: 0111 BCD 0110 BC 0101 B C 0100 B 0011 CD 0010 C 0001 D 0000

BC is 011 in our original set, and hey, in our newly modified set with the dead bit at position 0, BC is once again 011. So now we can safely exclude this set.

Note - this does not require the elements to be in sorted order, or even for you to know them in advance. We'll use an LZW style approach to keep assigning slots as we see them.

(for ABC) Have we seen A before? No? A = slot 0 Have we seen B before? No? B = slot 1 Have we seen C before? No? C = slot 2 (for BCD) Have we seen B before? YES - return 1 Have we seen C before? YES - return 2 Have we seen D before? No? D == slot 3

The order is irrelevent, but it does require that we build up an additional hash table mapping elements -> slots. Memory intensive, but relatively cheap - we only need to look into this index at the start of processing on a set.

But wait! You say, I glossed over a whole set of values up at the top! Let's return to them:

By using a dead bit for A, we actually have this: 1111 invalid 1110 invalid 1101 invalid 1100 invalid 1011 invalid 1010 invalid 1001 invalid 1000 invalid 0111 BCD 0110 BC 0101 B C 0100 B 0011 CD 0010 C 0001 D 0000

That's hugely inefficient. That's 8 checks we need to do just to get to a valid set. It spirals horribly out of control as you add more dead bits. Say, for example, you were using all letters of the alphabet. Your last set is just (z). By that point, z is in slot 25. So you'd have:

11111111111111111111111111 invalid 11111111111111111111111110 invalid 11111111111111111111111101 invalid . . . 00000000000000000000000010 invalid 00000000000000000000000001 valid (z)

Thats 67,108,864 iterations through invalid sets before you find your one valid one. Fortunately, an xor can cause us to solve this in constant time.

Build up a new digit that is the |'ed (bitwise or) string of all of your dead bits. In this case, we'd have 11111111111111111111111110. Now, as part of your index incrementing loop, you & (bitwise and) your set candidate with the deadbit string. If you get back 0, then you're fine (there are no deadbits in your set index, so you continue with it. If you get back something other than zero, then you ^ (bitwise xor) your index with the deadbit string and then & it with the original (to ensure you don't accidentally turn on any bits that were previously off). This will remove all deadbits from your string and jump you to the next valid index past the deadbits.

Examples are in order:

From up above, set (z) start at 11111111111111111111111111 deadbit string = 11111111111111111111111110 start & deadbit = 11111111111111111111111110 this is > 0, so xor. (start ^ deadbit) & start = 00000000000000000000000001 Bang! one and, a conditional, and an xor jumped you past 67 million so +me odd invalid sets. Another one, from above, set (ABCD) start at 1111 deadbit string = 1110 start & deadbit = 1110 this is > 0, so xor (start ^ deadbit) & start = 0111 (starts at BCD Here's a more complex example. Assume our first set was ABC, and our n +ext was ABD. (A = 0, B = 1, C = 2, D = 3). When we hit ABD, we flag C + as our deadbit and our deadbit string would be 0010. Let's try a complete runthrough: 1111 1111 & 0010 > 0 -> (1111 ^ 0010) & 1111 -> new index is 1101 1101 ABD (1101 & 0010 == 0) 1100 AB (1100 & 0010 == 0) 1011 (1011 & 0010 == 0) 1011 & 0010 > 0 -> (1011 ^ 0010) & 1011 -> new index is 1001 1001 A D (1001 & 0010 == 0) 1000 A (1000 & 0010 == 0) 0111 0111 & 0010 > 0 -> (0111 ^ 0010) & 0111 -> new index is 0101 0101 B D (0101 & 0010 == 0) 0100 B (0100 & 0010 == 0) 0011 0011 & 0010 > 0 -> (0011 ^ 0010) & 0010 -> new index is 0001 0001 D (0001 & 0010 == 0) 0000 () (0000 & 0010 == 0)

The code must be modified to incorporate the deadbits, as well as given the ability to return and re-use our previously skipped indexes. Modifications are as follows. There's a fair bit of additional set up an logic required vs. the previous version, but it can spit out all the numbers as desired from node 580625 with only 39 calls to the iterator and by storing only 32 integers in memory. Most of the additional work and overhead are bitwise operations.

use strict; use warnings; # returns _3_ closures to generate certain powersets #arbitrary benchmark device, used to se how many times the iterator wa +s called. my $calls = 0; sub limbic_power_generator { my $set = shift; # we re-define skippers as an array and allow the user to pass it in +, so we # can keep track of previously skipped values. The value of the hash + was # that it prevents dupes. Note - it would be better to mod the old v +ersion # to always use a descendingly sorted array, but is left to the read +er. my $skippers = shift || {}; my $deadbits = shift || 0; #we start with the original set and count down to the null set my $set_idx = 2 ** @$set; #these are the set indexes we should skip my %skippers = (); # our first closure generates subsets my $generator = sub { # arbitrary benchmark device, that way you can see how may times t +he iterator # was called $calls++; #bow out if we're out of sets return () unless $set_idx; # Start decrementing the set_idx. We always do this at least once, + so # we get to the next set. Our original number is 2 ** @set, so we +start # at 2 ** @set - 1 after the decrement while (1) { $set_idx--; # check to see if this set contains a deadbit, and if so hop ove +r it. if ($set_idx & $deadbits) { # make sure that we don't accidentally jump up to a higher set + index. # this can happen if you have deadbits beyond the length of yo +ur set $set_idx = ($set_idx ^ $deadbits) & $set_idx; } #bow out if this set is NOT a subset of any set we're skipping last unless $skippers->{$set_idx}; #bow out of the function completely with the null set if we've h +it 0. return () unless $set_idx; } # Now we convert our set_idx to binary. Each bit stores whether th +e element # is in this subset. For example, set_idx 11 would be 1011, so we +keep # elements 0, 2, and 3. my @in_set = split //, unpack("b*", pack("V",$set_idx)); # now we return a list. The first element is an arrayref which is th +e actual # subset we generated, the second is our set_idx. return ([map { $set->[$_] } grep { $in_set[$_] } (0..$#$set)], $se +t_idx); }; # our second closure allows you to add sets to skip # it also returns the list of skipped values my $skipper = sub { if (@_) { my $skip_key = shift; $skippers->{$skip_key}++; } return $skippers; }; # return both of our closures. return ($generator, $skipper) } # we'll use the example sets from node 580625 my $limbic_sets = [ [qw(A B C)], [qw(A B D)], [qw(A B)], [qw(B C)], [qw(E)], [qw(A B C E)], [qw(A B C D E)], ]; # our index lookup hash. There are potential savings by pre-caching th +ese values # if all elements are known in advance. my %idx = (); my $next_open_idx = 0; # our sets to skip my %skippers = (); foreach my $limbic_set (@$limbic_sets) { print "checks set @$limbic_set\n"; # we need to keep track of which indexes are dead, so we copy the # known indexes my %dead_idx = %idx; # here we'll keep track of the bits that are dead my $deadbits = 0; # we now need to iterate over our set. If we know the index of tha +t element # then great. That means we've seen it before, and it's currently +live, so # delete it from our list of dead bits. # # otherwise, assign it a new index. foreach my $elem (@$limbic_set) { if (defined $idx{$elem}) { delete $dead_idx{$elem}; } else { $idx{$elem} = $next_open_idx++; } } #here we're going to store the indexes which are dead my %dead_lookup = (); # iterate over our dead elements list, and toss it into the deadbi +ts string # and add its index to the lookup foreach my $idx (values %dead_idx) { $deadbits |= 2 ** $idx; $dead_lookup{$idx}++; } # we need to pad out set with dead bits. So if we call with (ABC), + then later # with (ABD), we need to turn that into (AB D) my $padded_limbic_set = []; my $padded_limbic_idx = 0; foreach my $idx (0..$#$limbic_set) { # if that index is dead, then toss in a placeholder and shift +the array # element forward. This is using parallel indexes, there may b +e a more # efficient method. if ($dead_lookup{$padded_limbic_idx}) { $padded_limbic_set->[$padded_limbic_idx++] = undef; redo; } $padded_limbic_set->[$padded_limbic_idx++] = $limbic_set->[$id +x]; } # get our iterators, using the padded set, skippers, and deadbits. my ($limbic_iterator, $limbic_skipper) = limbic_power_generator($p +added_limbic_set, \%skippers, $deadbits); #as we see an eleemnt, we're going to add it to this list, so we s +kip it on the next pass. my %future_skippers = (); #and start cruising over our powersets. while ( my ($set, $idx) = $limbic_iterator->() ) { #fancy crap to get it to print out properly. my $display = {map {$_ => 1} grep {defined} @$set}; my $format = "%2s" x scalar(@$padded_limbic_set) . " (%d)\n"; printf($format, (map {defined $_ && $display->{$_} ? $_ : ' '} @ +$padded_limbic_set), $idx); #we don't skip anything in this pass, but we'll do it the next t +ime around. $future_skippers{$idx}++; } @skippers{keys %future_skippers} = values %future_skippers; } print "TOTAL CALLS $calls\n";

Update: Fixed a minor logic glitch that could cause dupes with specially crafted sets. This code should now be complete.

Update 2: Changed to pack as a little endian long (V) instead of just a long (L) to make cross platform compatible.

Update 3: Okay, I'm man enough to admit that all of my slick binary logic actually introduced a horrrible inefficiency (looping to see if a set should be skipped). Fixed with a hash lookup instead. Duh.

Update4: I think I've squeezed as much raw power out of this as I'm going to get. The newest version here requires the set elements be known in advance and it doen't print out the null set. Otherwise? It does work. Comparing against Limbic~Region's code below, mine is indeed still slower. Dang.

Current state of the art. Expects the data file to be passed on the command line:

#!/usr/bin/perl use strict; use warnings; sub set_generator { my ($set, $set_idx, $skippers, $deadbits) = @_; # Start decrementing the set_idx. We always do this at least once, + so # we get to the next set. Our original number is 2 ** @set, so we +start # at 2 ** @set - 1 after the decrement while ($set_idx--) { # check to see if this set contains a deadbit, and if so hop ove +r it. # make sure that we don't accidentally jump up to a higher set i +ndex. # this can happen if you have deadbits beyond the length of your + set $set_idx = ($set_idx ^ $deadbits) & $set_idx; #bow out if this set is NOT a subset of any set we're skipping last unless $skippers->{$set_idx}; #bow out of the function completely with the null set if we've h +it 0. } #bow out if we're out of sets return () unless $set_idx; # Now we convert our set_idx to binary. Each bit stores whether th +e element # is in this subset. For example, set_idx 11 would be 1011, so we +keep # elements 0, 2, and 3. my @in_set = split //, unpack("b*", pack("L",$set_idx)); # now we return a list. The first element is an arrayref which is th +e actual # subset we generated, the second is our set_idx. return ([map { $set->[$_] } grep { $in_set[$_] } (0..$#$set)], $se +t_idx); }; # our index lookup hash. Assume letters 'a'-'z'. The original version +of # dynamically assigning indexes had a few bugs. Whoops. So now you're +required # to know all elements in advance. Damn. I'll fix this later, I guess. my @letters = ('a'..'z'); my %idx = map {$letters[$_] => $_} (0..25); # our sets to skip my %skippers = (); while (my $limbic_string = <>) { chomp $limbic_string; my $limbic_set = [split //, $limbic_string]; # we need to keep track of which indexes are dead, so we copy the # known indexes my %dead_idx = %idx; # here we'll keep track of the bits that are dead my $deadbits = 0; #remove the live bits from the dead set. delete @dead_idx{@$limbic_set}; # iterate over our dead elements list, and toss it into the deadbi +ts string # and splice in an undef to our list. foreach my $idx (sort {$a <=> $b} values %dead_idx) { $deadbits |= 2 ** $idx; splice @$limbic_set, $idx, 0, undef unless $idx >= @$limbic_se +t; } #as we see an element, we're going to add it to this list, so we s +kip it on the next pass. my @future_skippers = (); my $set_idx = 2 ** @$limbic_set; #and start cruising over our powersets. while (my ($set, $newidx) = set_generator($limbic_set, $set_idx, \ +%skippers, $deadbits)) { print @$set,"\n"; #note our new set index $set_idx = $newidx; #we don't skip anything in this pass, but we'll do it the next t +ime around. push @future_skippers, $set_idx; } @skippers{@future_skippers} = (1) x @future_skippers; }

Replies are listed 'Best First'.
Re^4: Powerset short-circuit optimization
by Limbic~Region (Chancellor) on Nov 01, 2006 at 16:13 UTC
    jimt,
    I too couldn't leave good enough alone. After realizing an embarrasing oversight (pointed out by ambrus), I decided to try my hand at a perl + Inline::C version. It finishes the 3_477 records in 1 second flat with constant memory consumption (less than 65MB). Furthermore, here is how it stacks up against the 67_108_863 lines of output from 9_448_847 lines of input (see How many words does it take? for more details):
    • Pure perl ver 1 = 343 min
    • Pure perl ver 2 = 288 min
    • Pure perl ver 3 = 251 min
    • Java = 66 min
    • Perl + Inline::C = 3 min

    Update: I had solicited feedback on my poor C in the CB and was told that, despite it being a translation of the Perl above, that documentation would go along way in getting the desired comments.

    Cheers - L~R

      Hi, L~R...

      First the good news: the algorithm seems fine to me, and there's only one language-specific gotcha I found that could have caused you problems. That's very impressive for someone with next-to-no C experience.

      Now the critique: This is hard code to review. The explanation helps, but there are a lot of things that get in the way.

      • Code without any comments at all is lame. Lame, lame, lame. Especially when its A) dense, and B) implements an unusual algorithm.
      • Those initializations wrap in every browser I checked. Do they wrap in yours? Looks like $h$_ amigo! Do you care?
      • While single line for/while/if blocks are a common Perl idiom, they are rare in C code.
      • If the variable has nothing to do with a "bit", don't call it bit. Yes, I know now that's an artifact of having translated Perl code that used a bit vector. But coming in fresh, I didn't, and it threw me off.
      • 4 out of 5 coders surveyed recommend naming variables with meaningful monikers rather than nebulosities like "val".
      • There are no blank lines to break up that big C function into chunks. It would be much easier to digest if it were in paragraphs -- ideally, commented paragraphs, as Damian recommends in Perl Best Practices.

      With all those impediments, the only way this ordinary human was able to get through that and follow it from start to finish was to rewrite it myself. In so doing, I'm happy to report the only thing I found that was really off was this:

      memset(seen, '0', sizeof(seen)); /* ... snip ... */ if (seen[bit] == '1') { return; } seen[bit] = '1';

      In C, '0' != 0 and '1' != 1. Those single quotes return the integer value of the enclosed character, which, for '0' on an ascii machine is not 0 but 48. :) You got away with it because you were consistent, though. :) That memset() should look like this:

      memset(seen, 0, sizeof(seen));
      --
      Marvin Humphrey
      Rectangular Research ― http://www.rectangular.com
Re^4: Powerset short-circuit optimization
by Limbic~Region (Chancellor) on Oct 26, 2006 at 13:28 UTC
    jimt,
    I am sad to tell you that despite providing an iterative version that need not be called more than necessary, it is terribly slow. I timed (not Benchmarked) 4 versions and unfortunately this wasn't even a contender:
    • java recursive 1 = 12 seconds
    • perl recursive 1 = 13 seconds
    • perl recursive 2 = 28 seconds
    • perl iterative 1 = 6_190 seconds (only 25% done, getting slower, and producing duplicates)

    The code to generate the 3_477 line data file and the recursive java version can be found at How many words does it take?. The two recursive perl versions are below:

    I made minor modifications to your code to handle my dataset as well as produce comparable output:

    Update 1: After your 3rd update, your code finished in a respectable 78 seconds. Unfortunately it is still producing about 40% duplicates. Additionally, it doesn't produce the correct output (missing missing 92_835 strings out of 508_062). For instance 'cdglnst' does not appear at all in your output.

    Update 2: After your 4th update, your code narrowly makes 3rd place with 26 seconds and correct output! I included the entire perl script I am using above to ensure we are comparing apples to apples. Admittedly, yours does scale much better with both speed and memory. Unfortunately, it still isn't quite up to the task I needed. I will have to put this in my back pocket for later though.

    Cheers - L~R

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://580656]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others pondering the Monastery: (5)
As of 2024-03-19 08:20 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found