Problems? Is your data what you think it is? PerlMonks

### Powerset short-circuit optimization

by Limbic~Region (Chancellor)
 on Oct 03, 2006 at 15:23 UTC Need Help??
Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All,
I would like to find a way to generate the powerset such that all subsets below the current set can be skipped if a condition is met. I think a concrete example might better explain. Assume our set contains 'A' .. 'D';
A, B, C, D AB, AC, AD, BC, BD, CD ABC, ABD, ACD, BCD ABCD
I omitted the empty set as it has no practical purposes for my problem. The following re-ordering is an example of the optimization I am hoping to use:
ABCD ABC AB A B AC C BC ABD AD D BD ACD CD BCD
So if the condition for 'ABCD' was true, I would skip the entire powerset. If 'ABC' was true I would skip to 'ABD'. If 'ABD' was false but 'AD' was true I would skip to 'BD'.

This exact ordering isn't necessary and I am not sure if it helped explain my desire at all. Ultimately the goal is to generate the powerset for a list of sets but avoid duplication where possible. Using another example:

Set1: A,B,C,D Set2: A,B,C,E set3: A,B,C,F,G
They share sets ABC, AB, AC, BC, A, B, and C so why generate them three times? I have used a similar technique in the past with success but I can't seem to wrap my head around how to do it here. Your thoughts and insights are appreciated.

Update: It was suggested in a /msg that I be more specific about the rules and not assume the examples are sufficient.

• If a set has previously been seen, all subsets of that set can be skipped
• The powerset should be generated in a manner that maximizes the potential for optimization. In other words, A,B,C should be generated before A,C
• No more sets should be generated than would otherwise be done using a straight forward iterative approach. In other words, A, B, C should produce only 7 candidates (or less if optimization possible)

Cheers - L~R

Replies are listed 'Best First'.
Re: Powerset short-circuit optimization
by ikegami (Pope) on Oct 03, 2006 at 18:17 UTC
An initial naïve solution:
use strict; use warnings; sub r(&\$@) { my \$cb = shift(@_); my %seen; local *_r = sub { my @v = @_; return if \$seen{join \$;, @v}++; return if not \$cb->(@v); return if @v == 1; for (0..\$#v) { my @v_ = @v; splice(@v_, \$#v-\$_, 1); _r(@v_); } }; _r(sort @\$_) foreach @_; } r { print(@_, "\n"); 1 } [ qw( A B C D ) ], [ qw( A B C E ) ], [ qw( A B C F G ) ];

Output:

Both the implementation and the algorithm can surely be improved.

Update: The code block now returns a value. This allows the option to skip, as desired.
Update: Slightly modified to attain your ultimate goal.
Update: I can't think of another algorithm. This will likely be my last solution.

ikegami,
This is the approach I previously discussed in the CB with jdporter and blokhead. The undesireable aspect is that _r() is called 29 times to yield 15 items for A,B,C,D instead of just 15. I am not sure there is a way around this but that's why I posted it.

Cheers - L~R

The following only calls _v 15 times. ;) Of course, splice (or its optimized equivalent) and %seen are still called 29 times, but that's a far cry less than the possible 41. Also, even with the original code, the (presumably expensive) callback is only called 15 times.

local *_r = sub { my @v = @_; return if not \$cb->(@v); return if @v == 1; for (0..\$#v) { my @v_ = @v; splice(@v_, \$#v-\$_, 1); _r(@v_) if not \$seen{join \$;, @v_}++; } };

It breaks the rule "A,B,C should be generated before A,C", but the following might allow you to break down the problem such that your requirements can be loosened:

my (\$u_set1, \$u_set2, \$common) = extract_common(\$set1, \$set2); my \$psetc = powerset(\$common); my \$pset1 = product(\$psetc, powerset(\$u_set1)); my \$pset2 = product(\$psetc, powerset(\$u_set2)); my \$pset = union(\$pset1, \$pset2);
Re: Powerset short-circuit optimization
by grinder (Bishop) on Oct 03, 2006 at 16:48 UTC

You can think of elements within a powerset being chosen or not according to a bit vector. 5 elements requires five bits.

You start at 00000, so your powerset has no elements.

Then you increment, and get 00001. Thus, you take the last element. Some time later, after having incremented the counter for a while, you get to 01101, so you take the second, third and fifth elements.

Finally, you get to 11111, and you take all elements. And then you stop.

This is how my module Data::PowerSet is implemented (except that it starts at 11111 and counts down. Looking at the implementation, I don't see an aha! solution to make it jump over a region, but I'm sure with a savant dose of binary and'ing and or'ing you could make it do that.

Patches welcome!

• another intruder with the mooring in the heart of the Perl

grinder,
Yes, I am very familiar with iterative methods at generating a powerset. I am also familiar with generating them in a certain order such that you can "jump" over a region. What I can't figure out how to do is generate them in the specific order that I would most benefit from in this situation. I do have an idea about using multiple iterators to generate the powerset itself instead of a single one. I am playing with that idea now.

Cheers - L~R

Re: Powerset short-circuit optimization
by japhy (Canon) on Oct 03, 2006 at 16:28 UTC
Is your "skipping" rule that if a condition for XYZ is true, you needn't calculate powersets of XYZ?

Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
japhy,
Yes, and I think I see where you are going. Please keep in mind though that changing the problem into a recursive powerset generator might not be a win. I have been discussing this approach in the CB with jdporter and blokhead.

Cheers - L~R

Re: Powerset short-circuit optimization
by BrowserUk (Pope) on Oct 03, 2006 at 19:38 UTC

Would it be right to conclude that (assuming an efficient generator that doesn't generate duplicates), that you will only achieve this optimisation when generating multiple (related) powersets?

That is, in your second example,

• there is no optimisation possible when generating set1.
• It is only possible to optimise the generation of set2 if that set is ordered such that some part of it forms an ordered subset of set1

Ie. If set2 were E, C, B, A, no optimisation would be possible?

Or, can the code assume that the sets are pre-ordered? Or should it sort them?

Or does the definition of powersets imply some ordering?

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
BrowserUk,
You can assume the input set will be ordered alphabetically so your Ie. should not apply. You are also correct that no optimization is possible for a set that has no intersection with a previously generated powerset. If this doesn't answer all of your questions, please let me know.

Cheers - L~R

Re: Powerset short-circuit optimization
by jimt (Chaplain) on Oct 25, 2006 at 15:15 UTC

I'm coming in a bit late to the party, but I think I've come up with an elegant and fast solution. I was tinkering around with the w() function from Weird number generator to see about different approaches. (that function, in case you all are curious will determine for a set of natural numbers N and a sum S, is there any subset of numbers in N that will add up to S). It's highly recursive and extremely golfed and really powerful, so be careful if you try to read it.

Anyway, my original approach to it was to generate powersets and sum them, but that gets out of control ridiculously fast with memory requirements. Yesterday, Limbic~Region msg'ed me and said he may monkey around with optimizing it, and I, hypercompetitive asshole that I am, started investigating it as well. While in process, I came up with a nifty powerset generator (that I'll probably put over in snippets), and modified it to solve this problem here.

Constraints on my version - it's all bitwise and binary, so it only handles sets up to 32 elements (or 64 if you're lucky), so it would require modification to handle larger sets. Also, it doesn't try to guess in advance which powersets it should and should not produce. It's based upon an iterator that generates sets as it goes along. If a set matches the condition, you should flag it so it knows not to bother generating any of the powersets of that subset.

It uses the simple concept of counting in binary to generate powersets, and this one starts at the high set (all elements in) and counts its way down. This way it should hit most (all?) "big" sets before hitting smaller subsets. I don't know if I can actually prove that's the case, but I think it is.

Since we're just counting, each set has a numeric index between 0 and 2n - 1. The assumption is, as we're going along, you can flag a set's as matching the condition. Then, any subsets of that set will not be generated.

First, the code.

# returns _2_ closures to generate certain powersets sub limbic_power_generator { my \$set = shift; #we start with the original set and count down to the null set my \$set_idx = 2 ** @\$set; #these are the set indexes we should skip my %skippers = (); # our first closure generates subsets my \$generator = sub { #bow out if we're out of sets return () unless \$set_idx; # Start decrementing the set_idx. We always do this at least once, + so # we get to the next set. Our original number is 2 ** @set, so we +start # at 2 ** @set - 1 after the decrement while (1) { \$set_idx--; #boolean to see if we should break out of this loop my \$should_skip = 0; # here's the slick binary logic. Iterate over each superset we # should skip. if our current set_idx & (binary and) the skip se +t is equal # to the set_idx we're at, then we know that we have a subset of + that # skip set. So we skip this set. \$should_skip is set to 1, which + means # we'll stay in our while loop and decrement down to the next se +t. foreach my \$skip (keys %skippers) { if ((\$skip & \$set_idx) == \$set_idx) { \$should_skip = 1; last; } } #bow out if this set is NOT a subset of any set we're skipping last unless \$should_skip; #bow out of the function completely with the null set if we've h +it 0. return () unless \$set_idx; } # Now we convert our set_idx to binary. Each bit stores whether th +e element # is in this subset. For example, set_idx 11 would be 1011, so we +keep # elements 0, 2, and 3. my @in_set = reverse split //, unpack("B*", pack("N", \$set_idx)); # now we return a list. The first element is an arrayref which is th +e actual # subset we generated, the second is our set_idx. # we reverse it to make our lives easier, so don't be surprised when + the # counting gets out of order. Order is irrelevant in this case, anyw +ay. return ([map { \$set->[\$_] } grep { \$in_set[\$_] } (0..\$#\$set)], \$se +t_idx); }; # our second closure allows you to add sets to skip my \$skipper = sub { \$skippers{\$_[0]}++ }; # return both of our closures. return (\$generator, \$skipper) } #we'll use Limbic~Region's example set. my \$limbic_set = [qw(A B C D)]; #create our iterator and skipper my (\$limbic_iterator, \$limbic_skipper) = limbic_power_generator(\$limbi +c_set); #and start cruising over our powersets. while ( my (\$set, \$idx) = \$limbic_iterator->() ) { #fancy crap to get it to print out properly. my \$display = {map {\$_ => 1} @\$set}; printf("%2s%2s%2s%2s (%d)\n", (map {\$display->{\$_} ? \$_ : ' '} @\$limbic_set), \$idx); # now here's the trick, something here will determine a condition wh +ether or # not this subset matches the search parameters. Let's say, for sake + of # argument, that set_idx 7 (ABC) matches. We'll just set it here. \$limbic_skipper->(7); # that will prevent sets (AB, AC, A, BC, B, C) from printing out. }

The slick binary logic deserves explanation. let's assume that set ABC (1110) is a valid set that meets the condition. Set BC (0110) may meet it. To see if BC is a subset of ABC, just binary and them together. You should end up with the set you're testing (1110 & 0110 = 0110). If you do, it's a subset and you can skip. If not, it's not a subset, so continue with it.

To try and help illustrate, here's a graphical representation of the order in which the powersets get generated. Each row is the order in which the sets are generated (ABCD first, ABC second, ABD third,etc). Each column represents a subset (excepting that everything shoul be under ABCD). So you can see that ABC is generated (idx 14), and if it matches the condition, then it will skip over everything else in that column (AB, AC, A, BC, B, C).

Note that the sets are repeated under each bucket in which they would match. Technically, there would be additional columns off to set side for (AB), which A and B underneath it, but the diagram was getting busy as is. Note that each subset is only generated once (A) is not created 4x, it's just repeated in each column that has it as a subset.

Whoops- These subsets are actually backwards relative to how they're actually generated (its ABCD, BCD; not ABC, ABC) because of the reversal of the binary digits. I didn't realize that until after I'd spent the time building up the spiffy diagram and didn't want to re-create it with the proper order. The concept is the same, just the sets are produced in a slightly different order.

ABCD| ABC |ABC | AB D| |AB D AB |AB |AB A CD| | |A CD A C |A C | |A C A D| |A D| A |A |A |A BCD| | | | BCD BC | BC | | | BC B D| | B D| | B D B | B | B | | B CD| | | CD| CD C | C | | C | C D| | D| D| D ()

And there you have it. It's lightning fast, and memory efficient. For each subset that matches, you only need to store a single integer to skip over generation of its powerset. I guess the algorithm is O(n2) (or is it O(n log n)? I always goof those up), but that makes it sound scarier than it is - you need to iterate over each set index to see if you should skip it, but at each spot you're doing m bitwise ands for each set you've already determined you should skip. So say you know you're skipping 5 sets, that's at most 5 bitwise ands for each possible set index. Should be pretty cheap.

jimt,
I admittedly did a poor job of explaining the nuances of this problem that make it difficult. I am going to attempt to clarify since you went through all the trouble of working on it though you may have already nailed it.

You have N files that have to be processed in a specific order. Each line in each file is a set which needs to have the powerset (minus the empty set) generated for it. Any set that has been previously seen anywhere in any file can be skipped. While the sets inside a given file can be pre-ordered in any way that would assist in optimizations, the files themselves have to be processed in a specific order.

File 1 is a list of our base sets. File 2 is a list of some base sets combined "two at a time". File 3 is a list of some base sets combined "three at a time". This process is repeated for N files. Ultimately, the powerset of the longest string in the last file will encompass every single set previously seen and should only produce a single set (itself).

File1 Output ABC A, B, C, AB, AC, BC, ABC ABD D, AD, BD, ABD AB BC E E File2 ABCE AE, BE, CE, ABE, ACE, BCE, ABCE File3 ABCDE CD, DE, ACD, ADE, BCD, BDE, CDE, ABCD, ABDE, ACDE, BCDE, +ABCDE

I hope this complete example makes it clear what I am trying to accomplish. If your code can do what I describe above, please adapt your example accordingly.

Cheers - L~R

This is a friggin' awesome problem. Lots of fun working on it. :-) But I think a solution is at hand.

My previous solution assumed that there was some sort of external criteria that would determine whether to skip generation of the powerset. Extensive private conversation with Limbic~Region plus his re-statement of the problem made it clear that that's not the case. Existence is the condition AND we need to worry about multiple sets and god knows what else. This will take time to explain.

Update: Fixed a minor logic glitch that could cause dupes with specially crafted sets. This code should now be complete.

Update 2: Changed to pack as a little endian long (V) instead of just a long (L) to make cross platform compatible.

Update 3: Okay, I'm man enough to admit that all of my slick binary logic actually introduced a horrrible inefficiency (looping to see if a set should be skipped). Fixed with a hash lookup instead. Duh.

Update4: I think I've squeezed as much raw power out of this as I'm going to get. The newest version here requires the set elements be known in advance and it doen't print out the null set. Otherwise? It does work. Comparing against Limbic~Region's code below, mine is indeed still slower. Dang.

Re: Powerset short-circuit optimization
by tilly (Archbishop) on Oct 04, 2006 at 18:51 UTC
Sorry for taking so long to do this. I saw how to do it fairly quickly but didn't get time to write it up.

Fun problem. Figuring out how my solution works is also fun. ;-)

Incidentally the fact that you didn't want the empty set made life easier - I could just use the empty set as my "end of data" metadata marker. If I didn't have that, then I'd have had to use a less convenient API. (I'd have probably just returned array references instead.)

Note that I suspect a naive approach such as the ones above is actually faster.

#! /usr/bin/perl -w use strict; my \$iterator = Set::IterateAndSkip->new( @ARGV ? @ARGV : 'A'..'D' ); while (my @subset = \$iterator->next) { print @subset, ":\tskip? "; if (<STDIN> =~ /y/i) { \$iterator->skip_subsets; } } print "Done\n"; package Set::IterateAndSkip; sub new { my \$class = shift; my \$set = [map {[1, \$_]} @_]; \$set->[0][0] = 0; return bless { set => \$set, }, \$class; } sub end_of_first_run { my \$self = shift; my \$set = \$self->{set}; my \$n = -1; for (@\$set) { if (\$_->[0]) { \$n++; } else { last; } } return \$n; } sub skip_subsets { my \$self = shift; \$self->{set}[0][0] = 0; } sub next { my \$self = shift; my \$set = \$self->{set}; my \$end = \$self->end_of_first_run; if (-1 < \$end) { \$set->[\$end][0] = 0; } else { \$set->[0][0] = 1; \$end = \$self->end_of_first_run; if (\$end == \$#\$set) { if (\$self->{started}) { # Reinitialize in case we want to cycle through again. delete \$self->{started}; \$set->[0][0] = -1; return (); } else { \$self->{started} = 1; } } else { \$set->[\$end][0] = 0; \$set->[\$end + 1][0] = 1; } } my @result = map {\$_->[0] ? \$_->[1] : ()} @\$set; # Need to skip the empty set. return @result ? @result : \$self->next(); }

Unless I've been wasting my time pursuing the wrong goal (which is quite possible), this isn't going to achieve the original aim. I'll try to explain, but sorry if I miss the mark.

The original post said:

Using another example:

Set1: A,B,C,D Set2: A,B,C,E set3: A,B,C,F,G

They share sets ABC, AB, AC, BC, A, B, and C so why generate them three times?

Which suggests the idea is to avoid regenerating those subsets that are a part of a second powerset, if you've already generated them for a previous powerset. To that end, you might use your code to generate the first powerset completely (metaphorically answering 'no' to the prompt), and then save a stringified version of each subset generated in a hash. \$seen{ "@subset" } = 1

Then, when generating a second (or subsequent) powerset, each time you get back a subset, you look it up in the hash and use it's existance or otherwise to decide whether to skip the rest of that subset tree.

skip() if exists \$seen{ "@subset" }

The problem is, for this to work, the code has to produce the same (number and order), subset sequence from any given starting point, but your code does not do this(*).

Running your code on B C D and answering alternately 'yes' and 'no' at the top level shows what should be skipped when answering yes to that subset:

C:\test>tilly-powersets B C D C:\test>tilly-powersets B C D BCD: skip? n BCD: skip? y BC: skip? n Done B: skip? n C: skip? n BD: skip? n D: skip? n CD: skip? n Done

Now, running it on the set A B C D, we should be able to skip the generation of the 6 subsets [BC], [B], [C], [BD], [D], [CD], but when we run the code and answer 'yes' each time we see a sequence we saw during the generation of the B C D powerset, we still end up generating the same number of outputs:

>tilly-powersets A B C D >tilly-powersets A B C D ABCD: skip? n ABCD: skip? n ABC: skip? n ABC: skip? n AB: skip? n AB: skip? n A: skip? n A: skip? n B: skip? n B: skip? y AC: skip? n AC: skip? n C: skip? n C: skip? y BC: skip? n BC: skip? y ABD: skip? n ABD: skip? n AD: skip? n AD: skip? n D: skip? n D: skip? y BD: skip? n BD: skip? y ACD: skip? n ACD: skip? n CD: skip? n CD: skip? y BCD: skip? n BCD: skip? y Done Done

This is because when generating a subtree that contains identical characters, but that is located at a different position in the base set, the order of generation is different, so the skipahead has a different effect. Occasionally, this will result in some saving, but often, as in the case above, none at all, and very rarely (from my analysis), will it achieve the full potential saving.

(*). I've been rapidly reaching the conclusion in my own attempts that there isn't any way to do this.

I've come up with 4 different ways of producing the powersets, each of which produce a different ordering, but none of them allow the potential saving to be achieved because the orderings are dependant upon the positions of the data in the set, not the values in those positions. Ie. * A B C * will never produced the same subtree as A B C * or * * A B C.

Basically, in every generator I've seen or come up with, the values of the data are entirely transparent to the algorithm, as they manipulate positions in the set only. Which means any comparison/decisions about skipping based upon the values of the data are bound to fail.

The only approach I've had any success with is to store not only the subset produced, but also their offset in the original base set (which is messy and complicated), and then only skip if you encounter both situations. The same data at the same offset. Whilst this would work, the reduction in the frequency with which the savings can be achieved, combined with the space requirements (and allocation costs) for storage, and the complexity of the comparisons, means that it's simply quicker to generate the sequence with a fast iterator.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
BrowserUk,
You have nailed it. When I posted this, I originally thought that having each set to be processed in alphabetical sorting the sets by length in descending order would allow for this kind of optimization.

In examining the problem this post is meant to help solve further, I have realized I will not be able to process the list in descending length even if it would make things work.

Grrrrr!

Cheers - L~R

Let me say what my code does, and what it doesn't do.

It generates the subsets in the exact order that the original post gave, and does so in a way in which you can easily skip all subsets of the current subset. Furthermore it does this with bounded memory requirements, and with somewhat reasonable performance. However it does not guarantee that it generates all longer sets before generating a short set.

If you wish to guarantee that it generates all longer sets before generating a short set, then you'll have to explicitly remember all of the skips you were asked to make (with potentially large memory requirements to do that), and the skipping step is going to become very complex. Limbic-Region will have to confirm one way or another, but I think I made the performance tradeoff that he would want.

Re: Powerset short-circuit optimization
by creamygoodness (Curate) on Nov 06, 2006 at 01:56 UTC

Greets,

Here's another entry for you, Limbic~Region. I believe it meets your criteria of checking the longest strings first.

--
Marvin Humphrey
Rectangular Research ― http://www.rectangular.com
creamygoodness,
Thanks. This will give me something to chew on for a while. I actually have a pretty advanced project that I would like to use C for but don't currently have the skillset for. Practice makes ... er um, well yeah.

Update: Since you asked in the CB how this performed in comparison to mine. It takes only 61% of the time to produce the same output in my very rough benchmark (wall clock on a non-idle system).

Cheers - L~R

Create A New User
Node Status?
node history
Node Type: perlquestion [id://576101]
Approved by Tanktalus
Front-paged by andyford
help
Chatterbox?
 [stevieb]: choroba my GPS home device is done, at least prototype v1 is. I'm just making sure all of my code does what it says now [stevieb]: I test/prototype code on my Pi, then I migrate it over to Arduino when it's ready (if that's a better platform for what I'm doing). The whole RPi stuff was for learning at first, then it snowballed. I'm just trying to get my in-house CI... [stevieb]: ...done, which obviously requires a specific hardware setup. [stevieb]: I don't write tests for the lower-level distributions (ie. ICs, sensors etc) within those distributions, I have all tests in a master distribution which encompasess all of the sub-modules

How do I use this? | Other CB clients
Other Users?
Others meditating upon the Monastery: (6)
As of 2017-06-25 22:54 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
How many monitors do you use while coding?

Results (572 votes). Check out past polls.