Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:
I would like to find a way to generate the powerset such that all subsets below the current set can be skipped if a condition is met. I think a concrete example might explain better. Assume our set contains 'A' .. 'D', giving the powerset:

    A, B, C, D
    AB, AC, AD, BC, BD, CD
    ABC, ABD, ACD, BCD
    ABCD

I omitted the empty set as it has no practical purpose for my problem. The following re-ordering is an example of the optimization I am hoping to use:

    ABCD
        ABC
            AB
                A
                B
            AC
                C
            BC
        ABD
            AD
                D
            BD
        ACD
            CD
        BCD

So if the condition for 'ABCD' was true, I would skip the entire powerset. If 'ABC' was true, I would skip to 'ABD'. If 'ABD' was false but 'AD' was true, I would skip to 'BD'.
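The thread's code is Perl, but the re-ordering above is easy to sketch as a depth-first walk; here is a minimal Python illustration (names and structure are my own, not code from the thread): each subset is followed by its not-yet-seen size-(k-1) subsets, and a true condition marks the whole sub-powerset as seen so it is skipped.

```python
from itertools import combinations

def powerset_skip(base, condition):
    """Emit nonempty subsets of `base` in the tree order shown above.
    When condition(subset) is true, every proper subset of it that has
    not yet been emitted is skipped."""
    seen, out = set(), []

    def visit(subset):
        if not subset or subset in seen:
            return
        seen.add(subset)
        out.append(subset)
        if condition(subset):
            # mark the entire sub-powerset as seen so none of it is emitted
            for r in range(1, len(subset)):
                seen.update(combinations(subset, r))
            return
        for child in combinations(subset, len(subset) - 1):
            visit(child)

    visit(tuple(base))
    return out

# with the condition always false, this reproduces the order
# ABCD, ABC, AB, A, B, AC, C, BC, ABD, AD, D, BD, ACD, CD, BCD
order = [''.join(s) for s in powerset_skip("ABCD", lambda s: False)]
```

Note the sketch pays O(2^k) to mark a skipped region; the point of the thread is to skip such regions without materializing them, which this does not attempt.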
This exact ordering isn't necessary, and I am not sure it helped explain my goal at all. Ultimately the goal is to generate the powerset for a list of sets while avoiding duplication where possible. Using another example:

    Set1: A,B,C,D
    Set2: A,B,C,E
    Set3: A,B,C,F,G

They share the subsets ABC, AB, AC, BC, A, B, and C, so why generate them three times? I have used a similar technique in the past with success, but I can't seem to wrap my head around how to do it here. Your thoughts and insights are appreciated.
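As a toy illustration of the duplication (a Python sketch with invented names, nothing from the thread): a shared "seen" hash makes each subset appear only once across the three sets. Note it still pays a lookup for every candidate rather than skipping whole regions, which is exactly the cost the question hopes to avoid.

```python
from itertools import combinations

def shared_powersets(list_of_sets):
    """Yield each nonempty subset once across all input sets,
    longest subsets first within each set."""
    seen = set()
    for s in list_of_sets:
        elems = tuple(s)
        for r in range(len(elems), 0, -1):
            for subset in combinations(elems, r):
                if subset not in seen:
                    seen.add(subset)
                    yield ''.join(subset)

# Set1/Set2/Set3 from above: the 7 shared subsets of {A,B,C}
# are generated only once
subsets = list(shared_powersets(["ABCD", "ABCE", "ABCFG"]))
```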
Update: It was suggested in a /msg that I be more specific about the rules and not assume the examples are sufficient.
- If a set has previously been seen, all subsets of that set can be skipped.
- The powerset should be generated in a manner that maximizes the potential for optimization; in other words, A,B,C should be generated before A,C.
- No more sets should be generated than a straightforward iterative approach would produce; in other words, A, B, C should yield only 7 candidates (or fewer, if optimization is possible).
Cheers - L~R
Re: Powerset short-circuit optimization
by ikegami (Patriarch) on Oct 03, 2006 at 18:17 UTC
Both the implementation and the algorithm can surely be improved. Update: The code block now returns a value. This allows the option to skip, as desired.
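ikegami's code is collapsed in this archive, so here is a guess at the shape of "the code block now returns a value" (my own Python sketch, not a translation of his Perl): a generator whose consumer sends back a value after each subset, and a true value prunes everything below that subset.

```python
from itertools import combinations

def powerset_iter(base):
    """Yield subsets depth-first, supersets before their subsets.
    The consumer may send(True) for a subset to prune everything below it."""
    seen = set()
    stack = [tuple(base)]
    while stack:
        subset = stack.pop()
        if not subset or subset in seen:
            continue
        seen.add(subset)
        skip = yield subset
        if skip:
            # remember the pruned region so none of it is ever yielded
            for r in range(1, len(subset)):
                seen.update(combinations(subset, r))
            continue
        # push the (k-1)-subsets, reversed so the first is popped next
        for child in reversed(list(combinations(subset, len(subset) - 1))):
            stack.append(child)

# drive it: prune below AB, so neither A nor B is produced
gen = powerset_iter("ABC")
produced = []
try:
    subset = next(gen)
    while True:
        produced.append(''.join(subset))
        subset = gen.send(''.join(subset) == 'AB')
except StopIteration:
    pass
```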
by Limbic~Region (Chancellor) on Oct 03, 2006 at 22:13 UTC
by ikegami (Patriarch) on Oct 03, 2006 at 23:29 UTC
The following only calls _v 15 times. ;) Of course, splice (or its optimized equivalent) and %seen are still called 29 times, but that's far fewer than the possible 41. Also, even with the original code, the (presumably expensive) callback is only called 15 times.
It breaks the rule "A,B,C should be generated before A,C", but the following might allow you to break down the problem such that your requirements can be loosened:
Re: Powerset short-circuit optimization
by grinder (Bishop) on Oct 03, 2006 at 16:48 UTC
You can think of the elements within a powerset as being chosen or not according to a bit vector: 5 elements require five bits. You start at 00000, so your subset has no elements. Then you increment and get 00001; thus, you take the last element. Some time later, after having incremented the counter for a while, you get to 01101, so you take the second, third and fifth elements. Finally, you get to 11111, and you take all the elements. And then you stop.

This is how my module Data::PowerSet is implemented (except that it starts at 11111 and counts down). Looking at the implementation, I don't see an aha! solution to make it jump over a region, but I'm sure that with a savant dose of binary and'ing and or'ing you could make it do that. Patches welcome!

• another intruder with the mooring in the heart of the Perl
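In Python terms, grinder's counter scheme looks roughly like this (a sketch of the idea only, not Data::PowerSet's actual implementation, which counts down from all-ones): read the counter as a bit string and take element j when bit j, counting from the left, is set.

```python
def bitvector_powerset(elems):
    """Enumerate all subsets by counting from 0 to 2**n - 1 and reading
    the counter as an n-bit string: element j is taken when bit j
    (counting from the left) is 1."""
    n = len(elems)
    for counter in range(2 ** n):
        bits = format(counter, '0{}b'.format(n))
        yield [e for e, b in zip(elems, bits) if b == '1']

# 00001 takes the last element; 01101 takes the second, third and fifth
subsets = list(bitvector_powerset("ABCDE"))
```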
by Limbic~Region (Chancellor) on Oct 03, 2006 at 16:57 UTC
Yes, I am very familiar with iterative methods of generating a powerset. I am also familiar with generating them in a certain order such that you can "jump" over a region. What I can't figure out is how to generate them in the specific order that I would most benefit from in this situation. I do have an idea about using multiple iterators to generate the powerset itself instead of a single one, and I am playing with that idea now.

Cheers - L~R
Re: Powerset short-circuit optimization
by japhy (Canon) on Oct 03, 2006 at 16:28 UTC
by Limbic~Region (Chancellor) on Oct 03, 2006 at 16:34 UTC
Re: Powerset short-circuit optimization
by BrowserUk (Patriarch) on Oct 03, 2006 at 19:38 UTC
Would it be right to conclude that (assuming an efficient generator that doesn't generate duplicates) you will only achieve this optimisation when generating multiple (related) powersets, as in your second example? Also: can the code assume that the sets are pre-ordered? Or should it sort them? Or does the definition of powersets imply some ordering?

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
by Limbic~Region (Chancellor) on Oct 03, 2006 at 21:42 UTC
You can assume the input set will be ordered alphabetically, so your "Ie." example should not apply. You are also correct that no optimization is possible for a set that has no intersection with a previously generated powerset. If this doesn't answer all of your questions, please let me know.

Cheers - L~R
Re: Powerset short-circuit optimization
by jimt (Chaplain) on Oct 25, 2006 at 15:15 UTC
I'm coming in a bit late to the party, but I think I've come up with an elegant and fast solution. I was tinkering with the w() function from Weird number generator to see about different approaches. (That function, in case you're curious, determines, for a set of natural numbers N and a sum S, whether any subset of N adds up to S.) It's highly recursive, extremely golfed, and really powerful, so be careful if you try to read it. My original approach was to generate powersets and sum them, but that gets out of control ridiculously fast with memory requirements.

Yesterday, [id://Limbic~Region] msg'ed me and said he may monkey around with optimizing it, and I, hypercompetitive asshole that I am, started investigating it as well. In the process, I came up with a nifty powerset generator (that I'll probably put over in snippets) and modified it to solve this problem.

Constraints on my version: it's all bitwise and binary, so it only handles sets of up to 32 elements (or 64 if you're lucky) and would require modification to handle larger sets. Also, it doesn't try to guess in advance which subsets it should and should not produce. It's based on an iterator that generates sets as it goes along; if a set matches the condition, you flag it so the iterator knows not to bother generating any of that set's subsets.

It uses the simple concept of counting in binary to generate the powerset, and this one starts at the high set (all elements in) and counts its way down. This way it should hit most (all?) "big" sets before hitting smaller subsets. I don't know if I can actually prove that's the case, but I think it is. Since we're just counting, each set has a numeric index between 0 and 2^n - 1. The assumption is that, as we're going along, you can flag a set as matching the condition; then, any subsets of that set will not be generated. First, the code.
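jimt's code itself is collapsed in this archive, but the mechanism he describes can be sketched like so (Python rather than his Perl, with invented names; `should_skip` stands in for "you flag a set as matching the condition"). The AND-based subset test is the one explained in the next paragraph.

```python
def countdown_powerset(elems, should_skip=lambda names: False):
    """Count a bitmask down from all-ones (the full set) toward 1.
    After a set is produced, should_skip() may flag it; any strict
    subset of a flagged mask is then skipped via the test sub & sup == sub."""
    n = len(elems)
    flagged = []
    for mask in range(2 ** n - 1, 0, -1):
        if any(mask & f == mask for f in flagged):
            continue  # mask is a subset of some flagged set
        names = ''.join(e for i, e in enumerate(elems) if mask >> i & 1)
        yield names
        if should_skip(names):
            flagged.append(mask)

full = list(countdown_powerset("ABCD"))                       # all 15 sets
pruned = list(countdown_powerset("ABC", lambda s: s == 'AB')) # A and B skipped
```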
The slick binary logic deserves explanation. Let's assume that set ABC (1110) is a valid set that meets the condition, and set BC (0110) may meet it. To see if BC is a subset of ABC, just binary-AND them together; you should end up with the set you're testing (1110 & 0110 = 0110). If you do, it's a subset and you can skip it. If not, it's not a subset, so continue with it.

To try to help illustrate, here's a graphical representation of the order in which the subsets get generated. Each row is the order in which the sets are generated (ABCD first, ABC second, ABD third, etc.). Each column represents a subset (except that everything should be under ABCD). So you can see that ABC is generated (idx 14), and if it matches the condition, everything else in its column (AB, AC, A, BC, B, C) will be skipped. Note that the sets are repeated under each bucket in which they would match. Technically, there would be additional columns off to the side for AB, with A and B underneath it, but the diagram was getting busy as it is. Note that each subset is only generated once: A is not created 4x, it's just repeated in each column that has it as a subset.

Whoops -- these subsets are actually backwards relative to how they're really generated (it's ABCD, BCD, ...; not ABCD, ABC, ...) because of the reversal of the binary digits. I didn't realize that until after I'd spent the time building up the spiffy diagram, and didn't want to re-create it with the proper order. The concept is the same; the sets are just produced in a slightly different order.
And there you have it. It's lightning fast and memory efficient: for each subset that matches, you only need to store a single integer to skip generation of its subsets. I guess the algorithm is O(n^2) (or is it O(n log n)? I always goof those up), but that makes it sound scarier than it is. You need to iterate over each set index to see if you should skip it, but at each index you're doing m bitwise ANDs, one for each set you've already determined you should skip. So say you know you're skipping 5 sets; that's at most 5 bitwise ANDs for each possible set index. Should be pretty cheap.
by Limbic~Region (Chancellor) on Oct 25, 2006 at 17:00 UTC
I admittedly did a poor job of explaining the nuances of this problem that make it difficult. Since you went through all the trouble of working on it, I am going to attempt to clarify, though you may have already nailed it.

You have N files that have to be processed in a specific order. Each line in each file is a set for which the powerset (minus the empty set) needs to be generated. Any set that has previously been seen anywhere in any file can be skipped. While the sets inside a given file can be pre-ordered in any way that would assist in optimizations, the files themselves have to be processed in a specific order. File 1 is a list of our base sets. File 2 is a list of some base sets combined "two at a time". File 3 is a list of some base sets combined "three at a time". This process is repeated for N files. Ultimately, the powerset of the longest string in the last file will encompass every single set previously seen and should produce only a single set (itself).
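A brute-force Python sketch of this pipeline (my own framing; it checks a global seen-hash for every subset rather than skipping whole regions, which is the optimization actually being sought) demonstrates the property claimed for the last file:

```python
from itertools import combinations

def process_files(files):
    """files: lists of sets (strings), in the required processing order.
    For each set, emit every nonempty subset not seen anywhere before,
    longest first."""
    seen = set()
    emitted = []
    for sets_in_file in files:
        for s in sets_in_file:
            elems = tuple(s)
            for r in range(len(elems), 0, -1):
                for sub in combinations(elems, r):
                    if sub not in seen:
                        seen.add(sub)
                        emitted.append(''.join(sub))
    return emitted

# base sets, then "two at a time", then "three at a time":
# the final file's set contributes only itself
result = process_files([["A", "B", "C"], ["AB", "AC", "BC"], ["ABC"]])
```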
I hope this complete example makes it clear what I am trying to accomplish. If your code can do what I describe above, please adapt your example accordingly.

Cheers - L~R
by jimt (Chaplain) on Oct 25, 2006 at 21:27 UTC
This is a friggin' awesome problem -- lots of fun working on it. :-) But I think a solution is at hand. My previous solution assumed that there was some sort of external criterion that would determine whether to skip generation of the powerset. Extensive private conversation with Limbic~Region, plus his re-statement of the problem, made it clear that that's not the case: existence is the condition, AND we need to worry about multiple sets and god knows what else. This will take time to explain.
Update: Fixed a minor logic glitch that could cause dupes with specially crafted sets. This code should now be complete.

Update 2: Changed to pack as a little-endian long (V) instead of just a long (L) to make it cross-platform compatible.

Update 3: Okay, I'm man enough to admit that all of my slick binary logic actually introduced a horrible inefficiency (looping to see if a set should be skipped). Fixed with a hash lookup instead. Duh.

Update 4: I think I've squeezed as much raw power out of this as I'm going to get. The newest version here requires that the set elements be known in advance, and it doesn't print out the null set. Otherwise? It does work. Comparing against Limbic~Region's code below, mine is indeed still slower. Dang.
by Limbic~Region (Chancellor) on Nov 01, 2006 at 16:13 UTC
by creamygoodness (Curate) on Nov 06, 2006 at 02:45 UTC
by Limbic~Region (Chancellor) on Oct 26, 2006 at 13:28 UTC
Re: Powerset short-circuit optimization
by tilly (Archbishop) on Oct 04, 2006 at 18:51 UTC
Fun problem. Figuring out how my solution works is also fun. ;-) Incidentally, the fact that you didn't want the empty set made life easier: I could just use the empty set as my "end of data" marker. If I didn't have that, I'd have had to use a less convenient API (I'd probably just have returned array references instead). Note that I suspect a naive approach such as the ones above is actually faster.
by BrowserUk (Patriarch) on Oct 04, 2006 at 20:46 UTC
Unless I've been wasting my time pursuing the wrong goal (which is quite possible), this isn't going to achieve the original aim. I'll try to explain, but sorry if I miss the mark. The original post said:
Which suggests the idea is to avoid regenerating those subsets that are part of a second powerset if you've already generated them for a previous powerset. To that end, you might use your code to generate the first powerset completely (metaphorically answering 'no' to the prompt), and then save a stringified version of each subset generated in a hash:

    $seen{ "@subset" } = 1;

Then, when generating a second (or subsequent) powerset, each time you get back a subset you look it up in the hash, and use its existence or otherwise to decide whether to skip the rest of that subset tree:

    skip() if exists $seen{ "@subset" };

The problem is that, for this to work, the code has to produce the same subset sequence (in number and order) from any given starting point, and your code does not do this(*). Running your code on B C D and answering alternately 'yes' and 'no' at the top level shows what should be skipped when answering yes to that subset:
Now, running it on the set A B C D, we should be able to skip the generation of the 6 subsets [BC], [B], [C], [BD], [D], [CD]; but when we run the code and answer 'yes' each time we see a sequence we saw during the generation of the B C D powerset, we still end up generating the same number of outputs:
This is because, when generating a subtree that contains identical characters but is located at a different position in the base set, the order of generation is different, so the skipahead has a different effect. Occasionally this will result in some saving, but often, as in the case above, none at all, and only very rarely (from my analysis) will it achieve the full potential saving.

(*) I've been rapidly reaching the conclusion in my own attempts that there isn't any way to do this. I've come up with 4 different ways of producing the powersets, each of which produces a different ordering, but none of them allow the potential saving to be achieved, because the orderings are dependent upon the positions of the data in the set, not the values in those positions. I.e. * A B C * will never produce the same subtree as A B C * or * * A B C. Basically, in every generator I've seen or come up with, the values of the data are entirely transparent to the algorithm; they manipulate positions in the set only. Which means any comparisons or decisions about skipping based upon the values of the data are bound to fail.

The only approach I've had any success with is to store not only the subsets produced but also their offsets in the original base set (which is messy and complicated), and then only skip if you encounter both: the same data at the same offset. Whilst this would work, the reduction in the frequency with which the savings can be achieved, combined with the space requirements (and allocation costs) for storage and the complexity of the comparisons, means that it's simply quicker to generate the sequence with a fast iterator.
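BrowserUk's point -- that positional generators emit the same values in different orders depending on where they sit in the base set -- can be checked with a small Python sketch (my own, not his or tilly's generator): the {B,C,D} subsets inside an A B C D run come out in a different sequence than a standalone B C D run, so a value-keyed seen-hash cannot simply skip "the rest of the subtree".

```python
from itertools import combinations

def dfs_powerset(base):
    """Depth-first powerset in the root node's ordering: each subset is
    followed by its not-yet-emitted (k-1)-subsets."""
    seen, out = set(), []

    def visit(subset):
        if not subset or subset in seen:
            return
        seen.add(subset)
        out.append(''.join(subset))
        for child in combinations(subset, len(subset) - 1):
            visit(child)

    visit(tuple(base))
    return out

standalone = dfs_powerset("BCD")
# the order in which the same seven subsets appear inside the ABCD run
embedded = [s for s in dfs_powerset("ABCD") if set(s) <= set("BCD")]
```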
by Limbic~Region (Chancellor) on Oct 04, 2006 at 22:51 UTC
You have nailed it. When I posted this, I originally thought that having each set in alphabetical order and processing the sets sorted by length in descending order would allow for this kind of optimization. In examining further the problem this post is meant to help solve, I have realized I will not be able to process the list in descending length even if it would make things work. Grrrrr!

Cheers - L~R
by tilly (Archbishop) on Oct 04, 2006 at 22:08 UTC
It generates the subsets in the exact order that the original post gave, and does so in a way in which you can easily skip all subsets of the current subset. Furthermore, it does this with bounded memory requirements and somewhat reasonable performance. However, it does not guarantee that it generates all longer sets before generating a shorter one. If you wish to guarantee that, you'll have to explicitly remember all of the skips you were asked to make (with potentially large memory requirements), and the skipping step becomes very complex. Limbic~Region will have to confirm one way or another, but I think I made the performance tradeoff that he would want.
by Limbic~Region (Chancellor) on Oct 04, 2006 at 22:56 UTC
Re: Powerset short-circuit optimization
by creamygoodness (Curate) on Nov 06, 2006 at 01:56 UTC
Greets! Here's another entry for you, Limbic~Region. I believe it meets your criterion of checking the longest strings first.
by Limbic~Region (Chancellor) on Nov 06, 2006 at 15:59 UTC
Thanks. This will give me something to chew on for a while. I actually have a pretty advanced project that I would like to use C for, but I don't currently have the skillset. Practice makes ... er, um, well, yeah.

Update: Since you asked in the CB how this performs in comparison to mine: it takes only 61% of the time to produce the same output in my very rough benchmark (wall clock on a non-idle system).

Cheers - L~R