Re^2: Powerset short-circuit optimization

in reply to Re: Powerset short-circuit optimization
in thread Powerset short-circuit optimization

Unless I've been wasting my time pursuing the wrong goal (which is quite possible), this isn't going to achieve the original aim. I'll try to explain, but sorry if I miss the mark.

The original post said:

Using another example:
Set1: A,B,C,D Set2: A,B,C,E set3: A,B,C,F,G
[download]

They share sets ABC, AB, AC, BC, A, B, and C so why generate them three times?

Which suggests the idea is to avoid regenerating those subsets that are a part of a second powerset, if you've already generated them for a previous powerset. To that end, you might use your code to generate the first powerset completely (metaphorically answering 'no' to the prompt), and then save a stringified version of each subset generated in a hash. $seen{ "@subset" } = 1

Then, when generating a second (or subsequent) powerset, each time you get back a subset, you look it up in the hash and use it's existance or otherwise to decide whether to skip the rest of that subset tree.

skip() if exists $seen{ "@subset" }

The problem is, for this to work, the code has to produce the same (number and order), subset sequence from any given starting point, but your code does not do this(*).

Running your code on B C D and answering alternately 'yes' and 'no' at the top level shows what should be skipped when answering yes to that subset:

C:\test>tilly-powersets B C D           C:\test>tilly-powersets B C D
BCD:    skip? n                         BCD:    skip? y
BC:     skip? n                         Done
B:      skip? n
C:      skip? n
BD:     skip? n
D:      skip? n
CD:     skip? n
Done
[download]

Now, running it on the set A B C D, we should be able to skip the generation of the 6 subsets [BC], [B], [C], [BD], [D], [CD], but when we run the code and answer 'yes' each time we see a sequence we saw during the generation of the B C D powerset, we still end up generating the same number of outputs:

>tilly-powersets A B C D     >tilly-powersets A B C D
ABCD:   skip? n                     ABCD:   skip? n
ABC:    skip? n                     ABC:    skip? n
AB:     skip? n                     AB:     skip? n
A:      skip? n                     A:      skip? n
B:      skip? n                     B:      skip? y
AC:     skip? n                     AC:     skip? n
C:      skip? n                     C:      skip? y
BC:     skip? n                     BC:     skip? y
ABD:    skip? n                     ABD:    skip? n
AD:     skip? n                     AD:     skip? n
D:      skip? n                     D:      skip? y
BD:     skip? n                     BD:     skip? y
ACD:    skip? n                     ACD:    skip? n
CD:     skip? n                     CD:     skip? y
BCD:    skip? n                     BCD:    skip? y
Done                                Done
[download]

This is because when generating a subtree that contains identical characters, but that is located at a different position in the base set, the order of generation is different, so the skipahead has a different effect. Occasionally, this will result in some saving, but often, as in the case above, none at all, and very rarely (from my analysis), will it achieve the full potential saving.

(*). I've been rapidly reaching the conclusion in my own attempts that there isn't any way to do this.

I've come up with 4 different ways of producing the powersets, each of which produce a different ordering, but none of them allow the potential saving to be achieved because the orderings are dependant upon the positions of the data in the set, not the values in those positions. Ie. * A B C * will never produced the same subtree as A B C * or * * A B C.

Basically, in every generator I've seen or come up with, the values of the data are entirely transparent to the algorithm, as they manipulate positions in the set only. Which means any comparison/decisions about skipping based upon the values of the data are bound to fail.

The only approach I've had any success with is to store not only the subset produced, but also their offset in the original base set (which is messy and complicated), and then only skip if you encounter both situations. The same data at the same offset. Whilst this would work, the reduction in the frequency with which the savings can be achieved, combined with the space requirements (and allocation costs) for storage, and the complexity of the comparisons, means that it's simply quicker to generate the sequence with a fast iterator.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

In Section Seekers of Perl Wisdom