in reply to Powerset short-circuit optimization

Sorry for taking so long to do this. I saw how to do it fairly quickly but didn't get time to write it up.

Fun problem. Figuring out how my solution works is also fun. ;-)

Incidentally the fact that you didn't want the empty set made life easier - I could just use the empty set as my "end of data" metadata marker. If I didn't have that, then I'd have had to use a less convenient API. (I'd have probably just returned array references instead.)

Note that I suspect a naive approach such as the ones above is actually faster.

#! /usr/bin/perl -w use strict; my $iterator = Set::IterateAndSkip->new( @ARGV ? @ARGV : 'A'..'D' ); while (my @subset = $iterator->next) { print @subset, ":\tskip? "; if (<STDIN> =~ /y/i) { $iterator->skip_subsets; } } print "Done\n"; package Set::IterateAndSkip; sub new { my $class = shift; my $set = [map {[1, $_]} @_]; $set->[0][0] = 0; return bless { set => $set, }, $class; } sub end_of_first_run { my $self = shift; my $set = $self->{set}; my $n = -1; for (@$set) { if ($_->[0]) { $n++; } else { last; } } return $n; } sub skip_subsets { my $self = shift; $self->{set}[0][0] = 0; } sub next { my $self = shift; my $set = $self->{set}; my $end = $self->end_of_first_run; if (-1 < $end) { $set->[$end][0] = 0; } else { $set->[0][0] = 1; $end = $self->end_of_first_run; if ($end == $#$set) { if ($self->{started}) { # Reinitialize in case we want to cycle through again. delete $self->{started}; $set->[0][0] = -1; return (); } else { $self->{started} = 1; } } else { $set->[$end][0] = 0; $set->[$end + 1][0] = 1; } } my @result = map {$_->[0] ? $_->[1] : ()} @$set; # Need to skip the empty set. return @result ? @result : $self->next(); }

Replies are listed 'Best First'.
Re^2: Powerset short-circuit optimization
by BrowserUk (Pope) on Oct 04, 2006 at 20:46 UTC

    Unless I've been wasting my time pursuing the wrong goal (which is quite possible), this isn't going to achieve the original aim. I'll try to explain, but sorry if I miss the mark.

    The original post said:

    Using another example:

    Set1: A,B,C,D Set2: A,B,C,E set3: A,B,C,F,G

    They share sets ABC, AB, AC, BC, A, B, and C so why generate them three times?

    Which suggests the idea is to avoid regenerating those subsets that are a part of a second powerset, if you've already generated them for a previous powerset. To that end, you might use your code to generate the first powerset completely (metaphorically answering 'no' to the prompt), and then save a stringified version of each subset generated in a hash. $seen{ "@subset" } = 1

    Then, when generating a second (or subsequent) powerset, each time you get back a subset, you look it up in the hash and use it's existance or otherwise to decide whether to skip the rest of that subset tree.

    skip() if exists $seen{ "@subset" }

    The problem is, for this to work, the code has to produce the same (number and order), subset sequence from any given starting point, but your code does not do this(*).

    Running your code on B C D and answering alternately 'yes' and 'no' at the top level shows what should be skipped when answering yes to that subset:

    C:\test>tilly-powersets B C D C:\test>tilly-powersets B C D BCD: skip? n BCD: skip? y BC: skip? n Done B: skip? n C: skip? n BD: skip? n D: skip? n CD: skip? n Done

    Now, running it on the set A B C D, we should be able to skip the generation of the 6 subsets [BC], [B], [C], [BD], [D], [CD], but when we run the code and answer 'yes' each time we see a sequence we saw during the generation of the B C D powerset, we still end up generating the same number of outputs:

    >tilly-powersets A B C D >tilly-powersets A B C D ABCD: skip? n ABCD: skip? n ABC: skip? n ABC: skip? n AB: skip? n AB: skip? n A: skip? n A: skip? n B: skip? n B: skip? y AC: skip? n AC: skip? n C: skip? n C: skip? y BC: skip? n BC: skip? y ABD: skip? n ABD: skip? n AD: skip? n AD: skip? n D: skip? n D: skip? y BD: skip? n BD: skip? y ACD: skip? n ACD: skip? n CD: skip? n CD: skip? y BCD: skip? n BCD: skip? y Done Done

    This is because when generating a subtree that contains identical characters, but that is located at a different position in the base set, the order of generation is different, so the skipahead has a different effect. Occasionally, this will result in some saving, but often, as in the case above, none at all, and very rarely (from my analysis), will it achieve the full potential saving.

    (*). I've been rapidly reaching the conclusion in my own attempts that there isn't any way to do this.

    I've come up with 4 different ways of producing the powersets, each of which produce a different ordering, but none of them allow the potential saving to be achieved because the orderings are dependant upon the positions of the data in the set, not the values in those positions. Ie. * A B C * will never produced the same subtree as A B C * or * * A B C.

    Basically, in every generator I've seen or come up with, the values of the data are entirely transparent to the algorithm, as they manipulate positions in the set only. Which means any comparison/decisions about skipping based upon the values of the data are bound to fail.

    The only approach I've had any success with is to store not only the subset produced, but also their offset in the original base set (which is messy and complicated), and then only skip if you encounter both situations. The same data at the same offset. Whilst this would work, the reduction in the frequency with which the savings can be achieved, combined with the space requirements (and allocation costs) for storage, and the complexity of the comparisons, means that it's simply quicker to generate the sequence with a fast iterator.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      You have nailed it. When I posted this, I originally thought that having each set to be processed in alphabetical sorting the sets by length in descending order would allow for this kind of optimization.

      In examining the problem this post is meant to help solve further, I have realized I will not be able to process the list in descending length even if it would make things work.


      Cheers - L~R

      Let me say what my code does, and what it doesn't do.

      It generates the subsets in the exact order that the original post gave, and does so in a way in which you can easily skip all subsets of the current subset. Furthermore it does this with bounded memory requirements, and with somewhat reasonable performance. However it does not guarantee that it generates all longer sets before generating a short set.

      If you wish to guarantee that it generates all longer sets before generating a short set, then you'll have to explicitly remember all of the skips you were asked to make (with potentially large memory requirements to do that), and the skipping step is going to become very complex. Limbic-Region will have to confirm one way or another, but I think I made the performance tradeoff that he would want.

        I was hoping to guarantee all the longer sets were processed first by ordering them according to length. Unfortunately I have discovered that this will not be possible in the larger problem space this is to help solve.

        I am very happy to have your code though even if I can't use it this time. I am sure it will come in handy in the future or at a minimum - I will learn a lot from it.

        Cheers - L~R