http://www.perlmonks.org?node_id=580625


in reply to Re: Powerset short-circuit optimization
in thread Powerset short-circuit optimization

jimt,
I admittedly did a poor job of explaining the nuances that make this problem difficult. Since you went to all the trouble of working on it, I am going to attempt to clarify, though you may have already nailed it.

You have N files that have to be processed in a specific order. Each line in each file is a set which needs to have the powerset (minus the empty set) generated for it. Any set that has been previously seen anywhere in any file can be skipped. While the sets inside a given file can be pre-ordered in any way that would assist in optimizations, the files themselves have to be processed in a specific order.

File 1 is a list of our base sets. File 2 is a list of some base sets combined "two at a time". File 3 is a list of some base sets combined "three at a time". This process is repeated for N files. Ultimately, the powerset of the longest string in the last file will encompass every set previously seen, so it should produce only a single new set (itself).

File 1    Output
  ABC     A, B, C, AB, AC, BC, ABC
  ABD     D, AD, BD, ABD
  AB      (no new sets)
  BC      (no new sets)
  E       E
File 2
  ABCE    AE, BE, CE, ABE, ACE, BCE, ABCE
File 3
  ABCDE   CD, DE, ACD, ADE, BCD, BDE, CDE, ABCD, ABDE, ACDE, BCDE, ABCDE

I hope this complete example makes it clear what I am trying to accomplish. If your code can do what I describe above, please adapt your example accordingly.

Cheers - L~R

Replies are listed 'Best First'.
Re^3: Powerset short-circuit optimization
by jimt (Chaplain) on Oct 25, 2006 at 21:27 UTC

    This is a friggin' awesome problem. Lots of fun working on it. :-) But I think a solution is at hand.

    My previous solution assumed that there was some sort of external criterion that would determine whether to skip generation of the powerset. Extensive private conversation with Limbic~Region, plus his restatement of the problem, made it clear that that's not the case. Existence is the condition, AND we need to worry about multiple sets and god knows what else. This will take time to explain.

    Update: Fixed a minor logic glitch that could cause dupes with specially crafted sets. This code should now be complete.

    Update 2: Changed to pack as a little endian long (V) instead of just a long (L) to make cross platform compatible.

    Update 3: Okay, I'm man enough to admit that all of my slick binary logic actually introduced a horrible inefficiency (looping to see if a set should be skipped). Fixed with a hash lookup instead. Duh.

    Update 4: I think I've squeezed as much raw power out of this as I'm going to get. The newest version here requires the set elements to be known in advance, and it doesn't print out the null set. Otherwise? It does work. Comparing against Limbic~Region's code below, mine is indeed still slower. Dang.

      jimt,
      I too couldn't leave well enough alone. After realizing an embarrassing oversight (pointed out by ambrus), I decided to try my hand at a perl + Inline::C version. It finishes the 3_477 records in 1 second flat with constant memory consumption (less than 65MB). Furthermore, here is how it stacks up generating the 67_108_863 lines of output from 9_448_847 lines of input (see How many words does it take? for more details):
      • Pure perl ver 1 = 343 min
      • Pure perl ver 2 = 288 min
      • Pure perl ver 3 = 251 min
      • Java = 66 min
      • Perl + Inline::C = 3 min

      Update: I had solicited feedback on my poor C in the CB and was told that, despite it being a translation of the Perl above, documentation would go a long way toward getting the desired comments.

      Cheers - L~R

        Hi, L~R...

        First the good news: the algorithm seems fine to me, and there's only one language-specific gotcha I found that could have caused you problems. That's very impressive for someone with next-to-no C experience.

        Now the critique: This is hard code to review. The explanation helps, but there are a lot of things that get in the way.

        • Code without any comments at all is lame. Lame, lame, lame. Especially when it A) is dense, and B) implements an unusual algorithm.
        • Those initializations wrap in every browser I checked. Do they wrap in yours? Looks like $h$_ amigo! Do you care?
        • While single line for/while/if blocks are a common Perl idiom, they are rare in C code.
        • If the variable has nothing to do with a "bit", don't call it bit. Yes, I know now that's an artifact of having translated Perl code that used a bit vector. But coming in fresh, I didn't, and it threw me off.
        • 4 out of 5 coders surveyed recommend naming variables with meaningful monikers rather than nebulosities like "val".
        • There are no blank lines to break up that big C function into chunks. It would be much easier to digest if it were in paragraphs -- ideally, commented paragraphs, as Damian recommends in Perl Best Practices.

        With all those impediments, the only way this ordinary human was able to get through that and follow it from start to finish was to rewrite it myself. In so doing, I'm happy to report the only thing I found that was really off was this:

        memset(seen, '0', sizeof(seen));
        /* ... snip ... */
        if (seen[bit] == '1') { return; }
        seen[bit] = '1';

        In C, '0' != 0 and '1' != 1. Those single quotes yield the integer value of the enclosed character, which for '0' on an ASCII machine is not 0 but 48. :) You got away with it because you were consistent, though. :) That memset() should look like this:

        memset(seen, 0, sizeof(seen));
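        A quick standalone check of that point (my own snippet, not part of jimt's code) shows the character literal's integer value and the conventional way to zero the array:

        ```c
        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
            /* A character literal is an integer holding the character's
               code: on an ASCII machine '0' is 48, not 0. */
            printf("'0' == %d\n", '0');              /* prints '0' == 48 */
            printf("'1' - '0' == %d\n", '1' - '0');  /* digits are consecutive */

            char seen[4];
            /* memset(seen, '0', sizeof(seen)) would fill every byte with 48.
               Clearing the array uses the integer 0: */
            memset(seen, 0, sizeof(seen));
            printf("seen[0] == %d\n", seen[0]);      /* prints seen[0] == 0 */
            return 0;
        }
        ```

        Filling with '0' still worked in the original only because both the memset() and the later comparisons used the same character consistently.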
        --
        Marvin Humphrey
        Rectangular Research ― http://www.rectangular.com
      jimt,
      I am sad to tell you that, despite being an iterative version that need not be called more than necessary, your code is terribly slow. I timed (not Benchmarked) 4 versions, and unfortunately this wasn't even a contender:
      • java recursive 1 = 12 seconds
      • perl recursive 1 = 13 seconds
      • perl recursive 2 = 28 seconds
      • perl iterative 1 = 6_190 seconds (only 25% done, getting slower, and producing duplicates)

      The code to generate the 3_477 line data file and the recursive java version can be found at How many words does it take?. The two recursive perl versions are below:

      I made minor modifications to your code to handle my dataset as well as produce comparable output:

      Update 1: After your 3rd update, your code finished in a respectable 78 seconds. Unfortunately, it is still producing about 40% duplicates. Additionally, it doesn't produce the correct output (missing 92_835 strings out of 508_062). For instance, 'cdglnst' does not appear at all in your output.

      Update 2: After your 4th update, your code narrowly makes 3rd place with 26 seconds and correct output! I included the entire perl script I am using above to ensure we are comparing apples to apples. Admittedly, yours does scale much better with both speed and memory. Unfortunately, it still isn't quite up to the task I needed. I will have to put this in my back pocket for later though.

      Cheers - L~R