|Problems? Is your data what you think it is?|
Removing redundant powersets with minimal RAMby jimt (Chaplain)
|on Nov 03, 2006 at 16:31 UTC||Need Help??|
jimt has asked for the
wisdom of the Perl Monks concerning the following question:
So I've been playing around a lot with Powerset short-circuit optimization and trying to come up with a memory efficient solution. To succinctly state the original problem, given a set of sets, generate the powerset for all sets but remove all duplicates.
Generate the powerset of them (the set of all possible subsets), but eliminate duplicates. So "ABC" is a subset of "ABCD" and "ABCE", but we only want to print it out once. "A" is a subset of "ABCD", "ABCE", and "ABD", but we only want to print it out once, etc.
I'd posted a solution over in that node, but far from the best one. The going best solution stores the powerset in memory, using 64 megs if each set is stored in a byte, or only 8 megabytes if each set is stored as a bit. You march along your list and keep a note of each set you generate. It's fast and a good solution.
Now, I'm not knocking it at all. It's a great solution. But I'm looking at this in the context of a bigger problem, which rapidly scales beyond available ram. Sure, with a 26 element set, you only need 64 megs (I'll make the argument with the byte based approach instead of the bits, but it's the same thing, of course). But 27 elements requires 128 megs. 28 needs 256. 29 needs 512. 32 needs a gig. And so on, a 36 element set would require 16 gigs of memory, which is probably not realistic, though, yes, you can store it and do your lookups on disk. A 36 element set is not unreasonable, since that's a-z + 0-9. We're going to gloss over the fact that a 36 element set would require more than 32 bits to store, so most machines couldn't stuff it into a single integer.
Anyway, so I've been trying to optimize it for memory consumption. Internally, all of these sets are represented as integers, but for clarity's sake in the argument, I'll just type them out as strings. Much easier to read and understand "ABCD" and "AB" than "15" and "9", for instance.
Let's start off with this input file:
My eureka moment was that instead of storing the powerset for each set encountered (pow(ABCD), pow(ABE), pow(ABC), etc.), we could just store the original set, and then check all subsequent sets to see if they're subsets of it.
Now, the problem is, of course, this takes a hell of a lot more work. instead of doing a single check at each set ("Have I seen this set before?") it needs to check and see if the set we're encountering is a subset of any previously seen set. 14 checks in this case. Ouch.
Now, remember, these sets are all stored as integers, so the checks are very very fast, its just a bitwise & and an equality, but doing 14 things quickly still loses out to doing 1 thing quickly. It also doesn't scale. As you encounter more sets, you do more and more and more checks until your CPU catches fire.
I've optimized this in two ways - one is to store the list of sets in sorted numerical order (ints internally, again). So the list would be: (EF, ABE, ABCD) (assuming higher letters are bigger bits). When doing the checks, it stops looking once it encounters a set smaller than itself.
And it bows out much faster, only needing to do 2 checks instead of 3. A B tree or any fancier structure is of dubious utility at this point, since a set could potentially be a subset of any previously encountered one. Order doesn't matter, just "bigger than me" (check it) and "less than me" (we're done)
The final optimization I added was to redundantly store the previously seen sets in separate arrays for each element.
If a set contains a particular element, it goes into that slot. Then, when we encounter set ABF on our next pass, it would need to compare against all elements in the "A" slot, the "B" slot, and the "F" slot.
If we can figure out if there's an element that exists in all three of those sets, then it solves the equation, we know that there exists some set that's a super set, and we're done. For example, if we next encountered "AB", we'd look at the "A" slot and the "B" slot, and we could see that "ABE" (and "ABCD") exists in both slots, therefore, there is a superset, therefore we can stop. This'd be awesome, but unfortunately the only solutions I've come up with effectively require storing the powerset in memory, which negate the original purpose. If I'm going to store the powerset, I'd use one of the earlier proposed approaches that would be faster.
So I don't have an efficient way to do that. To state clearly, this version of the problem is: I have several sets of sets. Rapidly tell me if an arbitrary set is an element of each of those sets. This is effectively the Longest Common Subsequence problem, which is NP Complete.
Instead, I look at the smallest set in the list, and do my iterative checks over that. For "ABF", I only need to compare against the "F" slot, and a single element ("EF). I do this since I could compare against the elements in the "A" slot, the "B" slot, and the "F" slot, but there are the fewest elements in the "F" slot, so I look there. Since "ABF" is not a subset of "EF", we're done. No further checking required.
This works since the only way a solution would exist is if its in all 3 arrays anyway. So we can just check the smallest one to do as little work as possible. We still bow out once we encounter a set smaller than we are, since they're still in sorted order internally.
And here's where I'm stuck. If I could rapidly determine if there exists an element that is in all of the subarrays chosen (in the A, B, and F slots, in this case), I'd be done. But the only approaches I've come up with to do this involve linearly scanning all the lists to build up a new data structure. But this is actually worse, since doing the check at each level only requires a linear scan of the smallest set. A scan through the "F" slot to determine if any element is a superset is faster than counting all the elements in the "A", "B", and "F" slots. This becomes especially clear if the sets are greatly imbalances. Say that there are 4,500 elements in the "A" slot, but only 1 in the "F" slot.
If there exists a reasonably fast way to do this (or, at least one that doesn't get worse as more sets are added), the gains would be tremendous. I'd be thrilled to death with something that's slower for smaller sets but scales well to larger sets. If I take .01 seconds per check, but always take .01 seconds per check, I win. Using the original "store the powerset" approach would require 16 gigs of memory for a 36 element set. Using this approach, you only are required to store (roughly):
I just need a clever way to do the search to reduce the number of checks for the cases where they don't match or there are large sets involved.