Ultra compact set searching
by jimt (Chaplain)
on Nov 17, 2006 at 16:11 UTC
I opted to post this as a meditation rather than a SoPW since I'm just relaying back what I've done. But if somebody wants to optimize, that'd rock.
I continued to tinker an awful lot with Removing redundant powersets with minimal RAM. If you're gonna read this node, go read that one (and the one it references) for background info. My original approach involved storing a sorted version of all of the sets, from biggest set to smallest set. This required 65 million lookup checks and ran in about 2.5 minutes. If I omitted the sort, it dropped to only 45 million checks and ran in about 1.5 minutes. Go figure. I think I got lucky with the data.
But my main question was: I have several sets of sets; rapidly tell me if an arbitrary set is an element of any of those sets. Alternatively, it can be stated as: given a set of sets and an arbitrary additional set, rapidly tell me if that arbitrary set is a subset of any set in my set of sets.
I have made progress towards a solution, but incredibly ended up optimizing even further for RAM, not speed. Whoops. I'm posting the progress here in case the techniques prove useful for anyone else. There's a potentially useful code generation technique about 3/4 of the way down. I can be a bit verbose.
Okay, now let's jump right into it. Let's say this is my data:
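The original data listing didn't survive the cut and paste, so here's a hypothetical stand-in, sketched in Python rather than the Perl I was actually using. It's consistent with the rest of the post: five sets of letters, the last one being ABFG, with B in sets 1, 2, and 5 and G in sets 4 and 5.

```python
# Hypothetical stand-in data: five sets of letters, the last being ABFG.
sets = [
    {"B", "C", "D"},       # set 1
    {"A", "B", "E"},       # set 2
    {"C", "D", "E"},       # set 3
    {"E", "F", "G"},       # set 4
    {"A", "B", "F", "G"},  # set 5: ABFG, the superset of BG
]

query = {"B", "G"}

# The naive approach: linearly scan every set until a superset turns up.
def any_superset(sets, query):
    return any(query <= s for s in sets)

print(any_superset(sets, query))  # True: BG is a subset of ABFG
```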
I want to quickly determine if BG is a subset of any of those sets I have. It turns out it is: it's a subset of ABFG. But I need to run linearly through all of the sets until I get to it, and it happens to be the one at the end. Not bad in this case, with only 5 sets, but if you had, say, 5,000 involved, it gets pricey. Especially if you need to loop through multiple times.
I was tempted to try sorting. Since they're all internally integers, this is fast and easy. Except for the fact that it's flawed: you don't want to sort on the size of the integer, you want to sort on the number of elements in a set. So 01111 actually comes before 10000, since it has more elements. The goal was to put larger sets first, since they'd be more likely to contain a hit.
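That sort order is easy to sketch in Python with the sets as bit vectors: the key is the popcount (number of set bits), not the numeric value.

```python
# Sets as bit vectors; each bit marks one element's presence.
sets = [0b10000, 0b01111, 0b00110, 0b11100]

# Sort by popcount (number of elements), largest first, so 01111 (four
# elements) lands ahead of 10000 (one element) despite its smaller value.
by_size = sorted(sets, key=lambda s: bin(s).count("1"), reverse=True)

print([format(s, "05b") for s in by_size])
# → ['01111', '11100', '00110', '10000']
```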
Unfortunately, I didn't get good results with it. Sure, some of the time it helps out, but since the sets are arbitrary, you'll also end up sorting small sets to the end of the list. If those small sets contain the elements you need, you may have just increased your amount of work. Someone may have a more clever technique.
I went with a different approach, trying to get constant time. Or, at a minimum, better performance.
The tricks I employed reduce the search space. In my previous post, I explained how I'd implemented a multi-level array to store the elements based upon what they contained, so you ended up with slots:
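Here's roughly the shape of that structure, sketched in Python as a flat element-to-set-index map rather than the full multi-level array (the set data is the same hypothetical data from earlier):

```python
from collections import defaultdict

# Hypothetical data from above.
sets = [{"B", "C", "D"}, {"A", "B", "E"}, {"C", "D", "E"},
        {"E", "F", "G"}, {"A", "B", "F", "G"}]

# Slot each set's index under every element it contains; a query for BG
# then only scans the sets listed under B (or G), not all of them.
slots = defaultdict(list)
for i, s in enumerate(sets):
    for elem in s:
        slots[elem].append(i)

print(slots["B"])  # [0, 1, 4]: only three candidate sets to test
```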
But I'm never happy. I wanted faster. I wanted a way to look at the sets and figure out if anything matched. I'll segue to binary again.
This was the eureka moment. Each horizontal row contains the list of sets that each element is in. So if you take any two elements and bitwise & them together, you can see if there's a set that contains both of them. Again, I'll use the same trick as before to whittle down my search.
Now you & those two integers together. 11001 & 00011 = 00001. This is a true value (not the null character). Therefore, we know that BG is a subset of some previous set. It doesn't matter which one, and for this problem, we don't care. Existence is enough.
You, the human, can read it and see that the value is 00001, so that means it's the 5th set that is a superset, which is ABFG. You've just dropped down to needing just a single check to determine if your arbitrary set is a subset of the 5 previous ones. Naturally, this is bounded by the size of your ints (normally 32 bits, 64 if you're lucky!). Past that, I build an array of ints and check each int in turn.
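Put together, the whole technique looks something like this Python sketch (hypothetical data as before; bit i of each mask marks membership in set i+1, so the printed bits read right-to-left relative to the columns above):

```python
# Hypothetical data again; bit i of each mask marks membership in set i+1.
sets = [{"B", "C", "D"}, {"A", "B", "E"}, {"C", "D", "E"},
        {"E", "F", "G"}, {"A", "B", "F", "G"}]

# One integer per element, recording which sets contain it.
masks = {}
for i, s in enumerate(sets):
    for elem in s:
        masks[elem] = masks.get(elem, 0) | (1 << i)

# & together the masks of every element in the query: any surviving bit
# means one single set contains all of them, i.e. the query is a subset.
hit = masks["B"] & masks["G"]
print(bool(hit), bin(hit))  # True 0b10000: bit 4, i.e. set 5 (ABFG)
```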
Now, that anding was a bitch. Conceptually, it's easy to see that I need to & together two ints in this case. But the next time I may need to & together three ints. And the next time 15 ints. And so on. I didn't want to go looping over the structure and & each value together in turn, so instead I wrote up this little gizmo to generate a function to & together arbitrary elements.
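I haven't included the Perl here, but the same generate-a-function trick looks something like this in Python: build the source for the chain of &s once, eval it, and call the result. The mask values are the hypothetical ones from earlier.

```python
# Membership masks, keyed by element (hypothetical values from above).
masks = {"B": 0b10011, "G": 0b11000, "A": 0b10010}

def make_ander(elems):
    # Build the source of a function that hard-codes the chain of &s,
    # then compile it once -- no looping over the masks on each call.
    body = " & ".join(f"m[{e!r}]" for e in elems)
    scope = {}
    exec(f"def _ander(m):\n    return {body}", scope)
    return scope["_ander"]

ander = make_ander(["B", "G"])
print(bin(ander(masks)))  # 0b10000
```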
Performance is a bit twitchy, though. In benchmarking, iteration over the elements was faster with very small sets, less than 10 elements. Then this approach became faster than iteration for sets up to about 4,000 elements. Then iteration became faster again after that. I assume there's some sort of internal limit that gets hit, but dunno for sure.
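If you want to reproduce the comparison, something along these lines works (hypothetical masks; the crossover points will vary by machine and runtime):

```python
import timeit
from functools import reduce

# Eight elements with arbitrary 32-bit membership masks (hypothetical).
masks = {chr(ord("A") + i): (0x9E3779B9 >> i) & 0xFFFFFFFF for i in range(8)}
elems = list(masks)

# Plain iteration: & every mask together on each call.
def loop_and(m, es):
    return reduce(lambda acc, e: acc & m[e], es, 0xFFFFFFFF)

# Generated function: the chain of &s is hard-coded once up front.
exec("def gen_and(m):\n    return " + " & ".join(f"m[{e!r}]" for e in elems))

print(timeit.timeit(lambda: loop_and(masks, elems), number=50_000))
print(timeit.timeit(lambda: gen_and(masks), number=50_000))
```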
And the results of this long lecture? Bupkiss. Well, not what I wanted, at least. Despite all my efforts to reduce the number of checks, it didn't help.
Now, the total number of checks was greatly diminished: only 14 million vs. 45 million in the previous version (the version I'd originally written back in the original thread only requires 2.5 million), but the check was more complicated (a big & statement across multiple values vs. &ing two values).
And the average number of checks went up. My previous worst case was 833 checks, and now it's only about 120, but I need to check a lot more elements 120 times than I needed to check 833 times in the prior version. So the runtime is the same: a little over 2 minutes.
What I accidentally did was dramatically reduce the RAM. For the data set that Limbic~Region provided, my last version required storing some 32,000-odd integers (note my last post has a typo, where I referred to it as 32,000 bytes). This version? It only needs to store 3,200 integers. That's 1/10th the RAM. So I can do the full-on huge example of 10,000,000 sets with 24 elements each in only 100 megs of RAM (again, my original post was in error: it's not 240 megs, it's 960, assuming minimal int size and no other Perl funniness, which I know is inaccurate).
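For the curious, here's the back-of-envelope behind those figures, assuming a minimal 4 bytes per integer and no interpreter overhead:

```python
# Back-of-envelope: 10,000,000 sets x 24 elements, one integer per
# membership, at a minimal 4 bytes per int (no Perl overhead counted).
naive = 10_000_000 * 24 * 4
print(naive / 1e6)       # 960.0 megs: the corrected figure
print(naive / 10 / 1e6)  # 96.0 megs: 1/10th the integers, so ~100 megs
```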
I can clean up and post the code if anyone's interested in pursuing this further. I'm still going to monkey with it. I feel like it's very close to a breakthrough, but haven't figured it out yet.