|laziness, impatience, and hubris|
RFC: The Lazy Manager's Calendar with Inversion Listsby moritz (Cardinal)
|on Jun 07, 2011 at 12:03 UTC||Need Help??|
A nice tool in the toolbox of my data structures and algorithms are inversion lists or skip lists. They handle sets with ranges of integers, but can easily be expanded to ranges of anything sortable, for example dates or strings.
Example: The Lazy Manager's Calendar
The peers of the Lazy Manager have problems getting appointments with the Lazy Manager, because of his extensive vacations. So he needs a calendar application. But since he has so many days of vacation, storing a boolean value for each day is a waste of resources.
Instead it is enough to store the start of each period of vacation, and the start of each period of work again.
Since dates are a bit less simple to deal with, we will encode each day as a positive integer, for example counting from the Lazy Manager's first day of work.
Instead of 30 numbers we only need to store 8 - two for each period of vacation. Each array element with even index is the start of a vacation period, those with odd indexes store starts of work periods.
The Stressed Programmer wants a bigger machine for developing, but the Lazy Manager needs to sign it off, so the Stressed Programmer needs to apply for a meeting with Lazy Manager. So he needs to find out if possible meeting dates fall into the Lazy Manager's vacation or not.
So he searches in the skip list for the largest integer equal to or smaller than the day in question, and if the index of the found number is even, it's a vacation day.
Of course that's not the most efficient way to search the array @vacation_skip. Since the array is sorted, a binary search works much faster for longer arrays. But I'm not going to show one, because Set::IntSpan::Fast does that already, and actually all the other operations mentioned here (see the source of Set::IntSpan::Fast::PP for the actual implementation).
Detection of Overlaps
The company wants to organize a conference, and it should be three days in which the Lazy Manager is not on vacation. When probing possible dates for the conference, one could check every day of the three-day span, but there is a more efficient way:
For the start and the ende of the range that we want to probe, we search for the closest number equal to or smaller than the start and end position each. If both these lookups find the same index in the inversion list, the whole probe range is completely inside a vacation interval if the index is even, and completely inside a non-vacation interval if the index is odd.
If the two lookups produce different indexes, the probe interval covers both vacation and non-vacation, and thus overlaps with a vacation period (under the assumption that no empty ranges are in the skip list).
The Lazy Manager has a proxy who needs to be available whenever the Lazy Manager is on vacation, and in turn can take vacations during Lazy Manager's work days.
The proxy's vacation table is just the inverse of the Lazy Manager's vacation table. The easiest way to invert a skip list is to add a leading zero:
All even indexes in @vacation_skip end up being odd indexes in @proxy_vacations, and the other way round.
On a second inversion it is helpful to remove to the two leading zeros, so that the arrays don't grow needlessly.
The ease of inverting such a list has lead to the name inversion lists, and makes them well suitable for implementing character classes in regular expressions. (For ASCII or Latin-1 a simple bitmap is usually faster, but for the full Unicode range that might take up too much space).
Adding A New Range
The Lazy Manager decides to take another vacation, from day 19 to day 25.
First we need to find out if the new range overlaps with existing vacation ranges.
If there is no overlap, the new start and end markers just needs to be inserted in the right position to preserve the ordering of the list. If the new range is completely inside an existing vacation range, nothing needs to be done.
In the example discussed here, the end of the new range falls into an existing range:
One possible way to add thew new range is to expand the existing range at the end point (skip list entries 25 and 30) to the new range (skip list entries 19 and 30), and delete all skip list entries before the new starting point that are higher than the new starting point.
Unions and Intersections
The union of two vacation calendars shows days where either of the owners is on vacation.
Calculating the union of two inversion lists is quite simple: iterate over the ranges in one of the lists, and add each range to the other one.
By DeMorgan's law, you can calculate the intersection as intersection(a, b) = invert(union(invert(a), invert(b))). Since inversion is such a cheap operation on inversion lists, implementing intersection in terms of union and inversion is usually a good solution.
Variations on Inversion Lists
There are other applications for structures similar to inversions lists, all have in common that ranges are only stored by their start- and end points.
Here are some ideas for inspiration:
Picking items with defined probability
Suppose you want to randomly pick some items from a list, but not all with the same probability -- each item has a weight, and the probability for picking one item should be its weight, normalized to the total of all weights.
One efficient approach is to create a second list of running sums over the weights:
For randomly picking an element, we just need a random number between zero and the sum of the weights (conveniently available in $running_sum), and then find the index of the smallest element in @summed_weights which is equal to or larger than the random number. Again this can be done efficiently with a binary search.
Finding the Number of Overlapping Ranges
About a year ago a fellow monk asked for a fast way to calculate how often multiple ranges overlap. Again something similar to inversion lists can help. Of course overlaps can't be removed during storage, all start points and end points need to be stored.
So the even/odd distinction for the index can't tell us anymore if a given number is a start or end of a range. If all numbers are positive, one can for example make all end points negative (and sort by absolute value). If negative values can appear in regular input, some other markers must be used, for example a second array or a bit map.
If you have any ideas for improving this Tutorial, please let me know.