perlmeditation
Abigail-II
<h3>Introduction</h3>
It's commonly known that a hash is an appropriate datastructure
if you want to store and modify a set of strings, and you want to
perform <em>matching queries</em>, that is, quickly answer the
question whether something is an element of the stored set.
<p>
There are some drawbacks about hashes. Accessing a key takes time,
as we need to calculate the hash function. Furthermore, many elements
could hash to the same address, which causes chains, and which slow
down queries (and deletes). Furthermore, 'clearing' a hash takes
time linear in the size of the stored set. (Some of the problems have
been fixed in recent perls, but with some overhead when inserting
elements). Implementation is also non-trivial.
<p>
If we know something about the set we are storing, we can sometimes
use datastructures that are faster. For instance, suppose we know that
all possible elements of the set are non-negative integers smaller
than a fixed number <tt>U</tt>. We can then build an array (or a
<tt>vec</tt> string) with <tt>U</tt> elements, say <tt>A</tt>, and
we let <tt>i</tt> be a member of our set if and only if <code>A [i] == 1</code>.
Otherwise, <code>A [i] == 0</code>. Using simple code, we have something like:
<code>
use constant U => ....;
sub init {@A = (0) x U}
sub insert {$A [shift] = 1}
sub delete {$A [shift] = 0}
sub query {$A [shift]}
sub clear {@A = (0) x U}
</code>
It's easy to see that inserting, deleting and querying all take
<tt>O (1)</tt> time. Unfortunally initializing or clearing the structure
takes time <tt>&Theta (U)</tt>.
<h3>A special datastructure</h3>
But we can do better. Consider the following arrays <tt>A</tt>
and <tt>B</tt>, and an integer <tt>C</tt>:
<code>
A B
+------+ +------+
U-1 | ? | | ? |
+------+ +------+
. . . .
. . . .
. . . .
. . . .
. . . .
+------+ +------+
4 | 0 | | 3 |
+------+ +------+
3 | 4 | | 1 | <---- C (== 3)
+------+ +------+
2 | 1 | | 1 |
+------+ +------+
1 | 2 | | 2 |
+------+ +------+
0 | 2 | | 4 |
+------+ +------+
</code>
Now, a number <tt>n</tt> is in our query set, if and only if,
all of the following conditions hold:
<ul>
<li><code>0 <= A [n] < C</code>
<li><code>B [A [n]] == n</code>
</ul>
So, the above set contains <tt>1, 2 and 4</tt>. It doesn't contain
<tt>0</tt> because <code>A [0] == 2</code>, and <code>B [2] == 1</code>.
It doesn't contain <tt>3</tt>, because <code>A [3] >= C</code>.
<p>
Now, how do we insert? Suppose we want to insert a number <tt>i</tt>.
If it's already in the set (that is, the conditions mentioned above
are fullfilled), nothing needs to be done. Otherwise, we store <tt>i</tt>
in <code>B [C]</code>, we store <tt>C</tt> in <code>A [i]</code>, and
we increment <tt>C</tt>.
<p>
Deletion is a bit trickerier, but still straightforward. Suppose we
want to delete <tt>i</tt>. We do that by moving <code>B [C - 1]</code>
to <code>B [A [i]]</code>, as follows: first <code>B [A [i]] = B [C - 1]</code>,
then <code>A [B [C - 1]] = A [i]</code>. And finally we decrement C.
<p>
In code:
<code>
int query (int i) {0 <= A [i] && A [i] < C && B [A [i]] == i}
void insert (int i) {if (!query (i)) {B [C] = i; A [i] = C; C ++}}
void delete (int i) {if (query (i)) {
B [A [ i]] = B [C - 1];
A [B [ C - 1]] = A [i];
C --}}
</code>
Clearing the set is simple, all one needs to do is to set <tt>C</tt> to 0.
<p>
So, we can do querying, inserting, deleting, and clearing in
<tt>O (1)</tt> time. Not bad. But what about initializing?
Well, <em>that can be done in constant time</em> as well. At least, in
low level languages. The key is that we start off with <code>C</code>
being 0, so no matter what is already in <tt>A</tt> or <tt>B</tt>, it
never causes elements to be present. <em>So, we do <strong>not</strong>
need to initialize the arrays</em>. As long as we can claim memory of
appropriate size.
<p>
Hence, we can do all five operation in constant time - as long as the
programming language doesn't insist on initializing.
<h3>Challenge</h3>
Can we create a similar datastructure in Perl? XS of course doesn't count.
<p>
Abigail