P is for Practical PerlMonks

### Data structure challenge

by Abigail-II (Bishop)
 on Mar 17, 2004 at 17:32 UTC

### Introduction

It's commonly known that a hash is an appropriate data structure if you want to store and modify a set of strings, and you want to perform membership queries, that is, quickly answer the question whether something is an element of the stored set.

There are some drawbacks to hashes. Accessing a key takes time, as the hash function needs to be calculated. Furthermore, many elements can hash to the same address, which causes chains, and chains slow down queries (and deletes). Furthermore, 'clearing' a hash takes time linear in the size of the stored set. (Some of these problems have been fixed in recent perls, but at the cost of some overhead when inserting elements.) Implementation is also non-trivial.

If we know something about the set we are storing, we can sometimes use data structures that are faster. For instance, suppose we know that all possible elements of the set are non-negative integers smaller than a fixed number U. We can then build an array (or a vec string) A with U elements, and let i be a member of our set if and only if A[i] == 1; otherwise, A[i] == 0. Using simple code, we have something like:

```
use constant U => ....;

sub init   { @A = (0) x U }
sub insert { $A[shift] = 1 }
sub delete { $A[shift] = 0 }
sub query  { $A[shift] }
sub clear  { @A = (0) x U }
```

It's easy to see that inserting, deleting and querying all take O(1) time. Unfortunately, initializing or clearing the structure takes Θ(U) time.

### A special datastructure

But we can do better. Consider the following arrays A and B, and an integer C:
```
          A                B

       +------+        +------+
 U-1   |  ?   |        |  ?   |
       +------+        +------+
          .                .
          .                .
          .                .
       +------+        +------+
   4   |  0   |        |  3   |
       +------+        +------+
   3   |  4   |        |  1   |  <---- C (== 3)
       +------+        +------+
   2   |  1   |        |  1   |
       +------+        +------+
   1   |  2   |        |  2   |
       +------+        +------+
   0   |  2   |        |  4   |
       +------+        +------+
```

Now, a number n is in our set if and only if both of the following conditions hold:

• 0 <= A[n] < C
• B[A[n]] == n

So, the above set contains 1, 2 and 4. It doesn't contain 0, because A[0] == 2 and B[2] == 1. It doesn't contain 3, because A[3] >= C.

Now, how do we insert? Suppose we want to insert a number i. If it's already in the set (that is, the conditions mentioned above are fulfilled), nothing needs to be done. Otherwise, we store i in B[C], we store C in A[i], and we increment C.

Deletion is a bit trickier, but still straightforward. Suppose we want to delete i. We do that by moving B[C - 1] to B[A[i]], as follows: first B[A[i]] = B[C - 1], then A[B[C - 1]] = A[i]. And finally we decrement C.

In code:

```
int  query  (int i) {return 0 <= A[i] && A[i] < C && B[A[i]] == i;}
void insert (int i) {if (!query (i)) {B[C] = i; A[i] = C; C++;}}
void delete (int i) {if (query (i))  {B[A[i]]     = B[C - 1];
                                      A[B[C - 1]] = A[i];
                                      C--;}}
```

Clearing the set is simple: all one needs to do is set C to 0.

So, we can do querying, inserting, deleting, and clearing in O(1) time. Not bad. But what about initializing? Well, that can be done in constant time as well - at least, in low level languages. The key is that we start off with C being 0, so no matter what is already in A or B, it never causes elements to be present. So, we do not need to initialize the arrays at all, as long as we can claim memory of the appropriate size.
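To make this concrete, here is a minimal sketch in C of the structure described above. The universe size U = 1000 is an arbitrary choice for illustration, and the function is named del (rather than delete) only so the name stays usable from C++ as well; malloc, unlike calloc, makes no promise about the contents of the memory it returns, which is exactly what the trick exploits. (Strictly speaking, reading those indeterminate bytes is unspecified in C, but on a conventional platform it behaves as described.)

```c
#include <stdlib.h>

#define U 1000          /* universe size: the set stores integers in [0, U) */

static int *A, *B;      /* the two arrays from the article */
static int  C = 0;      /* number of elements currently in the set */

/* O(1) "initialization": malloc does not zero the memory it hands out,
 * and the structure is correct no matter what junk A and B contain. */
void init (void)
{
    A = malloc (U * sizeof *A);
    B = malloc (U * sizeof *B);
    C = 0;
}

int  query  (int i) {return 0 <= A[i] && A[i] < C && B[A[i]] == i;}
void insert (int i) {if (!query (i)) {B[C] = i; A[i] = C; C++;}}
void del    (int i) {if (query (i))  {B[A[i]]     = B[C - 1];   /* move last element down */
                                      A[B[C - 1]] = A[i];       /* fix its index in A     */
                                      C--;}}
void clear  (void)  {C = 0;}        /* O(1) clear: just forget everything */
```

Querying an element that was never inserted reads whatever junk happens to sit in A[i]; either the bounds test 0 <= A[i] < C or the cross-check against B rejects it, which is why no initialization is needed.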

Hence, we can do all five operations in constant time - as long as the programming language doesn't insist on initializing.

### Challenge

Can we create a similar data structure in Perl? XS of course doesn't count.

Abigail

Re: Data structure challenge
by flyingmoose (Priest) on Mar 17, 2004 at 18:26 UTC
Seems interesting, but you kind of lost me. I'm easy to lose sometimes, so don't take this as a knock on your post. IMHO, this would be more understandable if you used more terms found in the data structure references. Initially, your datastructure looks like it's just pigeon-holing data (unrelated, but see bucket sort), but your example of the 'special' structure is a little confusing. Terms like stack, queue, dequeue, trie, red-black-tree, list, etc -- are all there to facilitate easier communication. Anyhow, as it stands, you've lost me. Does this 'special structure' have a name? What are you trying to emulate? I can sort of grok the intent, but an explanation of the workings and principle is probably in order.

As it stands, all classical operations being O(1) is only doable when your datastructure is as large as the domain space, right? In many cases, this means working on small data sets or with huge amounts of data. Result? It's not used very frequently. I understand this is just a question to see if it's possible, but where would this be used and is there a space to fit it? If one datastructure fit everywhere, we wouldn't have the cool montage of data-structures, of course, so I guess I've made your point :)

Also, the cost of a hash function that maps to a completely unique value (as noted in my previous paragraph) is expensive both in time and space. Maybe Perl is cool and doesn't make an array as wide as it is and can leave them sparse...but can it really? Not sure. Hence this is why hashes collide and tend to be a little slow -- but the hash function is less expensive since collisions are allowed (not speaking Perl, but stuff in general) -- and collapsing down to a smaller value is important.

> Initially, your datastructure looks like it's just pigeon-holing data (unrelated, but see bucket sort), but your example of the 'special' structure is a little confusing. Terms like stack, queue, dequeue, trie, red-black-tree, list, etc -- are all there to facilitate easier communication.
The data structure I describe consists of two arrays, just as I said. No stack, no queue, no dequeue, no trie, no tree, no list. Just an array of integers, that is, U integers laid out contiguously in memory, just the same way you have an array of integers in C. The trick is just that you say my array lies from here to here in memory, and you initially accept whatever values happen to be in that segment of memory. (Although you might use a stack for array B.)
> Does this 'special structure' have a name?
That I am not sure of. The technique is described in
Kurt Mehlhorn: "Data structures and algorithms 1: sorting and searching" in Brauer, Rozenberg and Salomaa (Ed): EATCS monographs on theoretical computer science. Berlin, Heidelberg, New York, Tokyo: Springer-Verlag, 1984. Section III.8.1
where it is used to implement a bit vector.
> What are you trying to emulate?
A bit vector ;-)
> As it stands, all classical operations being O(1) is only doable when your datastructure is as large as the domain space, right?
Yes, and I had hoped that was clear from the description.
> It's not used very frequently. I understand this is just a question to see if it's possible, but where would this be used and is there a space to fit it?
I've used it once in an academic paper, describing a technique to create a triangulation on a 2-manifold by applying a sequence of vertex-splits, starting from a minimal triangulation. The main result used O(n log n) time (with n the number of vertices of the triangulation), but by using n of the described data structures, the sequence can be obtained in linear time (as long as you have Θ(n²) scratch space available).

I have no pressing need to implement this in Perl right now. But I would like to know whether it's possible to do something similar in Perl.

Abigail

Thanks for typing this -- this makes it a lot more clear what is going on. Interesting stuff...

Again I apologize for my long list of questions too... Infinite telecoms and project managers will do that to you.

I'll take a better look at this when I get home...

Re: Data structure challenge
by gjb (Vicar) on Mar 17, 2004 at 18:11 UTC

Trouble is caused by initialisation, so what data structure can be used that is initialised in constant time? The answer may be to use simple strings for A and B. The length should be U*log_10(U), so that each group of log_10(U) characters represents an "array element". Initialisation can be done using pack, which should be constant time; array access can be simulated using substr, which is also constant time.

Just my 2 cents, might be totally wrong, -gjb-

> Initialisation can be done using pack that should be constant time
Well, that would be interesting. I'm not a pack() wizard, and I can't see how I would do it with pack. What format would I use?

Abigail

Wouldn't a pack(a[U*log_10(U)], 'z') do the trick?

Thanks Aristotle, I was too lazy/hurried to check the correct syntax. The length of the string we want to allocate is N = U*ceil(log_10(U)), so I hoped pack(aN, '') would create a string of N '\0' characters using some low level memory allocation function. Unfortunately, that doesn't seem to happen: the resulting string consists of N spaces, so there goes constant time... :(

Just my 2 cents, -gjb-

Re: Data structure challenge
by BrowserUk (Pope) on Mar 17, 2004 at 18:36 UTC

Something like this?

Update: Out by one error corrected.

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail

If amortizing the initialization cost over the lifetime of the structure were allowed, then simple Perl arrays could be used (by not preallocating the size of the array, so that the initial creation is still O(1)). Your use of strings 'suffers' from this same 'problem'. So I'm pretty sure Abigail would reject this solution.

Your solution would qualify if you could preallocate the length of the string w/o Perl initializing the contents of the string. This is pretty trivial to do in XS, hence the disqualification of that method.

- tye

I'm not allowed to amortize the initialization over the life of the structure: but what about over the life of the program? Is it valid (in the context of this challenge) to allocate a big array (vec, whatever) at time zero and then manage it using my own allocator?
Just as tye says, the problem is in the initialization. If you assign a value in a vec() string, and it's "out of bounds", perl will zero all the in-between values.

That means that if you were to store just one value, say U - 1, that insertion will cost you Θ(U) time. Which isn't O(1).

Abigail

I thought it was too easy.

So the whole challenge comes down to: is there any way to allocate/extend a piece of memory without initialising it, using pure Perl?

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Re: Data structure challenge
by BrowserUk (Pope) on Mar 17, 2004 at 21:53 UTC

Are you getting at $#array = U; ?

```
use Benchmark::Timer;

$T = new Benchmark::Timer;

$T->start( 'autoviv' );
$auto[ $_ ] = 1 for map{ 10**$_ } 0 .. 7;
$T->stop( 'autoviv' );

$#pre = 10**7;
$T->start( 'prealloc' );
$pre[ $_ ] = 1 for map{ 10**$_ } 0 .. 7;
$T->stop( 'prealloc' );
$T->report;
```

1 trial of autoviv (93.750ms total)

1 trial of prealloc (15.625ms total)

That still doesn't achieve O(1), as the number of links in the list is still N, though it is somewhat quicker than autoviv.

That said, when initialising 16 MB of scalar to spaces only takes 36 milliseconds, this is one of those cases where big O notation hides reality, I think.

```
use Benchmark::Timer;
$T = new Benchmark::Timer;

$T->start( 'scalar' ), $x = ' ' x 2**24, $T->stop( 'scalar' ) for 1 .. 10;
$T->report;
```

10 trials of scalar (359.375ms total), 35.938ms/trial

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Re: Data structure challenge
by BrowserUk (Pope) on Mar 18, 2004 at 14:36 UTC

There is an out-by-one error in the logic somewhere that I cannot find right now, but otherwise, I think this complies with the rules of the challenge.

```
#! perl -slw
use strict;
local $, = ' ';

my( $A, $C, @B ) = ( undef, 0 );
open SPARSE, '+> :raw', \$A;

sub readA{
    seek SPARSE, 4 * $_[ 0 ], 0;
    read SPARSE, my $n, 4;
    unpack 'N', $n;
}

sub writeA{
    seek SPARSE, 4 * $_[ 0 ], 0;
    syswrite SPARSE, pack( 'N', $_[ 1 ] ), 4;
}

sub query{
    my $i = shift;
    my $a = readA( $i ) || -1;
    return 0 unless 0 <= $a and $a < $C;
    my $b = $B[ $a ];
    return $b == $i ? 1 : 0;
}

sub insert{
    my $i = shift;
    if( !query( $i ) ) {
        $B[ $C ] = $i;
        writeA( $i, $C );
        $C++;
    }
}

sub del{
    my $i = shift;
    if( query( $i ) ) {
        my $a = readA( $i );
        my $b = $B[ --$C ];
        $B[ $a ] = $b;
        writeA( $b, $a );
    }
}

sub clear{
    $C = 0;
}

my @numbers = map{ int rand( 2**16 ) } 1 .. 10;

print 'Query before insert';
printf "%5d : %s\n", $_, query( $_ ) for @numbers;

insert $_ for @numbers;

print "\nQuery after insert";
printf "%5d : %s\n", $_, query( $_ ) for @numbers;

del $_ for @numbers[ 0..4 ]; ## Delete half

print "\nQuery after deleting half";
printf "%5d : %s\n", $_, query( $_ ) for @numbers;

del $_ for @numbers; ## Delete them all regardless if they still exist

print "\nQuery after deleting the rest";
printf "%5d : %s\n", $_, query( $_ ) for @numbers;
```

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail

Update: this is stupid, see BrowserUk's reply on why.

I would not think this is constant time. I know nothing about how sparse files are implemented, but I would not think you can access any byte in constant time. (If the filesystem does not have sparse files, the situation is even worse, as the OS must fill the file with zero bytes when seeking past the end.)

Okay. Forget the bareword SPARSE; it is just a filehandle, opened (R/W) to an in-memory file per perlfunc:open (5.8.x):

File handles can be opened to ``in memory'' files held in Perl scalars via:
```
open($fh, '>', \$variable) || ..
```

My inspiration was that if you seek on an in-memory file, the string would be extended without being initialised. Of course it isn't, if it is going to emulate a file: if you seek to a position in a random access file, you don't want everything preceding it to be overwritten.

So, no pure-perl way of allocating a buffer without initialisation after all.

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Re: Data structure challenge
by dragonchild (Archbishop) on Mar 17, 2004 at 21:22 UTC
What about using soft references? I used BrowserUk's harness and changed the engine around a little.

Now, I think that doing this is cheating, a little, because I'm using the symbol table, which is a hash. But, I don't think this is possible in pure Perl, not with the constraints you've placed on it.

------
We are the carpenters and bricklayers of the Information Age.

Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

I came up with something similar, using a hash to emulate a sparse array.

But then I realised that I might as well just use the hash to start with and save all the bother. Besides, the hash scales better. The overhead of allocating 4GB of RAM in order to allocate 3 unique 32-bit integers (for example) is far higher than hashing those 3 integers.

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Re: Data structure challenge
by toma (Vicar) on Mar 18, 2004 at 06:26 UTC
I am not offering this as a solution to your challenge, but I suggest looking at the thread Compressing a set of integers. I think that I had a similar problem, but I had a few more opportunities for optimization than you have presented.

With help from the others in the thread, I was able to use pack to make a really cool data structure that is similar to a sparse bit vector. My problem had the potential optimization that some bits were more likely than others, and I could reorder.

For the problem I was trying to solve, Table::ParentChild ended up getting written along the way.

For the deployed application, I ended up using a relational database, which performs well enough today in production.

I also suggest, for practical solutions, that you consider algorithms which are O(log(n)), especially those with small constant factors.

It should work perfectly the first time! - toma
Re: Data structure challenge
by iburrell (Chaplain) on Mar 17, 2004 at 19:33 UTC
With Perl, you can make initializing and clearing constant by using autovivification with arrays. Allocate @A as an empty array. Setting any value will expand the array and set the other values to undef. This is O(N), but it is fast. Clearing is emptying out the entire array. It also has the advantage that the size of the array is the largest value used. Your data structure uses autovivification for the A array, so it has the same performance but more overhead.

It is also important to consider the constants in the running time. Preallocating the array pays an O(U) cost in initialization but makes each operation more efficient. When inserting N items close to U, this can be significant.

> Setting any value will expand the array and set the other values to undef.
Well, yes, but that's exactly what the challenge wants you to avoid. You don't want to spend time creating undefined values.
> Your data structure uses autovivification for the A array so it has the same performance but more overhead.
No. It's important to realize I did not describe the A array in Perl terms. Think C, or even better, assembler. Just declare an array as a chunk of memory where integers are stored. No initialization of the elements allowed - just accept whatever values are in memory.

Abigail

Re: Data structure challenge
by ambrus (Abbot) on Mar 18, 2004 at 07:59 UTC

This is an interesting question.

If I read you correctly, it is like this: you could get an only slightly slower solution by using a hash, or better, a binary tree instead of the uninitialized array A (with a binary tree, you get O(log n) instead of O(1) for each operation). In fact, this is true no matter how you use an uninitialized array; that is, you can always use a hash or a binary tree instead of one.

Notice however, that if you use such a data structure only once, and you use the efficient C implementation, the OS still has to initialize the array when the process gets the memory by calling brk or mmap.

> with a binary tree, you get O(log n) instead of O(1) for each operation
Well, only if the binary tree is balanced (and ordered). I am very well aware of the performance of a binary tree - if I wanted to settle for O(log n) performance, I wouldn't have posted the challenge. The challenge is to do all operations in O(1) - doing one or more in O(log n) is not a challenge.
> Notice however, that if you use such a data structure only once, and you use the efficient C implementation, the OS still has to initialize the array when the process gets the memory by calling brk or mmap.
While some C libraries might initialize the memory claimed with mmap or brk, there's no reason they have to. The OS and/or the C libraries can always be patched so that one can claim a segment of memory and take it "as is". For the scope of the challenge, you may assume that we run on a platform where memory is given out "as is".

Abigail

When saying binary tree, I actually thought of a trie. I know it's still O(log n), so it does not satisfy the conditions, but a binary trie's code is not as complicated as that of a balanced binary tree.

As for initializing memory: the C library certainly won't initialize the whole array when calling malloc. However, on a multi-user OS, the system has to initialize all the pages allocated to a process, for security reasons. If it did not do that, you could get at sensitive information by allocating a lot of memory and searching for data that other processes freed but have not zeroed. This does not take much time because 1. malloc tries to reuse the memory freed by a process, so it doesn't have to ask for a new chunk every time, 2. the OS can do it in idle time, so you don't have to wait for it when you allocate the memory, 3. zeroing a chunk of memory is fast. The first reason may not count with very large arrays, as malloc gets those with mmap, and when they're freed they're munmapped.

The question you asked still holds, however, because using such a structure is worthwhile even on such OSes, as you may use it many times, and you don't have to zero it every time the same process clears and reuses one. I still don't know if it can be done in Perl.

No, the OS doesn't necessarily have to initialize the array. Many compilers will do that, but all malloc() should be doing is saying "I will be using this space for a while". It's not guaranteed at all what will be there.

------
We are the carpenters and bricklayers of the Information Age.

Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: Data structure challenge
by Anonymous Monk on Mar 17, 2004 at 20:50 UTC

Is this not impossible using the data structures available to us within Perl? If we do not initialize before using a structure, then perl will autovivify, which is nearly just as bad. So to make this work with any kind of array system, you would need to bypass perl's autovivification. Of course, you could skip any type of complicated data structure and stick with manipulating scalars. Fixed-length fields perhaps? Or something silly like that ;)

Re: Data structure challenge
by dragonchild (Archbishop) on Mar 18, 2004 at 13:31 UTC
I've been thinking about this. Obviously, this is do-able in C (and similar languages), as you have C-like pseudo-code in your challenge. But, other than sparse arrays[1], what would the practical benefit of this data structure be? If you add in a hashing function to map strings to your set of i, you still have the collision issue.

Are you considering this as a possible implementation of the arrays Perl's hashes use? It would seem kinda wasteful to add an indexing array just to avoid clearing and initializing in O(n) ...

[1] I say sparse arrays because in the 2-array method, U doesn't have to be the upper limit of all i. It simply has to be the upper limit of how many i you can have at one time. So, if you know you'll only ever have a max of 10 things, they could be in the set of 2**32, but you would only need two 10-element arrays and a short as your C.

------
We are the carpenters and bricklayers of the Information Age.

Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

> But, other than sparse arrays[1], what would the practical benefit of this data structure be?
Worst-case constant time operations, regardless of how many elements there are in the data structure.
> If you add in a hashing function to map strings to your set of i, you still have the collision issue.
Well, yes. Don't do that then. ;-) As I said, it only 'works' if you know certain things about your set - in this case, you know you are only going to store non-negative integers not exceeding U.
> Are you considering this as a possible implementation of the arrays Perl's hashes use?
No.
> I say sparse arrays because in the 2-array method, U doesn't have to be the upper limit of all i. It simply has to be the upper limit of how many i you can have at one time. So, if you know you'll only ever have a max of 10 things, they could be in the set of 2**32, but you would only need 2 10-element arrays and a short as your C.
In the 2-array method I described, it's essential that U is the upper limit of all i, because i is used as an index into the array A.

Abigail

Re: Data structure challenge
by bsb (Priest) on Mar 22, 2004 at 23:34 UTC
Does $#A = $#B = U initialize the arrays correctly?

Yes, Perl has to initialize all array elements. Internally, it stores pointers (to SVs), and if Perl didn't initialize the array elements, it would end up with random memory addresses, with no way of finding out whether an element was initialized or not.

Abigail

As I thought about it, I began to suspect that.

> XS of course doesn't count.
Does this cover virtually programming in C via syscall(&SYS_mmap,..) and pack-ing data structures?

Otherwise, I give up. In fact, I give up anyway, I'm not going near syscall.

Node Type: perlmeditation [id://337432]
Approved by BrowserUk
Front-paged by Enlil