Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery

Re: Re: Performance quandary

by SwellJoe (Scribe)
on Feb 26, 2002 at 15:01 UTC ( #147588=note: print w/replies, xml ) Need Help??

in reply to Re: Performance quandary
in thread Performance quandary

Interesting thoughts Xanatax,

From an academic standpoint (since I'm writing this program to learn as much as to solve a problem, that definitely makes the academic valuable to me), I like the idea of playing with different data structures--even hugely different, such as an array in memory, indexed by part of the MD5. There are at least two problems I can see with this approach (I'm not wholly illiterate on data structures and indexing, and other fun things--I am a Squid nerd, after all), and I'd welcome thoughts or pointers to docs on solving them:

We can't have 32 bytes worth of array entries, so we'd index on a part of the MD5. Even so, to guarantee uniqueness we need at least a few bytes--which leads to an extremely huge, but very sparsely populated array. Not ideal, even if memory is not an issue--I think memory would become an issue if we're using an array /that/ big. So how to handle sparse data structures of this sort?

Persistence and concurrency. We have to be able to read the data from another process (the indexing itself is not the end, it is merely the means). So how to dump this out, in near realtime, in a form that is quickly accessible to another process? We don't need to modify the data from the other process, just read it.

As for the no-FETCH, STORE everything proposal...I don't get it. I think it is missing the vital parent->child relationship component, if I understand you. I have considered deleting the parent each time and rewriting with the new child, but I suspect that is an optimization that the BDB handles already (and probably better than I can in Perl--since it is probably smart enough to know when it can insert into the existing 4kb space and when it will have to expand the space).

Replies are listed 'Best First'.
Re: Re: Re: Performance quandary
by Xanatax (Scribe) on Feb 28, 2002 at 09:54 UTC
    well, if you are working on a 32bit machine, and you read 24bits worth of the md5 hash as your index to the array i get (32 * 2^24) bits of array, unpopulated, or 64MB. assuming you are reading the md5 as an 8bit ASCII string then you get three chars worth of it, which is a bit sad, but most of the 8 bits are redundant for the md5, so if you are lucky you should be able to compress each char to 6bits, via substitution, which gives you a fourth char for the same space. if you can't find four chars that vary enough between md5s (the last four chars?) can you double hash the keys for more variety? SQUID gives you md5, does your end program search by md5? if so you can still store the data by another key, you just have to store the original md5 with the rest of the data to match on.

    anyhow, if you run your data structure as a array of lists, with 16 million places in the array to reference, and 1-3 million actual items to index, you are below any load threshhold, so seeks will require on average only one or two compare operations. so long as you can keep the data spread evenly across the array, it should be well worth the memory used. the fact that it would be sparsely populated is of course the goal of what i was suggesting, then you have virtually constant seek time from zero to 3 mil data points.

    the total memory cost for this is 64MB for the array itself, again, assuming 32bit pointers, plus the 100-200MB data set you are already dealing with.

    on the FETCH/STORE issue, what i meant was: don't build the tree in this program, just feed the data into tables. build the trees in the program that is reading the tables. then you aren't doing a search for parents/children, a store, then another search for the trees, you are just storing the data unprocessed, and doing one search for everything and building the tree then. if you're end user program can be made to build the trees itself you can cut half the searching...

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://147588]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (4)
As of 2018-05-25 22:52 GMT
Find Nodes?
    Voting Booth?