Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?

Re: Performance quandary

by Xanatax (Scribe)
on Feb 25, 2002 at 01:26 UTC ( #147240=note: print w/replies, xml ) Need Help??

in reply to Performance quandary

From your previous question I got:

Each object entry requires at least one FETCH (to see if the object already exists), and one STORE...

My question is, Is this necessary? I don't know what you are trying do accomplish, but if memory is not an issue, and keeping up with SQUID is, then you want as fast a process as possible writing to your DB as SQUID runs. I haven't noticed if you've said anything about the time constraints on actually using this data... who or what is querying the datastructure after you build it? do they care if it takes longer to query? do they care if the order of complexity for a query is increased?

if not, then we can fix you in no time... just don't do the FETCH op. STORE everything, and when you go to read the data (A) query all the lines you need, (B) delete them, (C)build your tree and give it to the user, and (D) re-insert the tree while the user is playing with it.

also, you potentially could change the datastructure... if you can generate an md5 in constant time, and memory is not a problem, you could generate an array of the length of possible checksums, then you could read and write to it in constant time. more realisticly if memory is not an issue you could play with very large arrays, indexed on the first N chars of the checksum...

point is, instead of building something that scales up well, you could build something that doesn't scale at all, and is just always going to be bigger than your dataset. that'll always win for speed.

Replies are listed 'Best First'.
Re: Re: Performance quandary
by SwellJoe (Scribe) on Feb 26, 2002 at 15:01 UTC
    Interesting thoughts Xanatax,

    From an academic standpoint (since I'm writing this program to learn as much as to solve a problem, that definitely makes the academic valuable to me), I like the idea of playing with different data structures--even hugely different, such as an array in memory, indexed by part of the MD5. There are at least two problems I can see with this approach (I'm not wholly illiterate on data structures and indexing, and other fun things--I am a Squid nerd, after all), and I'd welcome thoughts or pointers to docs on solving them:

    We can't have 32 bytes worth of array entries, so we'd index on a part of the MD5. Even so, to guarantee uniqueness we need at least a few bytes--which leads to an extremely huge, but very sparsely populated array. Not ideal, even if memory is not an issue--I think memory would become an issue if we're using an array /that/ big. So how to handle sparse data structures of this sort?

    Persistence and concurrency. We have to be able to read the data from another process (the indexing itself is not the end, it is merely the means). So how to dump this out, in near realtime, in a form that is quickly accessible to another process? We don't need to modify the data from the other process, just read it.

    As for the no-FETCH, STORE everything proposal...I don't get it. I think it is missing the vital parent->child relationship component, if I understand you. I have considered deleting the parent each time and rewriting with the new child, but I suspect that is an optimization that the BDB handles already (and probably better than I can in Perl--since it is probably smart enough to know when it can insert into the existing 4kb space and when it will have to expand the space).

      well, if you are working on a 32bit machine, and you read 24bits worth of the md5 hash as your index to the array i get (32 * 2^24) bits of array, unpopulated, or 64MB. assuming you are reading the md5 as an 8bit ASCII string then you get three chars worth of it, which is a bit sad, but most of the 8 bits are redundant for the md5, so if you are lucky you should be able to compress each char to 6bits, via substitution, which gives you a fourth char for the same space. if you can't find four chars that vary enough between md5s (the last four chars?) can you double hash the keys for more variety? SQUID gives you md5, does your end program search by md5? if so you can still store the data by another key, you just have to store the original md5 with the rest of the data to match on.

      anyhow, if you run your data structure as a array of lists, with 16 million places in the array to reference, and 1-3 million actual items to index, you are below any load threshhold, so seeks will require on average only one or two compare operations. so long as you can keep the data spread evenly across the array, it should be well worth the memory used. the fact that it would be sparsely populated is of course the goal of what i was suggesting, then you have virtually constant seek time from zero to 3 mil data points.

      the total memory cost for this is 64MB for the array itself, again, assuming 32bit pointers, plus the 100-200MB data set you are already dealing with.

      on the FETCH/STORE issue, what i meant was: don't build the tree in this program, just feed the data into tables. build the trees in the program that is reading the tables. then you aren't doing a search for parents/children, a store, then another search for the trees, you are just storing the data unprocessed, and doing one search for everything and building the tree then. if you're end user program can be made to build the trees itself you can cut half the searching...

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://147240]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (2)
As of 2018-01-22 06:25 GMT
Find Nodes?
    Voting Booth?
    How did you see in the new year?

    Results (232 votes). Check out past polls.