in reply to
Re: Re: Performance quandary
in thread Performance quandary
well, if you are working on a 32bit machine, and you read 24 bits worth of the md5 hash as your index into the array, i get (32 * 2^24) bits of array, unpopulated, or 64MB. assuming you are reading the md5 as an 8bit ASCII string, you get three chars' worth of it, which is a bit sad, but most of the 8 bits are redundant for an md5, so if you are lucky you should be able to compress each char to 6 bits via substitution, which gives you a fourth char in the same space. if you can't find four chars that vary enough between md5s (the last four chars?), can you double-hash the keys for more variety? SQUID gives you the md5, but does your end program search by md5? if so, you can still store the data under another key; you just have to store the original md5 with the rest of the data to match on.
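a sketch of the 24-bit-index idea in python (the names here are mine, and i'm working straight from the raw 16-byte digest rather than the ASCII form, which sidesteps the 6-bit substitution step entirely):

```python
import hashlib

TABLE_BITS = 24              # 2^24 slots, ~16.7 million
TABLE_SIZE = 1 << TABLE_BITS

def bucket_index(key: bytes) -> int:
    """Use the first 3 bytes (24 bits) of the binary md5 digest as the slot."""
    digest = hashlib.md5(key).digest()   # 16 raw bytes, not the hex string
    return digest[0] << 16 | digest[1] << 8 | digest[2]

idx = bucket_index(b"http://example.com/some/object")
assert 0 <= idx < TABLE_SIZE
```

the only thing that matters is that the 3 bytes you pick are well distributed, and for md5 any 3 bytes are.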
anyhow, if you run your data structure as an array of lists, with 16 million places in the array to reference and 1-3 million actual items to index, you are below any load threshold, so seeks will require on average only one or two compare operations. so long as you can keep the data spread evenly across the array, it should be well worth the memory used. the fact that it would be sparsely populated is of course the goal of what i was suggesting: then you have virtually constant seek time from zero to 3 million data points.
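the array-of-lists structure is just a chained hash table; a minimal python sketch (store/fetch names are mine, and the full key is kept in each node because two different keys can land in the same 24-bit slot):

```python
import hashlib

TABLE_BITS = 24
TABLE_SIZE = 1 << TABLE_BITS

# one list head per slot; mostly None, populated lazily
table = [None] * TABLE_SIZE

def slot(key: bytes) -> int:
    d = hashlib.md5(key).digest()
    return d[0] << 16 | d[1] << 8 | d[2]

def store(key: bytes, value) -> None:
    i = slot(key)
    if table[i] is None:
        table[i] = []
    table[i].append((key, value))   # keep the key to resolve collisions

def fetch(key: bytes):
    i = slot(key)
    if table[i] is None:
        return None
    for k, v in table[i]:           # chain averages 1-2 entries at this load
        if k == key:
            return v
    return None
```

with 1-3 million items spread over 16 million slots, most fetches hit an empty slot or a one-entry chain, which is where the near-constant seek time comes from.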
the total memory cost for this is 64MB for the array itself (again, assuming 32bit pointers), plus the 100-200MB data set you are already dealing with.
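the 64MB figure checks out:

```python
# 2^24 slots, one 32-bit (4-byte) pointer per slot
slots = 1 << 24
total_bytes = slots * 4
print(total_bytes // (1 << 20), "MB")   # 64 MB
```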
on the FETCH/STORE issue, what i meant was: don't build the tree in this program, just feed the data into the tables. build the trees in the program that is reading the tables. then you aren't doing a search for parents/children, a store, and then another search to build the trees; you are just storing the data unprocessed, and doing one search for everything and building the tree then. if your end-user program can be made to build the trees itself, you can cut out half the searching...
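the writer/reader split might look like this (a python sketch; the names and the row shape are my own assumptions, since the original program is presumably using perl tied hashes):

```python
# writer side: no parent/child lookups at store time, just dump flat rows
rows = []

def store_row(node_id, parent_id, payload):
    rows.append((node_id, parent_id, payload))

# reader side: one pass over the stored rows builds the whole tree
def build_tree(rows):
    children = {}   # parent_id -> [child node_ids]
    data = {}       # node_id -> payload
    for node_id, parent_id, payload in rows:
        data[node_id] = payload
        children.setdefault(parent_id, []).append(node_id)
    return children, data

store_row("a", None, "root")
store_row("b", "a", "child of a")
store_row("c", "a", "another child of a")

children, data = build_tree(rows)
```

the point is that the writer never searches at all; the reader does one linear pass and gets every parent/child link for free from the stored parent ids.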