well, if you are working on a 32bit machine, and you read 24bits worth of the md5 hash as your index to the array i get (32 * 2^24) bits of array, unpopulated, or 64MB. assuming you are reading the md5 as an 8bit ASCII string then you get three chars worth of it, which is a bit sad, but most of the 8 bits are redundant for the md5, so if you are lucky you should be able to compress each char to 6bits, via substitution, which gives you a fourth char for the same space. if you can't find four chars that vary enough between md5s (the last four chars?) can you double hash the keys for more variety? SQUID gives you md5, does your end program search by md5? if so you can still store the data by another key, you just have to store the original md5 with the rest of the data to match on.
anyhow, if you run your data structure as a array of lists, with 16 million places in the array to reference, and 1-3 million actual items to index, you are below any load threshhold, so seeks will require on average only one or two compare operations. so long as you can keep the data spread evenly across the array, it should be well worth the memory used. the fact that it would be sparsely populated is of course the goal of what i was suggesting, then you have virtually constant seek time from zero to 3 mil data points.
the total memory cost for this is 64MB for the array itself, again, assuming 32bit pointers, plus the 100-200MB data set you are already dealing with.
on the FETCH/STORE issue, what i meant was: don't build the tree in this program, just feed the data into tables. build the trees in the program that is reading the tables. then you aren't doing a search for parents/children, a store, then another search for the trees, you are just storing the data unprocessed, and doing one search for everything and building the tree then. if you're end user program can be made to build the trees itself you can cut half the searching...
Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
Read Where should I post X? if you're not absolutely sure you're posting in the right place.
Please read these before you post! —
Posts may use any of the Perl Monks Approved HTML tags:
You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
- a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
Link using PerlMonks shortcuts! What shortcuts can I use for linking?
See Writeup Formatting Tips and other pages linked from there for more info.
| & || & |
| < || < |
| > || > |
| [ || [ |
| ] || ] ||