good chemistry is complicated,
and a little bit messy -LW
Re: A (memory) poor man's <strike>hash</strike> lookup table.by demerphq (Chancellor)
|on Dec 01, 2003 at 13:20 UTC||Need Help??|
Over the weekend I did a little work on this. I loaded your million numbers into a digit trie (a Patricia Trie restricted to the digits 0-9) optimized for reduced memory consumption. The records loaded in around 100 seconds on a 2.2 gighz machine (if temporary space is not an issue this time drops dramatically), and ultimately required less than 5MB to store. I calculated the ratio of bytes used to store the data versus the sum of lengths of the keys and it was 0.62, meaning that the data structure requires 0.62 bytes to store one byte of key. Lookup time in the structure is comparable (or better) than a raw hash (I will admit the implementation of the lookup is done in Inline::C/XS).
Note that this Trie implementation may not be optimal. Because it stores the keys as digits strings instead of byte encoding the integer and then doing a byte-trie its probably less efficient a trie implemented as the latter. You could probably calculate the storage requirements but I have to admit I cant be bothered, however I will hazard a guesstimate that it would be around 2.5MB. Also this structure would be even faster to do lookups on (the issue of packing the integers as necessary before lookup aside).
Also, these calculations are _purely_ about storing the keys. Not about storing the key along with a value. Assuming this was required and the values to be stored were long ints you could add 4MB to both numbers (storing them as packed longs). This would mean the digit trie required total 8.5 mb and the byte trie would come in at 6.5MB. (Actually im not bothering to calculate the size of an SV here. Presumably you would need a small amount of ram, say (a very generous) 1k for the object that holds the two char arrays.)
Which actually brings up a different point. Instead of storing your million numbers in an array you could store them in a packed string. This would mean the total memory requirement would be 4mb, and would still facilitate binsearching. Finger points into the data may speed that search up drammatically.
The discussion with Judy arrays led me to rereview the subject. Judy arrays are going to be slightly less efficient than a pure byte-tree in terms of storage. I think it would be safe to say that for data packed this densely the minimum bound for memory utilization (without bringing compression into the picture) is going to be the numbers you get from a Trie. However its worth noting that this scenario is realtively unrealistic. With lower density data sets my understanding from what ive read on Judy arrays would suggest they would be much more efficient.
I wanted to say that I found this thread thought provoking and interesting and I hope you post more along its line. The subject of optimisation techniques is an interesting one that is IMO useful and worth discussing. All the old caveats about optimizing are correct, but the fact comes that once you identified an optimization target having discussion and documentation of ways to handle it can be very useful.
PS: Its worth noting that conceptually a Trie is essentially a non cpaturing DFA regex engine anchored at the beginning of the string. Its not difficult to extend a trie to handle Kleene closure (*) and one-or-more (+), and only slighlty more difficult to make it handle non anchored (^) searches as well. The advantage of this is that its more efficient for matching many constant strings than the NFA regex engine that perl uses. (Although of course with serious deficiencies.) Ive long wished that perl handled this internally, allowing for a modifier to cause a regex pattern to be treated as a DFA an not as an NFA, which would essentially mean putting a trie engine into perl.
Its weird that these things are related (who would guess there is a direct family tree relationship between regular expressions and judy arrays!) In fact you could take this point further and argue that really there is a relationship between linked lists (degenrate binary trees) and regular expressions! Woo-hoo! Talk about 6 degrees of seperation. ;-)
First they ignore you, then they laugh at you, then they fight you, then you win.