Re^2: Re-orderable keyed access structure?
by BrowserUk (Pope) on Aug 14, 2004 at 22:27 UTC
The ordering is important. I'm using a tied hash representing keyed data on disk. The data is too large to fit in memory and so the tied hash internals read/write to/from disk as required. That's all working fine, but hitting the disk for every read & write has a cost.
To mitigate that cost, I wish to cache some amount of the data. The 'items' in my descriptions are (references-to + state-of) the cached data.
The hash in the description is the internal (to the tie) keyed access to the cache.
The array is an ordered stack that maintains the "timeliness" of the cache. Things nearer the top should be retained longest when possible; things near the bottom should be discarded first. The ordering of the array is its whole purpose for existing.
Each time an element in the tied hash is accessed, the first thing to do is look to see if it is currently in the cache (the internal hash). If it is, then it can be returned (FETCH) or modified (STORE) as required. Having been accessed, its position in the stack is adjusted according to some heuristic.
If an LRU algorithm were chosen, then any access would promote the item's tag (reference or key or index) from its current place in the body of the stack to the top. Everything previously above it moves down one.
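As a minimal sketch of that plain LRU promotion, assuming the stack holds the keys themselves and a hypothetical %index hash maps each key to its current position (neither name is from the post):

```perl
use strict;
use warnings;

# Hypothetical layout: @stack holds the keys, newest at index 0 (the
# "top"); %index maps each key to its current position in @stack.
my @stack = qw( c b a );                  # 'c' most recently used
my %index = map { $stack[$_] => $_ } 0 .. $#stack;

sub promote_to_top {
    my( $key ) = @_;
    my $pos = delete $index{ $key };
    splice @stack, $pos, 1;               # remove from current position
    unshift @stack, $key;                 # place on top
    # everything that was above the old position has moved down one,
    # so those index entries must be refreshed
    $index{ $stack[$_] } = $_ for 0 .. $pos;
}

promote_to_top( 'a' );
print "@stack\n";                         # a c b
```

The cost of keeping %index in step with every splice is exactly the bookkeeping burden the rest of this post is trying to avoid.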
When an attempt to access a key in the tied hash results in a miss in the internal cache, the bottom element of the stack is shifted off to free up space for the new item read from disk.
But a standard LRU algorithm isn't everything. If the bottom element of the stack has been updated, then it will need to be written back to disk before being discarded. If there is an unchanged element in the stack above it, it may be cheaper to discard that instead and leave the modified element in the cache, until there are no unmodified elements left to discard, in case another modification for that element comes along in the near future.
So, a possible algorithm would be to place any newly read-from-disk element at the top of the stack. Any FETCH access to an element already in cache promotes its position in the stack by one place; any STORE access to an existing element might promote its position two places. In this way, modified elements will tend to be retained longer after their last access than unmodified elements.
There are numerous variations upon this scheme that I would like to try. Put new elements into the middle of the stack, and then allow their positions to be adjusted relative to the accesses made. And so on.
So, the obvious choice is to use an array to maintain the ordering, as splice can easily perform all the operations described above: MoveToTop(), MoveToBottom(), promote(N), demote(N). Complete flexibility to tailor the heuristic--even dynamically.
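Those splice-based primitives might look like the following; the sub names come from the post itself, while the @stack variable and the clamping at the ends are assumptions:

```perl
use strict;
use warnings;

my @stack = qw( e d c b a );   # index 0 is the top of the stack

# Move the element at $pos to the top (classic LRU behaviour).
sub move_to_top    { my $pos = shift; unshift @stack, splice @stack, $pos, 1 }

# Move the element at $pos to the bottom (first in line to be discarded).
sub move_to_bottom { my $pos = shift; push @stack, splice @stack, $pos, 1 }

# Promote the element at $pos by $n places, clamping at the top.
sub promote {
    my( $pos, $n ) = @_;
    my $new = $pos - $n;  $new = 0 if $new < 0;
    splice @stack, $new, 0, splice @stack, $pos, 1;
}

# Demote the element at $pos by $n places, clamping at the bottom.
sub demote {
    my( $pos, $n ) = @_;
    my $new = $pos + $n;  $new = $#stack if $new > $#stack;
    splice @stack, $new, 0, splice @stack, $pos, 1;
}

promote( 3, 2 );      # the element at index 3 ('b') moves up two places
print "@stack\n";     # e b d c a
```

With these, the heuristic above is just promote( $pos, 1 ) on a FETCH and promote( $pos, 2 ) on a STORE.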
But splice requires indexes, and the data is accessed via keys. There needs to be a way of mapping keys to indexes so that each time a cache element is accessed, the appropriate adjustment can be made to the stack.
But when an item is shifted off the bottom of the stack, I need to be able to map that item back to its key, so that it can be removed from the internal cache-hash.
The overall scheme is somewhat more complicated than that, but hopefully this describes why I need the two types of structure, the two types of access, and why none of the existing caching mechanisms that I've looked at on CPAN will do what I am trying to do.
Placing the items in the array, and storing references to the array elements as the values of the cache-hash, means that keyed access to the items is simply one level of indirection from the key.
It also means that as the positions of the elements in the array get swapped around, the keyed access to them is self-maintained.
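A sketch of that arrangement, with assumed names throughout: each item is an anonymous array of [ data, dirty-flag ], @stack holds the items in retention order, and %cache maps each key to the same reference, so keyed access survives any reordering of the stack:

```perl
use strict;
use warnings;

# Hypothetical layout: @stack orders the items; %cache gives keyed
# access to the *same* references, one level of indirection deep.
my( @stack, %cache );

for my $key ( qw( aaa bbb ccc ) ) {
    my $item = [ "data-for-$key", 0 ];    # [ data, dirty-flag ]
    unshift @stack, $item;                # newest on top
    $cache{ $key } = $item;               # keyed access to the same item
}

# Reordering the stack does not disturb the keyed access:
push @stack, shift @stack;                # demote the top item to the bottom
print $cache{ccc}[0], "\n";               # data-for-ccc, regardless of position
```

The reverse direction is the sticking point: given only $stack[-1], nothing here says which key of %cache it belongs to.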
The problem creeps in when shifting the last element off the array in order to discard it. The corresponding (internal) hash element needs to be deleted also, but the only way I can see to access it is to iterate the keys of the hash, inspecting the values, until I find one that matches a reference to the element being discarded. That's a slow linear search that I was hoping to avoid.
The best scheme I have come up with so far is to store fixed length (padded) copies of the keys in a string. This allows me to use substr to 'splice' the keys around easily. Locating the index for the key to move requires a linear search, but linear searching a string is much faster than linear searching an array. It also has the upside of requiring less memory for the infrastructure.
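A sketch of that string-based scheme, assuming the 3-char keys mentioned below (the variable and sub names are illustrative only):

```perl
use strict;
use warnings;

my $KEYLEN = 3;
my $order  = 'cccbbbaaa';    # 'ccc' on top, 'aaa' on the bottom

# index() does the linear search over the string; skip any hit that
# doesn't fall on a key boundary.
sub find_key {
    my( $key ) = @_;
    my $pos = index $order, $key;
    $pos = index $order, $key, $pos + 1 while $pos > 0 and $pos % $KEYLEN;
    return $pos;
}

# Move a key to the top of the stack: 4-arg substr 'splices' it out,
# then back in at offset 0.
sub promote_key_to_top {
    my( $key ) = @_;
    my $pos = find_key( $key );
    return if $pos < 0;
    substr $order, $pos, $KEYLEN, '';
    substr $order, 0, 0, $key;
}

# Discarding the bottom element recovers its key directly, ready to be
# deleted from the internal cache-hash as well.
sub pop_bottom_key { substr $order, -$KEYLEN, $KEYLEN, '' }

promote_key_to_top( 'aaa' );
print "$order\n";            # aaacccbbb
```

Note that pop_bottom_key() solves the map-item-back-to-key problem directly, since the string holds the keys themselves.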
The downside is that the keys must be fixed-length. As it happens, for my particular application, the keys are fixed length (3-chars) anyway, but I would still like to avoid the linear search (or structure iteration) if I could. It increasingly seems that this isn't possible.
I'd also have liked the caching mechanism to be generic, which I lose with the fixed-length key restriction. For a few applications, it may be possible to predict the maximum length and pad to achieve this, but for many others, the length of the keys would not be known until runtime.
I apologise for there being too many words, but my attempts to explain it with less have obviously failed dismally :)