
Re: Re-orderable keyed access structure?

by Aristotle (Chancellor)
on Aug 14, 2004 at 21:05 UTC ( #383017 )

in reply to Re-orderable keyed access structure?

This is really in reply to Re^8: Re-orderable keyed access structure?, but I'm breaking the thread structure here to go back one step and ask what it really is that you need to achieve. Maybe, if you state your actual problem, someone can propose a different way of looking at it that would indicate a data structure more directly suited to your ultimate goals.

Re^2: Re-orderable keyed access structure?, for example, makes me wonder whether the actual order is at all important, or whether you simply require it as a corollary of your way of looking at your problem. That description looks vaguely like you might want a heap instead of a list.

But I don't know, so I could be completely wrong about this, and therefore I'm asking what it is you're really trying to solve.

Makeshifts last the longest.


Re^2: Re-orderable keyed access structure?
by BrowserUk (Pope) on Aug 14, 2004 at 22:27 UTC

    The ordering is important. I'm using a tied hash representing keyed data on disk. The data is too large to fit in memory and so the tied hash internals read/write to/from disk as required. That's all working fine, but hitting the disk for every read & write has a cost.

    To mitigate that cost, I wish to cache some amount of the data. The 'items' in my descriptions are (references-to + state-of) the cached data.

    The hash in the description is the internal (to the tie) keyed access to the cache.

    The array is an ordered stack that maintains the "timeliness" of the cache. Things nearer the top should be retained longest when possible. Things near the bottom should be discarded first. The ordering of the array is its whole purpose for existing.

    Each time an element in the tied hash is accessed, the first thing to do is look to see if it is currently in the cache (the internal hash). If it is, then it can be returned (FETCH) or modified (STORE) as required. Having been accessed, its position in the stack is adjusted according to some heuristic.

    If an LRU algorithm were chosen, then any access would promote the item's tag (reference or key or index) from its current place in the body of the stack to the top. Everything previously above it moves down one.

    When an attempt to access a key in the tied hash results in a miss in the internal cache, the bottom element of the stack is shifted off to free up space for the new item read from disk.
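    The hit-promote/miss-evict flow described above can be sketched in miniature. (Python here for brevity; the thread's implementation is Perl, and `read_from_disk` and `CAPACITY` are illustrative stand-ins, not names from the post.)

```python
# A minimal read-through cache in the shape described: a dict for
# keyed access, a list as the recency stack (index 0 = top), and
# eviction from the bottom of the stack on a miss.

CAPACITY = 3  # illustrative; real caches would be far larger

def fetch(cache, stack, key, read_from_disk):
    if key in cache:                        # hit: promote to the top
        stack.insert(0, stack.pop(stack.index(key)))
        return cache[key]
    if len(stack) >= CAPACITY:              # miss with a full cache:
        evicted = stack.pop()               # discard the bottom item
        del cache[evicted]
    value = read_from_disk(key)             # miss: read and cache
    cache[key] = value
    stack.insert(0, key)
    return value
```

A dirty item would additionally be written back to disk before the `del`; that bookkeeping is omitted here.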

    But a standard LRU algorithm isn't everything. If the bottom element of the stack has been updated, then it will need to be written back to disk before being discarded. If there is an unchanged element in the stack above it, it may be cheaper to discard that instead and leave the modified element in the cache until there are no unmodified elements left to discard, in case another modification to that element comes along in the near future.

    So, a possible algorithm would be to place any newly read-from-disk element at the top of the stack. Any FETCH access to an element already in the cache promotes its position in the stack by one place. Any STORE access to an existing element in the stack might promote its position two places. In this way, modified elements will tend to be retained longer after their last access than unmodified elements.

    There are numerous variations upon this scheme that I would like to try: put new elements into the middle of the stack, and then allow their positions to be adjusted relative to the accesses made; and so on.

    So, the obvious choice is to use an array to maintain the ordering, as splice can easily perform all the operations described above: MoveToTop(), MoveToBottom(), promote(N), demote(N). Complete flexibility to tailor the heuristic--even dynamically.
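    Those four splice-style operations can be sketched like so. (Python for brevity; a list stands in for the Perl array, with `pop`/`insert` standing in for splice, and the function names echo the ones above.)

```python
# Splice-style reordering on a recency stack: index 0 is the top.

def move_to_top(stack, i):
    stack.insert(0, stack.pop(i))

def move_to_bottom(stack, i):
    stack.append(stack.pop(i))

def promote(stack, i, n=1):
    # Move the item at index i up n places, clamped at the top.
    stack.insert(max(0, i - n), stack.pop(i))

def demote(stack, i, n=1):
    # Move the item at index i down n places, clamped at the bottom.
    item = stack.pop(i)
    stack.insert(min(len(stack), i + n), item)

s = list("ABCDE")
promote(s, 3, 2)    # D moves up two places: A D B C E
```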

    But splice requires indices, while the data is accessed via keys. There needs to be a way of mapping keys to indices so that each time a cache element is accessed, the appropriate adjustment can be made to the stack.

    But when an item is shifted off the bottom of the stack, I need to be able to map that item back to its key, so that it can be removed from the internal cache-hash.

    The overall scheme is somewhat more complicated than that, but hopefully this describes why I need the two types of structure, the two types of access, and why none of the existing caching mechanisms that I've looked at on CPAN will do what I am trying to do.

    Placing the items in the array, and storing references to the array elements as the values of the cache-hash, means that keyed access to the items is simply one level of indirection from the key.

    It also means that as the positions of the elements in the array get swapped around, the keyed access to them is self-maintained.

    The problem creeps in when shifting the last element off the array in order to discard it. The corresponding (internal) hash element needs to be deleted also, but the only way I can see to access it is to iterate the keys of the hash and inspect the values until I find one that matches a reference to the element being discarded. That's a slow linear search that I was hoping to avoid.

    The best scheme I have come up with so far is to store fixed length (padded) copies of the keys in a string. This allows me to use substr to 'splice' the keys around easily. Locating the index for the key to move requires a linear search, but linear searching a string is much faster than linear searching an array. It also has the upside of requiring less memory for the infrastructure.
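    That fixed-width key-string scheme can be sketched as follows. (Python for brevity; `WIDTH`, `index_of` and `move_to_top` are illustrative names.) Keys padded to a fixed width live in one string, reordering is string surgery, and locating a key is the linear scan over aligned slots that the post describes.

```python
# Order maintained as one string of fixed-width keys; the analogue
# of the Perl substr/s/// splicing described above.

WIDTH = 3  # the thread's keys happen to be 3 chars wide

def index_of(order, key):
    # Linear scan over aligned slots -- the search the post mentions.
    for i in range(0, len(order), WIDTH):
        if order[i:i + WIDTH] == key:
            return i
    raise KeyError(key)

def move_to_top(order, key):
    i = index_of(order, key)
    return key + order[:i] + order[i + WIDTH:]

order = "aaabbbcccdddeee"
order = move_to_top(order, "ccc")   # -> "cccaaabbbdddeee"
```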

    The downside is that the keys must be fixed length. As it happens, for my particular application, the keys are fixed length (3 chars) anyway, but I would still like to avoid the linear search (or structure iteration) if I could. It increasingly seems that this isn't possible.

    I'd also have liked the caching mechanism to be generic, which I lose with the fixed-length key restriction. For a few applications, it may be possible to predict the maximum length and pad to achieve this, but for many others, the length of the keys would not be known until runtime.

    I apologise for there being too many words, but my attempts to explain it with less have obviously failed dismally :)

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

      So on the outside, the access is by key only? Then the order doesn't matter at all. It's just your way of looking at the weighting of your items; your task calls not for an order, but for a way to weight elements. Your single constraint is that there needs to be a fast way to locate the item with the lowest (or highest, depending on which way around you want to define it) weight.

      Well, that is indeed what a heap is good for. A heap is a complete binary tree with a simple condition: the parent's value is always greater than either of its children's values. That means the greatest value in the tree is always at its root. Leaves only exist at the very bottom of the tree. To insert an element, you attach it to a free slot among the leaves. In that case, or if you adjust a child's value such that it violates this requirement, the violation is trivial to fix: you swap it with its parent, which may recursively cause a new violation with regard to its new parent. Since the structure is always a complete binary tree and you have to do this for any operation (either up the tree or down the tree, depending on whether you are inserting or removing elements), all operations happen in O( log n ).

      This is obviously faster than your linear-time splices.

      And there is more excellent news for you: since a heap is a complete binary tree, it's easy to map to an array. Just define the children of the element at position i to be the elements at 2i and 2i + 1. Consequently, the parent of an element is int( i / 2 ). That's for a 1-based index; for a 0-based index, you need some additions and subtractions to twiddle the numbers appropriately. The tree root is then at 1, its children are at 2 and 3, their respective children are at 4, 5 and 6, 7, the next level of siblings is at 8 - 15, etc.
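      The index arithmetic is small enough to state directly (a Python sketch, using the 1-based indexing from the paragraph above):

```python
# 1-based array-as-heap arithmetic: the children of slot i live at
# 2i and 2i + 1, and the parent of slot i is i // 2.

def children(i):
    return 2 * i, 2 * i + 1

def parent(i):
    return i // 2

assert children(1) == (2, 3)   # the root's children
assert children(3) == (6, 7)
assert parent(7) == 3
assert parent(8) == 4          # first slot of the 8 - 15 level
```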

      In your case, I'd store your actual data in a hash as one would normally do. I'd use an array-as-heap to store hash keys, and another hash to be able to map keys to heap locations. That mapping is bidirectional, so you don't need any linear scans. If you need to remember explicit weights, it's cheapest to do that using another $key => $weight hash -- that way you avoid having zillions of tiny anonymous data structures, and keeping two or three hashes with identical key sets in sync is a no-brainer anyway.
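      That three-part design -- a data hash, an array-as-heap of keys, and a key-to-position map so that neither direction needs a scan -- can be sketched as follows. (Python for brevity; all names are illustrative. This is a max-heap; flip the comparisons for a min-heap if you want cheapest-to-evict at the root.)

```python
# An indexed heap: heap[] holds keys ordered by weight, pos{} maps
# each key to its current slot, so reweighting a key is O(log n)
# with no linear search. 0-based indexing, so children of i are
# 2i + 1 and 2i + 2, and the parent of i is (i - 1) // 2.

class KeyedHeap:
    def __init__(self):
        self.heap = []     # heap[i] is a key
        self.pos = {}      # key -> index into self.heap
        self.weight = {}   # key -> weight

    def _swap(self, i, j):
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.pos[self.heap[i]] = i
        self.pos[self.heap[j]] = j

    def _sift_up(self, i):
        w = lambda k: self.weight[self.heap[k]]
        while i > 0 and w(i) > w((i - 1) // 2):
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def _sift_down(self, i):
        n, w = len(self.heap), lambda k: self.weight[self.heap[k]]
        while True:
            big = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < n and w(c) > w(big):
                    big = c
            if big == i:
                return
            self._swap(i, big)
            i = big

    def insert(self, key, weight):
        self.weight[key] = weight
        self.pos[key] = len(self.heap)
        self.heap.append(key)
        self._sift_up(self.pos[key])

    def reweight(self, key, weight):
        old, self.weight[key] = self.weight[key], weight
        if weight > old:
            self._sift_up(self.pos[key])
        else:
            self._sift_down(self.pos[key])

    def pop_max(self):
        top = self.heap[0]
        self._swap(0, len(self.heap) - 1)
        self.heap.pop()
        del self.pos[top], self.weight[top]
        if self.heap:
            self._sift_down(0)
        return top
```

The actual cached values would live in a separate key-to-value hash, exactly as the paragraph above suggests; the heap only shuffles keys.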

      If that was too short and you haven't read Sedgewick's "Algorithms", get it. Actually, if you haven't read it, get it, regardless of whether the above was sufficient. It's a classic and a staple of CS for good reason.


        I appreciate your having stuck with me through this. Please understand that this is a gut reaction after a first reading of your post (and having previously looked at a CPAN Heap implementation), rather than a fully thought-through one. I will pursue this further.

        Using a weighted heap, moving an item to the top of the heap requires either:

        1. Giving that item a weight higher than every other item in the heap.

          Which could be done by:

          1. Giving the item being promoted the highest possible weight. No good if you need to do it twice.
          2. Iterating the heap to find the highest weight and adding one. One recursive operation that will then trigger another. (In Perl, not C.)
          3. Remembering the highest value yet allocated and incrementing it. (Possible, but it still triggers a recursive operation at the perl level.)
        2. Or: Adjusting the weights of every item between the item being promoted to the top, and the current top item.
        3. Then there is the question of how you raise an item's weight by 1 or 2 places.
        4. And the complexity (and performance) of recursive reference chasing in Perl code.

        Contrast this with the simplicity of the splice operation.

        Then contrast the efficiency of doing the linear operation in C with that of recursively dereferencing linked lists implemented using Perl hashes or arrays.

        O(N) -v- O(log N) doesn't tell the complete story when the former is performed in C code and the latter in Perl code. Especially heavily structured OO Perl code.

        Then take a look at the complexity (and the weight) of something like Heap::Priority.

        The order does matter, because it is my cheap mechanism for retaining and propagating the weightings of the elements. One operation takes care of adjusting all the weights of all the affected items in the stack, because their weights are their positions.

        For now, my "string list" mechanism looks like it will be easy to implement:

        <pre>
        $order = "ABCDE";

        %valByKey = (
            A => \$A_value,
            B => \$B_value,
            C => \$C_value,
            D => \$D_value,
            E => \$E_value,
        );
        </pre>

        Accessing C via its key calls for C to become the top item; getByKey( 'C' ) does:

        <pre>
        sub getByKey {
            my $key = shift;

            ## to top
            $order =~ s[(^.*)($key)][$2$1];
            ## Needs quotemeta-ing and adjusting for 3-char keys.
            ## May also require C<no warnings 'uninitialized'> to
            ## suppress warnings when the item being moved is at the
            ## beginning or the end of the string.

            ## to bottom
            ## $order =~ s[($key)(.*$)][$2$1];
            ## up one
            ## $order =~ s[(.?)($key)][$2$1];
            ## down one
            ## $order =~ s[($key)(.?)][$2$1];
            ## up two
            ## $order =~ s[(..)?($key)][$2$1];
            ## down two
            ## $order =~ s[($key)(..)?][$2$1];

            return $valByKey{ $key };
        }
        </pre>


        <pre>
        $order = "CABDE";

        %valByKey = (
            A => \$A_value,
            B => \$B_value,
            C => \$C_value,
            D => \$D_value,
            E => \$E_value,
        );
        </pre>

        This satisfies all my criteria: it's lightweight, pretty damn fast, and easy to program.

        Its only limitation (that I can see) is the fixed-length keys, but generalising that can be an exercise for another time.
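        One way to lift the fixed-length restriction (an assumption on my part, not something from the thread): join the keys with a separator byte that cannot appear in any key, and splice on separator boundaries instead of fixed offsets. (Python for brevity; `SEP` and `move_to_top` are illustrative names.)

```python
# Variable-length keys in one order string, delimited by a byte
# assumed never to appear in a key.

SEP = "\x00"

def move_to_top(order, key):
    keys = order.split(SEP)
    keys.remove(key)              # still a linear search, as before
    return SEP.join([key] + keys)

order = SEP.join(["ab", "longer-key", "x"])
order = move_to_top(order, "longer-key")
# order.split(SEP) is now ["longer-key", "ab", "x"]
```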

