Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Re-orderable keyed access structure?

by BrowserUk (Pope)
on Aug 14, 2004 at 02:36 UTC ( #382884=perlquestion: print w/replies, xml ) Need Help??

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

I have a collection of items. They must be accessible by key (which says hash), but also by position (which says array).

This begins to sound like an Tie::( DxHash/IxHash/InsertOrderedHash ), but it's not.

I need to be able to:

  1. re-order the items in the array, but retain the keyed access from the hash.
  2. alter an items position in the array, having located it by key.
  3. discover the items key, having located it through array indices.
hash array +---+---+ +---+ | A | | -\ /---> | | 0 +---+---+ \ / +---+ | B | | ---/-\ /-> | | 1 +---+---+ / \ \ +---+ | C | | -/ / \-> | | 2 +---+---+ / \ +---+ | D | | -\ / \-> | | 3 +---+---+ \ +---+ | E | | -/ \-----> | | 4 +---+---+ +---+

If I use the indices for the array elements as the values of the hash, I fail requirement 1).

If I use references to the array elements as the values of the hash, 1) is satisfied but 2) & 3) aren't.

If I store the data as the value in the hash, and put a copy of the key in the array, then I satisfy 1) & 3), but not 2).

So my question is: Is there some method method of satisfying all 3 requirements without resorting to a linear search of either the array or the hash values?

It feels as though one extra level of indirection is required, but I can't see where to put it.

If it makes a difference, the items themselves are already references to arrays containing several items, and I'd prefer to avoid to many further levels of indirection.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Replies are listed 'Best First'.
Re: Re-orderable keyed access structure?
by Aristotle (Chancellor) on Aug 14, 2004 at 03:55 UTC

    You don't need another level of indirection, you need to make your indirection bidirectional. Besides the actual array storing the values, use a hash to look up positions from keys and an array to look up keys from positions. You just need to be careful about updating the bookkeeping structures in the right order.

    Makeshifts last the longest.

      Could you elaborate this a bit?

      If I have an array of indices to the positions of the items in another array. When one of the items moves from the middle to the begining or end of it's array, all the indices for the items between it's old position and the end to which it has moved change. I'm not sure what that buys me?

      I think if I could get (readonly) references to the hash keys, not just the hash values it might allow bi-directional indirection. But I don't know how to do that.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

        No, no. I mean this:

        my @actual_data = ( [ ... ], [ ... ], [ ... ], ); my %index_of = ( foo => 0, bar => 1, baz => 2, ); my @key_for = qw( foo bar baz ); my $lookup_by_key = $actual_data[ $index_of{ 'bar' } ]; my $lookup_by_idx = $actual_data[ 1 ]; my $key_from_idx = $key_for[ 1 ];

        When you move things around in @actual_data, you need to the move the keys around in exactly the same fashion in @key_for. You also need to collect all keys whose values changed positions and update them in %index_of accordingly. To just swap two elements, this is trivial:

        sub swap_elements_by_idx { my ( $i, $j ) = @_; @actual_data[ $i, $j ] = @actual_data[ $j, $i ]; @key_for[ $i, $j ] = @key_for[ $j, $i ]; @index_of{ @key_for[ $i, $j ] } = @index_of{ @key_for[ $j, $i ] }; }

        Reformulating this for atomic operations involving more than two elements is trickier, since the order of operations is then significant, but shouldn't be very hard either.

        Makeshifts last the longest.

Re: Re-orderable keyed access structure?
by kvale (Monsignor) on Aug 14, 2004 at 03:12 UTC
    If the array is kept sorted, for some appropriate comparison operator, then searches for keys in that array are all O(log N). Hash lookups are also typically logarithmic, so it seems that array + binary search would do what you want: reorder the array as needed while still locating keys with good efficiency.

    -Mark

      The ordering of the array isn't dependant upon the values of the referenced items. The positions change purely on the basis of theway in which they are accessed--kind of like a least-recently used or most recently used algorithm--so there is no sort order through which to perform a binary search. That I can see anyway.

      Each time an item is accessed, via it's key, it's position may be changed. Conversely, if an attempt is made to read a new item via its key, and it isn't already in the array, the top item may be popped off the array to free up room for the new item to be added. When this is done, the old item's hash reference must be nulled or deleted.

      Hence the need to able to discover the hash key via it's array index, and it's array index via it's hash key.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: Re-orderable keyed access structure?
by etcshadow (Priest) on Aug 14, 2004 at 03:39 UTC
    How about just a hash of [$array_index, $payload] and an array of hash keys? Granted, a lot of operations would require hitting both the array and the hash... but big deal, abstract them behind functions or methods.

    Or did I miss something (quite likely)?

    ------------ :Wq Not an editor command: Wq

      Everytime an array element changed position, not only it's index changes, but a lot of--and sometimes all--other array elements indices change also. That would require iterating the hash (via the array of keys) to adjust all the embedded indices.

      It's not the complexity that is the problem, it is the time element.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
        So you want to be able to find the index by searching on the key (and you want this to be fast, i.e. not scanning an array)... this indicates that at some level you can find the array index from within the value of hash by key. But you also want to be able to update all of the indexes without iterating said hash? That seems like a contradiction. I must be misunderstanding something.
        ------------ :Wq Not an editor command: Wq
Re: Re-orderable keyed access structure?
by CombatSquirrel (Hermit) on Aug 14, 2004 at 09:12 UTC
    Maybe this does what you want: I'm in a bit of a hurry now, but it maintains both a hash and an array of array references whith item, array index and hash key in it. The data structures don't have to be moved, which is an advantage for large data structures, but more than one data item has to be modified when items are swapped. I hope the code explains the rest.
    Hope this helped.
    CombatSquirrel.

    Entropy is the tendency of everything going to hell.
Re: Re-orderable keyed access structure?
by ikegami (Pope) on Aug 14, 2004 at 09:48 UTC
    Here's a solution. Add methods to your liking.
      You could also wrap this in a native interface:
      my $obj = new BrowserUKStruct; tie my %hash, "BrowserUKHash", $obj; tie my @array, "BrowserUKArray", $obj;
      It should be quite easy, with all the tie methods simply translating.

      You'd have to decide on a proper interface though, like is what you set also what you get (is $hash{key} an array reference of index and value?). You could also decide, for simplicities sake that the hash and array interfaces are read only, and modify the object directly.

      I think this is a wonderful example of where perl fscking kicks a**. The ability to glue an arbitrary store, like an object, into a hash and array interface, simultaneously is simply superb. Moreover, this is a good example of what tie comes to solve. By giving you the possibility of treating your hash-like and array-like data as real hashes and arrays, you learn a good lesson about the value of abstraction, and polymorphism. Well, I did, at least, when I was obsessed with tie.

      -nuffin
      zz zZ Z Z #!perl
Re: Re-orderable keyed access structure?
by Limbic~Region (Chancellor) on Aug 14, 2004 at 14:17 UTC
    BrowserUk,
    My module Tie::Hash::Sorted does most of what you are asking for, albeit some stuff is "under the covers". You are welcome to abuse it in any way you want for your purposes or you can send me a feature request and I will do it when I have a chance.

    Cheers - L~R

      Thanks L~R, but it you take a look at 382970, you'll see that the "ordering" isn't sorting.

      The re-ordering occurs when a hash element is accessed, according to a hueristic that includes (for example), if a hash element is required that isn't currently being held in the (internal) array, then the lowest ordered item in the array may be discarded to accomodate the called for item. However, an item's position in the array is adjusted each time it is accessed (always by key).

      If an item has been STORE'd to, the underlying item will have been modified and so discarding it from the array will require an update to the backing store first.

      If the item has only ever been FETCH'd, discarding it is cheaper as the backing store doesn't need an update.

      The hueristic can also take into account the frequency of access and/or the recentness(better word?) (or otherwise), of it's last access.

      As you can see, this doesn't easily translate into a sort criteria that could be used with your module. (Unless you can see a way :).

      Thanks again.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
        BrowserUk,
        Ok, in light of this further explanation - Tie::Hash::Sorted isn't for you. To be honest, I only read the root node in the thread which didn't include these details. I do want to point out that sorted can be equal to ordered if the sort routine uses a lexical that defines that order. That is one of the great things about the module - the sort routine can be anything you dream up to include defining the specific order 1 by 1.

        Cheers - L~R

Re: Re-orderable keyed access structure?
by Aristotle (Chancellor) on Aug 14, 2004 at 21:05 UTC

    This is really in reply to Re^8: Re-orderable keyed access structure?, but I'm breaking the thread structure here to go back one step and ask what it really is that you need to achieve. Maybe, if you state your actual problem, someone can propose a different way of looking at it that would indicate a data structure more directly suited to your ultimate goals.

    Re^2: Re-orderable keyed access structure? f.ex makes me wonder whether the actual order is at all important, or whether you simply require it as a corollary of your way of looking at your problem. That description looks vaguely like you might want a heap instead of a list.

    But I don't know, so I could be completely wrong on this, and therefor I'm asking for what it really is you're trying to solve.

    Makeshifts last the longest.

      The ordering is important. I'm using a tied hash representing keyed data on disk. The data is too large to fit in memory and so the tied hash internals read/write to/from disk as required. That's all working fine, but hitting the disk for every read & write has a cost.

      To mitigate that cost, I wish to cache some amount of the data. The 'items' in my descriptions are (references-to + state-of) the cached data.

      The hash in the description is the internal (to the tie) keyed access to the cache.

      The array is an ordered stack that maintains the "timeliness" of the cache. Things nearer the top should be retained longest when possible. Things near the bottom should be discard first. The ordering of the array is it's whole purpose for existing.

      Each time an element in the tied hash is accessed, the first thing to do is look to see if it is currently in the cache (the internal hash). If it is, then it can be returned (FETCH) or modified(STORE) as required. Having been accessed, it's position in the stack is adjusted according to some heuristic.

      If a LRU algorithm was chosen, then any access would promote the item's tag (reference or key or index) from it's current place in the body of the stack to the top. Everything previously above it moves down one.

      When an attempt to access a key in the tied hash results in a miss in the internal cache, the bottom element of the stack is shifted off to free up space for the new item read from disk.

      But a standard LRU algorithm isn't everything. If the bottom element of the stack has been updated, then it will need to be written back to disk before being discarded. If there is an unchanged element in the stack above it, it may be cheaper to discard that and leave the modified element in the cache until there are no unmodifed elements left to discard in case another modification for that modified element comes along in the near future.

      So, a possible algorithm to use would be to place any newly read-from-disk element at the top of the stack. Any FETCH accesses to an element already in cache promotes it's position in the stack by one place. Any STORE accesses to an existing element in the stack might promote it's position 2 places. In this way, modified elements will tend to be retained longer, after their last access, than unmodified elements.

      There are numerous variations upon this scheme that I would like to try. Put new element into the middle of the stack, and then allow their position to be adjusted relative to the accesses made. And so on.

      So, the obvious choice is to use an array to maintain the ordering as splice can easily perform all the operations described above. MoveToTop(), MoveToBottom(), promote(N), demote(N). Complete flexibility to tailor the heuristic--even dynamically.

      But, splice requires indexes. The data is accessed via keys. There needs to be a way of mapping keys to index so that each time a cache element is accessed, the appropriate adjustment can be made to the stack.

      But when an item is shifted off the bottom of the stack, I need to be able to map that item back to it's key, so that it can be removed from the the internal cache-hash.

      The overall scheme is somewhat more complicated than that, but hopefully this describes why I need the two types of structure, the two types of access, and why none of the existing caching mechanisms that I've looked at on CPAN will do what I am trying to do.

      Placing the items in the array, and storing references to the array elements as the values of the cache-hash, means that keyed access to the items is simply one level of indirection from the key.

      It also means that as the positions of the elements in the array get swaped around, the keyed access to them is self-maintained.

      The problem creeps in when shifting the last element off the array in order to discard it. The corresponding (internal) hash element needs to be deleted also, but the only way I can see to access it, is to iterate the keys of the hash and inspect the values, until I find one that matches a reference to the element being discarded. That's a slow linear search that I was hoping to avoid.

      The best scheme I have come up with so far is to store fixed length (padded) copies of the keys in a string. This allows me to use substr to 'splice' the keys around easily. Locating the index for the key to move requires a linear search, but linear searching a string is much faster than linear searching an array. It also has the upside of requiring less memory for the infrastructure.

      The downsides are that the keys must be fixed length. As it happens, for my particular application, the keys are fixed length (3-chars) anyway, but I would still like to avoid the linear search (or structure iteration) if I could. It increasingly seems that this isn't possible.

      I'd also have liked the caching mechanism to be generic, which I loose with the fixed-length key restriction. For a few applications, it may be possible to predict the maximum length and pad to achieve this, but for many others, the length of the keys would not be known until runtime.

      I apologise for there being too many words, but my attempts to explain it with less have obviously failed dismally :)


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

        So on the outside, the access is by key only? Then the order doesn't matter at all. It's just your way of looking at the weighting of your items, but that's all your task calls for — not an order, but a way to weight elements. Your single constraint is that there needs to be a fast way to locate the item with the lowest (or highest, depending which way around you want to define it) weight.

        Well, that is indeed what a heap is good for. A heap is a complete binary tree with a simple condition: the parent's value is always greater than either of its childrens' values. That means the greatest value of the tree is always at its root. Leaves only exist at the very bottom of the tree. To insert an element, you attach it to a free slot of any of the leaves. In that case, or if you adjust a child's value such that it violates this requirement, the violation is trivial to fix: you swap it with its parent, which may recursively cause a new violation with regard to its new parent. Since the structure is always a complete binary tree and you have to do this for any operation (either up the tree or down the tree depending on whether you are inserting or removing elements), all operations happen in O( log n ).

        This is obviously faster than your linear-time splices.

        And there is more excellent news for you: since a heap is a complete binary tree, it's easy to map to an array. Just define the children of the element at position i to be the elements at 2i and 2i + 1. Obviously, the parent of an element is int( i / 2 ). That's for a 1-based index, for 0-based index, you need some additions and subtractions to twiddle the numbers appropriately. The tree root is then at 1, its children are at 2 and 3, their respective children are at 4, 5 and 6, 7, the next level of siblings is at 8 - 15, etc.

        In your case, I'd store your actual data in a hash as one would normally do. I'd use an array-as-heap to store hash keys, and another hash to be able to map keys to heap locations. That mapping is bidirectional, so you don't need any linear scans. If you need to remember explicit weights, it's cheapest to do that using another $key => $weight hash — that way you avoid having zillions of tiny anonymous data structures, and keeping two or three hashes with identical key sets in synch is a no-brainer anyway.

        If that was too short and you haven't read Sedgewick's "Algorithms", get it. Actually, if you haven't read it, get it, regardless of whether the above was sufficient. It's a classic and a staple of CS for good reason.

        Makeshifts last the longest.

Re: Re-orderable keyed access structure?
by injunjoel (Priest) on Aug 15, 2004 at 00:20 UTC
    Greetings all,
    After much discussion with BrowserUK I realized I was barking up the wrong tree...
    Perhaps you could store the value of the key in the hash based index as the first element of the data its pointing to? That way the first element of each of the original arrays would be the key of the hash used to access it. The key can be used to get the array or vice versa.
    use strict; my @ref_array = ([state,pointer],[state,pointer],[state,pointer]); my $i = -1; my %key_index = map($i++; unshift(@{$_},$i); $i,\$_;)@ref_array;

    Of course I have not tested this (my BSD partition is mad at me presently). But I think the logic will satisfy your 3 requirements.

    UPDATE:(Aug.16th)
    Just a quick note the my @ref_array illustrated above was meant to represent the original items mentioned in the original post. The [state,pointer] is in reference to another post in which the structure of the original items was explained...
    my apologies for the confusion.
    example below:
    This is what I was talking about.
    %key_index items in anon.arrays (@ref_array in my post) { 0 => \itemA } [ 0, status, dataA ] { 1 => \itemB } [ 1, status, dataB ] { 2 => \itemC } [ 2, status, dataC ]

    this way the key (0-2 in the example) is stored with the original data. that way you can always get the key of the index_hash from the element itself. The shuffling should not matter at that point.



    -injunjoel
    "I do not feel obliged to believe that the same God who endowed us with sense, reason and intellect has intended us to forego their use." -Galileo

      Note that anonymous arrays aren't free. Their memory consumption grows alarmingly when you have 6-digit amounts of them. Since you have a fixed amount of two elements per anonymous array, it is advisable to flatten that AoA to two parallel named arrays.

      I would not normally be giving this advice; in general, it is better to keep things that belong together, together. With two arrays, you will need to maintain the code to do all operations twice, and the CPU will need to execute all operations twice, as well. It violates the DRY principle.

      But as someone who's dealt with 600MB Perl processes, I know the pain of too-deep structures. When you're slinging around huge amounts of data, the extra CPU cycles spent to maintain parallel structures are easily offset by the amount of additional data you'll be able to fit into memory at once (and therefor, going to disk much less frequently (particularly if, goodness forbid, you'd otherwise end up swapping.))

      Makeshifts last the longest.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://382884]
Approved by kvale
Front-paged by PERLscienceman
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others examining the Monastery: (8)
As of 2019-10-14 13:25 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Notices?