good chemistry is complicated,
and a little bit messy -LW
Perl's hashes are incredibly useful. They lend themselves to so many purposes and algorithms. Although I'd encountered them before I started using perl -- my first real scripting language was REXX where they are called associative arrays -- Perl's ability to nest other data structures, arrays and other hashes, within a hash to an arbitrary depth was somewhat of a revelation to me. In some senses, it changed the way I think about coding solutions to problems. The great benefits of hashes include
That all translates to an incredibly useful, useable and flexible tool.
There is one major downside. They are memory hungry. For example, storing integers 1 .. 1_000_000, as ASCII in a file one per line takes 7_888_898 bytes including the newlines, and just a few bytes more if loaded into memory as a string. However, loading the same 1 million integers into a hash as keys, with undef as the value requires 95 MB!
For many (most?) applications that manipulate reasonable volumes of data, that is fine, but there are a class of applications that manipulate large volumes of data, for which the memory hungriness does present a problem. In these applications, once the boundaries of memory capacity have been breached, they require a sea change in coding in order to compensate.
Even with the best will in the world, once a single large hash starts getting swapped to disc, you can almost right off the performance benefits of a hash. Unless you are extremely lucky, and your data is such that iterating the keys of the hash happens to move sequentially through memory -- something that is most unlikely to happen and near impossible to arrange -- then the process of the iteration is probably going to force the OS to traverse back and forth through the entire swapped storage, causing a huge number of hits to the file system. Even the most intelligent OS disc caching algorithms are unlikely to be able to do much to alleviate this situation. The result is, at best, a sudden and dramatic fall-off in performance once the free memory capacity of the machine has been breached. At worst, if swap space is limited or already consumed by other processes, you get the fatal, "out of memory" error and you're application is stopped dead in it's tracks.
Once this point has been reached, the first alternative usually suggested it to avoid loading all the data into memory by reading and processing it serially, which works fine for some applications, but perhaps the most oft-used use of hashes is as fast lookup tables, and these only work when all the data is memory resident.
The second alternative is one of the "structured file" DBs, like DB_file. Whilst these can easily handle large volumes of data, compared to an in-memory hash, they are interminably slow. Just creating a DB_File hash that contains 1 .. 1_000_000 as keys requires 42minutes of CPU and nearly an hour of elapsed time. And that is when generating the keys programmatically. If you have to read the keys from a file that resides on the same drive, it takes even longer. Of course, you only need do this once -- if your data is re-used -- but if your data is generated as a flat file and is only used once, then this overhead is expensive.
When it comes to iterating those 1 million keys, it takes over 15 minutes compared to just 8 seconds for an in-memory hash.
This dramatic transition from high speed, high-memory consumption to slow speed, low-memory consumption can be problematic for applications who's data sets are close to the memory limits. Using an in-memory hash for speed risks the application transitioning to swapping with the resultant unpredictable slowdown. Moving to using a DB impacts the performance of the application even when well below the memory limits. Either choice is unsatisfactory.
Perls arrays are less memory hungry than hashes, and if you sort the array, then using a binary search can be quite efficient and is fairly easy to implement. However, Storing 1 .. 1_000_000 as an array still requires 60 MB of memory and iterating the keys using a binary chop takes over 5 minutes.
The reduction in memory consumption by a third only delays the onset of memory exhaustion for a short while and the nearly 50X slowdown in access is disappointing.
As mentioned earlier, storing the same keys as a string cuts the memory consumption to just 7.5 MB, and this could be easily searched using regex or just index, but the searches are terribly slow. Searching for the first 10,000 keys takes around 15 seconds, but the last 1,000 takes nearly 10 minutes. A crude calculation shows that iterating the million keys would take something like 3 DAYS! Perl's very good at searching short strings quickly, but searching a 7.5 MB string is just too expensive to be practical.
However, if we could distribute the keys across enough scalars to keep them short, but still keep the total number of scalars (array) low so as not consume too much memory, we would then need to find a way to select the correct scalar to search. In effect, what we need is a hashing algorithm that will map the range of keys to a limited number of scalars (buckets) and then use a linear search to complete the search.
The memory-poor man's hash
If we used the ASCII value of the first character of each key, we can map the 1 million keys in my example to 9 strings and the storage requirements would be little more than the 7.5 MB required for a single scalar However, the search time would be reduced by an order of magnitude. With each of the 9 strings being 3/4 MB, the search time would still be too slow.
The next step is to use the first 2 chars of each key. This gives a mapping to 99 scalars, 90 of which are 75k each, (with the other 9 just 2 bytes each). Still the memory requirement is under 8 MB, but now the time taken to iterate the million has come down to around 25 minutes. Much better, but still much slower than DB_File, 200X slower than an in-memory hash. We can do better.
Moving to a 3-byte index, the memory requirement is still less than 8MB, but we now have 999 scalars, 900 of 7,654 bytes each. This now allows us to iterate the entire million keys in under 5 minutes. Substantially better than DB_File, and a approaching a reasonable compromise, trading 95 MB & 8 seconds to iterate for under 8MB and 260 seconds seems worthwhile. Can we do better still?
Moving to using a 4-byte index means we get 9,999 keys of 765 bytes each. The overhead of the 10-fold increase in scalars and the hash to hold them means that the memory requirement has risen to around 13 MB. However, the benefit of that trade is that we can now iterate the million keys in 44 seconds. Still a little slow compared to the 8 seconds, but an entirely reasonable trade to recover 82 MB of memory, or increase our usable memory capacity by a factor of 6.
Conclusions and caveats
Using this form of custom hashing can produce considerable memory saving/ capacity increase whilst retaining a reasonable degree of performance. The coding effort required is minimal and simple.
The caveat is that it would be difficult to wrap this up into a module. The choice of a 4-byte index works here because the nature of the data, numeric, means that 4-bytes will map all the keys to just 10,000 scalars, but if the data was uppercase alpha, this would result in potentially 456,976 scalars, or mixed case alpha to 7,311,616 and so on.
Whilst it is unlikely (or impossible) that the data would actually result in using the full range of these, the array or hash overhead would negate the memory saving. Also, if the keys in question had common prefixes, as might be the case with Memory Management Problem where log files are often named something like log20031121, using the first three characters for indexing would result in all the keys mapping to a single huge scalar with the obvious effect of horribly slow access. Equally, using the next 3 or 4 characters is unlikely to be satisfactory.
The correct choice of indexing characters is wholly dependant upon the nature of the data involved. The only alternative to having the programmer inspect and make an intelligent selection is to try for a universally applicable hashing algorithm, and perl already has that.
All timings and memory requirements taken using AS 5.8.0 running under NT/4 on a 233 MHz Pentium II.
Examine what is said, not who speaks."Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
janitored by ybiC: Balanced <readmore>'s associated with <h4>'s