Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?

A (memory) poor man's <strike>hash</strike> lookup table.

by BrowserUk (Patriarch)
on Nov 21, 2003 at 16:41 UTC ( [id://308947] : perlmeditation . print w/replies, xml ) Need Help??

Perl's hashes are incredibly useful. They lend themselves to so many purposes and algorithms. Although I'd encountered them before I started using perl -- my first real scripting language was REXX where they are called associative arrays -- Perl's ability to nest other data structures, arrays and other hashes, within a hash to an arbitrary depth was somewhat of a revelation to me. In some senses, it changed the way I think about coding solutions to problems. The great benefits of hashes include

The pro's

  • They are fast.
  • They are simple to use.
  • They are universal (polymorphic).

That all translates to an incredibly useful, useable and flexible tool.

The con's

There is one major downside. They are memory hungry. For example, storing integers 1 .. 1_000_000, as ASCII in a file one per line takes 7_888_898 bytes including the newlines, and just a few bytes more if loaded into memory as a string. However, loading the same 1 million integers into a hash as keys, with undef as the value requires 95 MB!

For many (most?) applications that manipulate reasonable volumes of data, that is fine, but there are a class of applications that manipulate large volumes of data, for which the memory hungriness does present a problem. In these applications, once the boundaries of memory capacity have been breached, they require a sea change in coding in order to compensate.

Even with the best will in the world, once a single large hash starts getting swapped to disc, you can almost right off the performance benefits of a hash. Unless you are extremely lucky, and your data is such that iterating the keys of the hash happens to move sequentially through memory -- something that is most unlikely to happen and near impossible to arrange -- then the process of the iteration is probably going to force the OS to traverse back and forth through the entire swapped storage, causing a huge number of hits to the file system. Even the most intelligent OS disc caching algorithms are unlikely to be able to do much to alleviate this situation. The result is, at best, a sudden and dramatic fall-off in performance once the free memory capacity of the machine has been breached. At worst, if swap space is limited or already consumed by other processes, you get the fatal, "out of memory" error and you're application is stopped dead in it's tracks.

The alternatives

Serialised processing

Once this point has been reached, the first alternative usually suggested it to avoid loading all the data into memory by reading and processing it serially, which works fine for some applications, but perhaps the most oft-used use of hashes is as fast lookup tables, and these only work when all the data is memory resident.

Structured DB's

The second alternative is one of the "structured file" DBs, like DB_file. Whilst these can easily handle large volumes of data, compared to an in-memory hash, they are interminably slow. Just creating a DB_File hash that contains 1 .. 1_000_000 as keys requires 42minutes of CPU and nearly an hour of elapsed time. And that is when generating the keys programmatically. If you have to read the keys from a file that resides on the same drive, it takes even longer. Of course, you only need do this once -- if your data is re-used -- but if your data is generated as a flat file and is only used once, then this overhead is expensive.

When it comes to iterating those 1 million keys, it takes over 15 minutes compared to just 8 seconds for an in-memory hash.

Dramatic transition

This dramatic transition from high speed, high-memory consumption to slow speed, low-memory consumption can be problematic for applications who's data sets are close to the memory limits. Using an in-memory hash for speed risks the application transitioning to swapping with the resultant unpredictable slowdown. Moving to using a DB impacts the performance of the application even when well below the memory limits. Either choice is unsatisfactory.

Other alternatives.


Perls arrays are less memory hungry than hashes, and if you sort the array, then using a binary search can be quite efficient and is fairly easy to implement. However, Storing 1 .. 1_000_000 as an array still requires 60 MB of memory and iterating the keys using a binary chop takes over 5 minutes.

The reduction in memory consumption by a third only delays the onset of memory exhaustion for a short while and the nearly 50X slowdown in access is disappointing.

A scalar

As mentioned earlier, storing the same keys as a string cuts the memory consumption to just 7.5 MB, and this could be easily searched using regex or just index, but the searches are terribly slow. Searching for the first 10,000 keys takes around 15 seconds, but the last 1,000 takes nearly 10 minutes. A crude calculation shows that iterating the million keys would take something like 3 DAYS! Perl's very good at searching short strings quickly, but searching a 7.5 MB string is just too expensive to be practical.

However, if we could distribute the keys across enough scalars to keep them short, but still keep the total number of scalars (array) low so as not consume too much memory, we would then need to find a way to select the correct scalar to search. In effect, what we need is a hashing algorithm that will map the range of keys to a limited number of scalars (buckets) and then use a linear search to complete the search.

The memory-poor man's hash

If we used the ASCII value of the first character of each key, we can map the 1 million keys in my example to 9 strings and the storage requirements would be little more than the 7.5 MB required for a single scalar However, the search time would be reduced by an order of magnitude. With each of the 9 strings being 3/4 MB, the search time would still be too slow.

The next step is to use the first 2 chars of each key. This gives a mapping to 99 scalars, 90 of which are 75k each, (with the other 9 just 2 bytes each). Still the memory requirement is under 8 MB, but now the time taken to iterate the million has come down to around 25 minutes. Much better, but still much slower than DB_File, 200X slower than an in-memory hash. We can do better.

Moving to a 3-byte index, the memory requirement is still less than 8MB, but we now have 999 scalars, 900 of 7,654 bytes each. This now allows us to iterate the entire million keys in under 5 minutes. Substantially better than DB_File, and a approaching a reasonable compromise, trading 95 MB & 8 seconds to iterate for under 8MB and 260 seconds seems worthwhile. Can we do better still?

Moving to using a 4-byte index means we get 9,999 keys of 765 bytes each. The overhead of the 10-fold increase in scalars and the hash to hold them means that the memory requirement has risen to around 13 MB. However, the benefit of that trade is that we can now iterate the million keys in 44 seconds. Still a little slow compared to the 8 seconds, but an entirely reasonable trade to recover 82 MB of memory, or increase our usable memory capacity by a factor of 6.

$hash{ substr $_, 0, 4 } .= $/ . $_ for 1 .. 1000000; $n=0; print time; for (1 .. 1000000) { next unless exists $hash{ substr $_, 0, 4 } and 1+index( $hash{ substr $_, 0, 4 }, $/.$_ ); $n++; } print time, $/, $n; 1069427237 1069427281 1000000

Conclusions and caveats

Using this form of custom hashing can produce considerable memory saving/ capacity increase whilst retaining a reasonable degree of performance. The coding effort required is minimal and simple.

The caveat is that it would be difficult to wrap this up into a module. The choice of a 4-byte index works here because the nature of the data, numeric, means that 4-bytes will map all the keys to just 10,000 scalars, but if the data was uppercase alpha, this would result in potentially 456,976 scalars, or mixed case alpha to 7,311,616 and so on.

Whilst it is unlikely (or impossible) that the data would actually result in using the full range of these, the array or hash overhead would negate the memory saving. Also, if the keys in question had common prefixes, as might be the case with Memory Management Problem where log files are often named something like log20031121, using the first three characters for indexing would result in all the keys mapping to a single huge scalar with the obvious effect of horribly slow access. Equally, using the next 3 or 4 characters is unlikely to be satisfactory.

The correct choice of indexing characters is wholly dependant upon the nature of the data involved. The only alternative to having the programmer inspect and make an intelligent selection is to try for a universally applicable hashing algorithm, and perl already has that.


Data structure
Memory requirement
Time to iterate
Standard hash 95 8
DB_File ? 967 ( > 15 mins )
Array 60 322 ( > 5 mins)
Single (huge) Scalar 7.5 ~= 300,000 ( > 3 days )
1-byte index 7.5 ~= 32000 ( > 8 hours )
2-byte index 8 1,572 (25 mins)
3-byte index 8 258 ( 4.5 minutes )
4-byte index 13 43

All timings and memory requirements taken using AS 5.8.0 running under NT/4 on a 233 MHz Pentium II.

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail

janitored by ybiC: Balanced <readmore>'s associated with <h4>'s

Replies are listed 'Best First'.
Re: A (memory) poor man's hash
by tilly (Archbishop) on Nov 21, 2003 at 19:35 UTC
    First an important correction. The amount of memory taken to store a hash varies widely depending on how you do it. On 5.8, initializing a large hash with undef $foo{$_} for 1..1_000_000; takes about 69 MB. Initializing with undef @foo{1..1_000_000}; runs faster but takes about 130 MB. (This is run on Perl 5.8.1 on Linux, YMMV.) I'm guessing that the difference is how much buffering I am doing.

    The second tip. If you are doing to use dbm with a lot of data, always, always, always use a B_TREE. On a faster machine than yours (about a factor of 10), my insert time was 12 seconds. So that is 2 minutes. See the docs for a better overview and if you need to tune it more, see here in addition.

    Now I'll grant you that walking keys close to in order is pretty much a best case scenario, but very few real-world scenarios don't have considerable locality of reference, so B_TREE accesses tend to win big on datasets of under a few hundred million.

    And this leads to a general note. Hashes are probably the fastest general-purpose access method when you have a flat memory model. And they serve Perl very well. But the real world is moving more and more towards multiple tiers of data storage with very different speeds between them. (In decreasing order of speed: on CPU we have level 1, 2, and 3 caches, followed by main RAM, NUMA adds a few layers here, then we have local hard drives, attached storage devices, network drives, then stuff remote to the LAN.) Given differentials in Moore's law, this trend is likely to continue growing in importance as the ratios between the speeds of these layers get more and more extreme, making alternatives to hashing make more and more sense.

      First an important correction. The amount of memory taken to store a hash varies widely depending on how you do it....

      The difference in the memory consumed by the 2 methods you show is that with undef @foo{ 1 .. 1_000_000 ];, the extra memory is consumed by generating the list 1 .. 1_000_000 prior to generating the hash slice.

      When you use ...for 1 .. 1_000_000;, no list is generated as for has an optimisation to allow integer ranges to act as iterators rather than list generators.

      The method I used was

      $hash{ $_ } = undef for 1 .. 1_000_000;
      but on my machine,
      undef $foo{ $_ } for 1 .. 1_000_000;
      results in an identical memory consumption. Prior to the allocation of the hash, the perl process has 3332k allocated to it. After the allocation, it has 98184k allocated. Hence my assertion that the memory required is 95 MB. I can't explain the difference between the 95 MB I am seeing and the 69 MB you reported.

      I tried using Devel::Size to verify the memory allocation the OS is reporting is actually being used by the hash rather than some portion of it being intermediate working storage that would be available to the process for other purposes afterwards, but calling either size( \%hash ); or total_size( \%hash ); both doubled the memory consumed by the process, pushing it beyond both my physical and virtual memory capacity before segfaulting.

      Using DB_TREE rather than DB_HASH does speed up the insertion considerably.

      use DB_File; tie %hash, 'DB_File', 'tree1M.db' , O_RDWR|O_CREAT, 0666, $DB_BTREE or die $!; print time; $hash{ $_ } = undef for 1 .. 1_000_000; print time; 1069482082 1069482433 # 351 secs (5.85 mins) $n=0; print time; exists $hash{ $_ } and $n++ for 1 .. 1_000_000; print time +, $/, $n; 1069482523 1069482799 # 276 secs (4.6 mins) 1000000

      It reduces the insertion time from an hour to 5 minutes and the iteration time from 15 minutes to 4 at the cost of increasing the size of the database file from roughly twice the flat file size to 4x.

      If you are only using the hash for it's fast "does it exist" property, then using the custom hash mechanism I described contrasts favourably with the insertion taking 16 seconds and the iteration 44.

      print time; $hash{ substr $_, 0, 4 } .= $/. $_ for 1 .. 1_000_000; $hash{$_} .= $/ + for keys %hash; print time; 1069484550 1069484566 # 16 seconds. $n=0; print time; for ( 1 .. 1_000_000 ) { next unless exists $hash{ substr $_, 0, 4 } and 1+index( $hash{ substr $_, 0, 4 }, $/ . $_ . $/ ); $n++; } print time, $/, $n; 1069484612 1069484656 # 44 secs. 1000000

      With respect to caching. I was recently playing with prime number generators/testers and came across some discussion about techniques for optimising code (C and assembler) to make best use of L1 and L2 caching. I'll admit that much of the discussion was over my head, but my conclusion was that compilers will have to become much clever than they currently are before general code will fully benefit from these.

      The current trend to making caches larger and larger will mitigate this somewhat, but the depth of understanding required to fully utilise these caches and tailor algorithms to them is so great, and so processor specific, that I doubt that it will ever be more than a niche exercise.

      Whether processors will ever get to the point where the microcode is capable of utilising out-of-order execution, re-ordering the pipelines, or re-writing the instruction caches to optimise cache hit rates at anything beyond the keyhole scale I'm not sure. Given that the disposable processor you get in musical birthday cards has more processing power in its 4mm2 of silicon that ENIAC did in its 200k watt consuming 30 tons, who knows what a "computer" will be capable of in another 60 years :)

      It's just a shame I won't be around to see it.

      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail

        I thought of the overhead of building the list, but the difference was more than I would think that that takes into account. But I did not measure it. However I just used Devel::Size and it reported 32083260 as the size of the hash and 44083260 as the total_size. (Regardless of how I built that hash.) My guess is that the rest is extra buffering on the part of various memory requests. Part of which is going to be the OS buffering Perl's mallocs, which is likely to be highly OS dependent, and possibly order of request dependent as well.

        On the implementation, note that the performance can be improved both by moving from tied data structures to OO access and again (probably more) by using documented tuning parameters. Additionally with a dataset this small, it may be acceptable to switch back to a hash stored in memory (since BerkeleyDB's hashing takes less memory than Perl's). Furthermore the fact that it is general-purpose makes me more willing to take that approach rather than constructing a special-purpose solution that is possibly fragile.

        And on caching. I agree heartily that optimizing for perfection is only going to be a niche exercise. I disagree that optimizing for better use of multi-level memory is unimportant though. I predict that you will see programmers over time being pushed towards data structures that are readily tuned to local memory configurations, dynamically if possible. You could view the popularity of databases as a step in this direction. A far bigger step is Judy arrays. Sure, it is very complicated to implement, but you only need to implement once. (Well, Perl would need to reimplement for licensing reasons...) And from the end programmer perspective, it is just a question of replacing the black box of a hash by the black box of another associative array implementation and voila! Things run faster!

        A few more years of increasing discrepancies between access times, and the difference will become bigger still. Eventually it will be big enough that common scripting languges will make the switch. Most programmers will never need to understand what actually happened, any more than we really understand most of the ongoing changes that make Moore's law tick in the first place.

Re: A (memory) poor man's hash
by PetaMem (Priest) on Nov 21, 2003 at 19:00 UTC

    Congratulations, you just invented a kind of a Trie.

    Besides that, you speak of a very serious problem (at least for us). And we're not that poor on memory. But Perl forces us to migrate some applications on 32GB RAM machines. Alas, they're 64bit and Perl is much more of a memory hog there than it is on 32bit. *sigh*

    A less memory-hoggy Perl is on my wishlist for 5.10 and compare also this node of mine.

    This problem has only one solution. A better implementation of hashes within perl - or at least some option where one can choose whether he'd like to trade memory for speed.

    Again - I hope to see it in 5.10.

        All Perl:   MT, NLP, NLU

      Yeah it appears to be a hybrid Trie implementation. Which I personally find extremely amusing. The primary drawback with using Trie's is that they are typically terribly wasteful of memory. (Except if the data set is extremely dense.) Which is why it turns out that most times Tries are implemented they are implemented in a hybrid form, or are designed to work on a limited data set or both. OTOH one advantage of a pure Trie lookup is that a C implementation of the algorithm performs similarly or better than a hash lookup but does prefix matching as well.


        First they ignore you, then they laugh at you, then they fight you, then you win.
        -- Gandhi

Re: A (memory) poor man's hash
by merlyn (Sage) on Nov 21, 2003 at 17:52 UTC
    However, loading the same 1 million integers into a hash as keys, with undef as the value requires 95 MB!
    I stopped reading there. I don't see your point. Besides storing all the data, you now have a meta-data structure that can tell you rather rapidly if $x is a member of this set you've created, as well as associate another scalar with each of those million keys!

    You've got a lot more information than what you started with. You're not merely storing the keys.

    If your complaint is that you want to be able to just store the keys, then yes, a hash was a bad choice, as you go on to point out.

    But don't fault Perl's hash structure. It's very efficient for the task at hand.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      I've changed the title of the post to more accurately reflect the intent of the post.

      I thought my opening paragraph(s) made it very clear that I have no "complaint", nor was I " fault<ing> Perl's hash structure" for the vast majority of applications.

      The only application I was suggesting this would be useful for is the "... perhaps the most oft-used use of hashes is as fast lookup tables..." as exampled by the referenced post Memory Management Problem.

      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail

Re: A (memory) poor man's <strike>hash</strike> lookup table.
by cees (Curate) on Nov 22, 2003 at 03:10 UTC

    There appears to be a bug in your code which is masked by the data you chose and the test you made. Your search only matches the start of the key, when it should match the end as well. For example with your data, a search for the key 12345 will look in bin 1234 for a string matching $/.'12345', but it could find a string like 1234567 and still return a match.

    Also, for interest sake, you can reduce the memory footprint further by stripping out the bucket index from each of the keys. Using the above example again, if we wanted to store 1234567, we can just store 567 in the bucket keyed by 1234. The trivial case of storing 1234 will be stored as an empty string, which will be found correctly by searching for $/$/.

    The following code reduced the footprint by another couple of megs. This reduction should also speed up the searching somewhat.

    #!/usr/bin/perl $hash{ substr $_, 0, 4 } .= $/ . substr($_, 4) for 1 .. 1000000; $hash{$_} .= $/ foreach keys %hash; $n=0; print time, $/; for (1 .. 1000000) { next unless exists $hash{ substr $_, 0, 4 } and 1+index( $hash{ substr $_, 0, 4 }, $/.substr($_, 4).$/ ); $n++; } print time, $/, $n;

    Other than that, it is an interesting concept that I can't see myself ever using :)

    - Cees

      Indeed you are correct, my original code is flawed as you describe. Thanks for pointing out my error Cees++

      I've re-run a subset of the tests using the following slightly modfied version which corrects this error and the correction slows the insertion time by approximately 1 second for the million keys. The iteration time comes out exactly the same.

      print time; $hash{ substr $_, 0, 4 } .= $/. $_ for 1 .. 1_000_000; $hash{$_} .= $/ for keys %hash; # Append a final separator print time; 1069484550 1069484566 # 16 seconds. $n=0; print time; for ( 1 .. 1_000_000 ) { # Ensure no partial matches by prepending and appending # separators to the search key. next unless exists $hash{ substr $_, 0, 4 } and 1+index( $hash{ substr $_, 0, 4 }, $/.$_.$/ ); $n++; } print time, $/, $n; 1069484612 1069484656 # 44 secs. 1000000

      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail

Re: A (memory) poor man's <strike>hash</strike> lookup table.
by TimToady (Parson) on Nov 22, 2003 at 00:07 UTC
    For another approach, look at Tie::SubstrHash. You'd have to recast your problem as keys of constant length though. Still, it'd be interesting to see a timing comparison.

      The timing comes out to 10 minutes for the insertion and 9.6 minutes for iteration of the million keys.

      use Tie::SubstrHash; tie %hash, 'Tie::SubstrHash', 7, 1, 1_000_000; print time; $hash{ pack 'A7', $_ } = ' ' for 1 .. 1_000_000; print time; 1069489311 1069489911 $n = 0; print time; $n += defined $hash{ pack 'A7', $_ } for 1 .. 1_000_000; print time, $ +/, $n; 1069490448 1069491023 1000000

      I think that it may be possible to improve this a little by updating the module. It uses local several places where my would probably be quicker. Though, I suspect that the main performance hit is the cost of tie rather than the implementation. Part of my reasoning for liking the mechanism I described was that it was simple enough in it's implementation that it could be coded directly avoiding the need for tie.

      I also think that it would be possible to improve the user interface by having the module perform any padding required rather than the caller,

      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail

Re: A (memory) poor man's hash
by Anonymous Monk on Nov 21, 2003 at 17:55 UTC

    Maybe I'm being brain-dead here (wouldn't be the first time), but you start off with a hash of a million integer keys, but you end up with something much less useful than a hash. No associated values, and no guarantee of a million unique "keys" unless you use the two-pronged test during insertion. It may or may not function as a memory efficient "set", but it isn't a hash (not even a poor man's hash).

Re: A (memory) poor man's hash
by Paulster2 (Priest) on Nov 21, 2003 at 18:20 UTC

    Two things:


    They are simple to use

    Being the newbie that I am, I still find hashes a little daunting. I understand what they are and basically how they are used, but implimenting them is another story. I have successfully modified others writs and made some of my own simple ones. I guess getting the info in is a lot easier than getting it out. So while others may feel that they are easy to use, I haven't gotten there yet. In other words, I guess that all of this comes down to opinion. I know that hashes are the life blood of PERL so I guess I better start figuring it out.

    Second: While hashes may have a tendancy to soak up a lot of memory, it's not using it for that long. I haven't done the math, but I would bet if you put a dollar amount on the time wasted using other methods vs. the amount of bog that you may incur to other programs at the same time on any given system, you would find you have saved tons of money using hashes. How many clock cycles do you waste waiting on inefficient languages to do their thing?

    Bottom line to my statement is by looking at the big picture, you may be surprised!


    PS: I ++ you anyway, even though I don't agree. Mainly because I found it a stimulating writ.

      I still find hashes a little daunting.

      BrowserUK (if I may presume) was speaking in relative terms. Hashes are quite easy compared to (for example) Perl's advanced data structures, object system, or closures.

      While hashes may have a tendancy to soak up a lot of memory, it's not using it for that long.

      The problem is that perl usually doesn't free the memory back to the OS until the process exits (though it will reuse that memory for other things). This is particularly a problem for mod_perl, where the process will stick around as long as Apache is up (potentially years, if your sysadmin is neglegent about security patches ;).

      I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
      -- Schemer

      : () { :|:& };:

      Note: All code is untested, unless otherwise stated

        ...(though it will reuse that memory for other things). This is particularly a problem for mod_perl, where the process...

        One approach that I always use when a request is going to be quite resource consuming, is to let the child process die after the request is served. You can do this with $r->child_terminate. The extra CPU for forking a new child process is a lot less than the CPU you would lose when the server starts swapping.


        MaxRequestsPerChild... or maybe Apache::SizeLimit

        Not an editor command: Wq
Re: A (memory) poor man's hash
by Anonymous Monk on Nov 21, 2003 at 18:43 UTC
    Where is your benchmark code?
Re: A (memory) poor man's <strike>hash</strike> lookup table.
by demerphq (Chancellor) on Dec 01, 2003 at 13:20 UTC


    Over the weekend I did a little work on this. I loaded your million numbers into a digit trie (a Patricia Trie restricted to the digits 0-9) optimized for reduced memory consumption. The records loaded in around 100 seconds on a 2.2 gighz machine (if temporary space is not an issue this time drops dramatically), and ultimately required less than 5MB to store. I calculated the ratio of bytes used to store the data versus the sum of lengths of the keys and it was 0.62, meaning that the data structure requires 0.62 bytes to store one byte of key. Lookup time in the structure is comparable (or better) than a raw hash (I will admit the implementation of the lookup is done in Inline::C/XS).

    Note that this Trie implementation may not be optimal. Because it stores the keys as digits strings instead of byte encoding the integer and then doing a byte-trie its probably less efficient a trie implemented as the latter. You could probably calculate the storage requirements but I have to admit I cant be bothered, however I will hazard a guesstimate that it would be around 2.5MB. Also this structure would be even faster to do lookups on (the issue of packing the integers as necessary before lookup aside).

    Also, these calculations are _purely_ about storing the keys. Not about storing the key along with a value. Assuming this was required and the values to be stored were long ints you could add 4MB to both numbers (storing them as packed longs). This would mean the digit trie required total 8.5 mb and the byte trie would come in at 6.5MB. (Actually im not bothering to calculate the size of an SV here. Presumably you would need a small amount of ram, say (a very generous) 1k for the object that holds the two char arrays.)

    Which actually brings up a different point. Instead of storing your million numbers in an array you could store them in a packed string. This would mean the total memory requirement would be 4mb, and would still facilitate binsearching. Finger points into the data may speed that search up drammatically.

    The discussion with Judy arrays led me to rereview the subject. Judy arrays are going to be slightly less efficient than a pure byte-tree in terms of storage. I think it would be safe to say that for data packed this densely the minimum bound for memory utilization (without bringing compression into the picture) is going to be the numbers you get from a Trie. However its worth noting that this scenario is realtively unrealistic. With lower density data sets my understanding from what ive read on Judy arrays would suggest they would be much more efficient.

    I wanted to say that I found this thread thought provoking and interesting and I hope you post more along its line. The subject of optimisation techniques is an interesting one that is IMO useful and worth discussing. All the old caveats about optimizing are correct, but the fact comes that once you identified an optimization target having discussion and documentation of ways to handle it can be very useful.

    PS: Its worth noting that conceptually a Trie is essentially a non cpaturing DFA regex engine anchored at the beginning of the string. Its not difficult to extend a trie to handle Kleene closure (*) and one-or-more (+), and only slighlty more difficult to make it handle non anchored (^) searches as well. The advantage of this is that its more efficient for matching many constant strings than the NFA regex engine that perl uses. (Although of course with serious deficiencies.) Ive long wished that perl handled this internally, allowing for a modifier to cause a regex pattern to be treated as a DFA an not as an NFA, which would essentially mean putting a trie engine into perl.

    Its weird that these things are related (who would guess there is a direct family tree relationship between regular expressions and judy arrays!) In fact you could take this point further and argue that really there is a relationship between linked lists (degenrate binary trees) and regular expressions! Woo-hoo! Talk about 6 degrees of seperation. ;-)



      First they ignore you, then they laugh at you, then they fight you, then you win.
      -- Gandhi

    • Update:  
    Bart suggested breaking the paragraphs up, and pointed out some typos. Thanks Bart.


      Hi demerphq.

      Sorry for taking so long to get back to you but I had to find out what Tries and Patricia Tries were before I could understand your post and appreciate the effort that you had expended in researching it. Thanks for taking the idea seriously.

      It turns out that I had actually implemented something similar a few years ago for a Random Testcase generator -- though I didn't know they were called Tries. The API set being tested consisted of around 1000 functions, grouped into several sub-apis with each function having a 3-character prefix (Prf..., Win..., Gpi, Dev... etc. *) designating its group. The verbs and adjectives that made up the rest of the names formed a fairly small set, so by substituting a numeric value for each of these and the prefixes, and using these numeric values to build a tree, the storage requirements and the search/traversal times were hugely reduced. This was back in the days when 8 MB machines were big, 16 MB machines almost unheard of and 20 Mhz CPU's were top of the range. We needed to conserve memory, and use smart algorithms rather than brute force, in order that things would run in a reasonable time.

      *They may be familiar to anyone who programmed OS/2.

      I'll make no claims to be fully conversant with Tries (of any flavour) until I've coded one myself from scratch at least 3 or 4 times, but I think that I've understood enough to draw the following conclusions.

      In order for Tries to be space efficient, it requires that the domain of the keys be fairly fully utilised. My example of asciized integers lends itself to this rather nicely resulting in a uniform distribution of the keys across the entire domain. However, if the characters making up the keys used the full 7 or 8-bit range of ascii characters, then one million keys would likely result in a sparsely populated tree, and the amount of unused space in the tree would probably outweight the used space by a fairly large factor. If the keys are Unicode, the problem gets much worse very quickly.

      For example. If the keys consist of the 96 non-control characters in the 7-bit ascii set, and if we had 1 million, 4 character keys, the domain space would potentially be 84,934,656 giving a utilisation of 1/84th of the total domain space. And that's the problem with any data structure that utilises fixed size tables.

      I realise that most Trie implementations, in common with Judy arrays, utilise various, more or less complicated, compression mechanisms to avoid such wastage, but these impose there own limitations and burdens. Judy arrays use a very complicated 6-level compression mechanism. The bit twiddling, branching and special casing required in these schemes means that you couldn't efficiently implement them at the Perl level. If I've understood them correctly, writing an implementation of Tries such that it was either generalised enough to handle this, or could be configured at runtime by passing a parameter would be extremely difficult.

      The nice thing about the mechanism I outlined is that it is quite easily tailored to different domains, by choosing how many (and which) subset of the keys are used for the initial indexing. In the version I currently have coded, this is an unpack format that is used to break the supplied key into 2 or 3 parts. Using cees insight at Re: A (memory) poor man's <strike>hash</strike> lookup table. of removing the index from the key prior to storage reduces the memory consumption further from my original figures. It also speeds up the linear search part as he predicted. For the example million integers, the tie statement looks like this

      tie my %hash, 'Tie::CompactHash', 'A0 A4 A*';

      The 'A0' part indicates that there is no common prefix that can be factored out. The 'A4' indicates that 4 characters (in this case digits) are used as the primary index spreading the 1 million keys over 10,000 scalars (in a real hash). The final 'A*' indicating that the rest of the key is stored within the strings for linear searching. Also stored in the same string is an index into a second string indicating where the value for this key is stored. The values are stored using the 'n/A*' pack format, storing the length and value of the value. This allows both the keys and values to be of variable length, as is the overall size of the hash. A special number is used as the index to indicate that key has no value.

      For the 1 million 4x 7-bit character keys, it would look like this

      tie my %hash, 'Tie::CompactHash', 'A0 A2 A*';

      Here, again there is no common prefix, the index uses 2 characters mapping the million characters over 9216 (96²) strings, with the other two characters stored in the string along with the index into the values string.

      And a final example based on the post I cited in my original post. The problem involved hashing log filenames, which I speculated are often named along the lines of XYZ20031201.log.

      tie my %logs, 'Tie::CompactHash', 'A3 xx A4 A2 A*';

      In this case,

      1. 'A3' is a prefix common to all the log files. This is extracted from the first key passed and stored internally. For subsequent keys, this value is verified against the stored value, but is otherwise discarded.
      2. 'xx' skips over and discards the century. For most applications, ignoring this won't harm.
      3. 'A4' are the low 2 digits of the year, plus the month. For 10 years of logs, this maps to 120 strings.
      4. 'A2' is the day of month ensuring a maximum of 31 subkeys in each string.
      5. 'A*' is a common postfix which is discarded.

      This assumes that the filenames matched will be preselected (using glob or readdir etc) so that those parts ignored will not be significant. This is done to show the flexibility of the mechanism. Internally, the format passed on the tie is stored and then used to break up the key on input. This is efficient and flexible. For example, the exists function is implementated as

      sub EXISTS { my( $pre, $key, $post ) = unpack $mask{ $_[SELF] }, $_[KEY]; die "Prefix mismatch $pre ne $pre{ $_[SELF] }" unless $pre eq $pre{ $_[SELF] }; return defined $_[SELF]{ $key } and $_[SELF]{ $key }[ 0 ] =~ m[(...\c@)\Q$post\E\xff]s; }

      This use of perls standard hash mechanism to handle the sparse arrays problem and regexes for the final linear search, as both are built-in and heavily optimised means that once invoked, they run at the full speed of a C solution.

      I might try coding it in C or possibly assembler and see how it compares for performance, but the major limitation would still the tie interface. I've also tried coding my perl version using an OO interface to see what if any performance difference that made -- it was much of a muchness -- but the syntax of the interface was much less convenient than the tied version. I wish the tie mechanism performed better. It lends itself to so many useful extensions of perls built-in types, it's a shame the costs are so high.

      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail