|No such thing as a small change|
Sorry for taking so long to get back to you but I had to find out what Tries and Patricia Tries were before I could understand your post and appreciate the effort that you had expended in researching it. Thanks for taking the idea seriously.
It turns out that I had actually implemented something similar a few years ago for a Random Testcase generator -- though I didn't know they were called Tries. The API set being tested consisted of around 1000 functions, grouped into several sub-apis with each function having a 3-character prefix (Prf..., Win..., Gpi, Dev... etc. *) designating its group. The verbs and adjectives that made up the rest of the names formed a fairly small set, so by substituting a numeric value for each of these and the prefixes, and using these numeric values to build a tree, the storage requirements and the search/traversal times were hugely reduced. This was back in the days when 8 MB machines were big, 16 MB machines almost unheard of and 20 Mhz CPU's were top of the range. We needed to conserve memory, and use smart algorithms rather than brute force, in order that things would run in a reasonable time.
*They may be familiar to anyone who programmed OS/2.
I'll make no claims to be fully conversant with Tries (of any flavour) until I've coded one myself from scratch at least 3 or 4 times, but I think that I've understood enough to draw the following conclusions.
In order for Tries to be space efficient, it requires that the domain of the keys be fairly fully utilised. My example of asciized integers lends itself to this rather nicely resulting in a uniform distribution of the keys across the entire domain. However, if the characters making up the keys used the full 7 or 8-bit range of ascii characters, then one million keys would likely result in a sparsely populated tree, and the amount of unused space in the tree would probably outweight the used space by a fairly large factor. If the keys are Unicode, the problem gets much worse very quickly.
For example. If the keys consist of the 96 non-control characters in the 7-bit ascii set, and if we had 1 million, 4 character keys, the domain space would potentially be 84,934,656 giving a utilisation of 1/84th of the total domain space. And that's the problem with any data structure that utilises fixed size tables.
I realise that most Trie implementations, in common with Judy arrays, utilise various, more or less complicated, compression mechanisms to avoid such wastage, but these impose there own limitations and burdens. Judy arrays use a very complicated 6-level compression mechanism. The bit twiddling, branching and special casing required in these schemes means that you couldn't efficiently implement them at the Perl level. If I've understood them correctly, writing an implementation of Tries such that it was either generalised enough to handle this, or could be configured at runtime by passing a parameter would be extremely difficult.
The nice thing about the mechanism I outlined is that it is quite easily tailored to different domains, by choosing how many (and which) subset of the keys are used for the initial indexing. In the version I currently have coded, this is an unpack format that is used to break the supplied key into 2 or 3 parts. Using cees insight at Re: A (memory) poor man's <strike>hash</strike> lookup table. of removing the index from the key prior to storage reduces the memory consumption further from my original figures. It also speeds up the linear search part as he predicted. For the example million integers, the tie statement looks like this
The 'A0' part indicates that there is no common prefix that can be factored out. The 'A4' indicates that 4 characters (in this case digits) are used as the primary index spreading the 1 million keys over 10,000 scalars (in a real hash). The final 'A*' indicating that the rest of the key is stored within the strings for linear searching. Also stored in the same string is an index into a second string indicating where the value for this key is stored. The values are stored using the 'n/A*' pack format, storing the length and value of the value. This allows both the keys and values to be of variable length, as is the overall size of the hash. A special number is used as the index to indicate that key has no value.
For the 1 million 4x 7-bit character keys, it would look like this
Here, again there is no common prefix, the index uses 2 characters mapping the million characters over 9216 (96²) strings, with the other two characters stored in the string along with the index into the values string.
And a final example based on the post I cited in my original post. The problem involved hashing log filenames, which I speculated are often named along the lines of XYZ20031201.log.
In this case,
This assumes that the filenames matched will be preselected (using glob or readdir etc) so that those parts ignored will not be significant. This is done to show the flexibility of the mechanism. Internally, the format passed on the tie is stored and then used to break up the key on input. This is efficient and flexible. For example, the exists function is implementated as
This use of perls standard hash mechanism to handle the sparse arrays problem and regexes for the final linear search, as both are built-in and heavily optimised means that once invoked, they run at the full speed of a C solution.
I might try coding it in C or possibly assembler and see how it compares for performance, but the major limitation would still the tie interface. I've also tried coding my perl version using an OO interface to see what if any performance difference that made -- it was much of a muchness -- but the syntax of the interface was much less convenient than the tied version. I wish the tie mechanism performed better. It lends itself to so many useful extensions of perls built-in types, it's a shame the costs are so high.