|Perl: the Markov chain saw|
Hash lookups, Database lookups, and Scalabilityby davido (Cardinal)
|on Oct 31, 2004 at 07:30 UTC||Need Help??|
A day or two ago there was a question posed to Seekers of Perl Wisdom regarding a situation where the OP needed to be able to retain the functionality of tandom lookup hashes, but in an application where the dataset was sufficiently large that tandem lookup hashes were too big to fit in memory.
In this case, when I refer to tandem lookup hashes, what I mean is that the values of one hash are the same as the keys of the other hash, and vice versa. In other words, if I wanted to be able to look up a person by name to determine his occupation, or by occupation to determine his name, I might use two hashes with mirror (opposite) images of each other.
Like a dummy I didn't, at first, see the part of the OP's question where he stated that for one reason or another, a database wasn't a plausable solution. Thus, I forged ahead with the suggestion that he use a database (like I said, ...dummy).
My logic at the time was that a two-column database, with both columns uniquely indexed, would be just about the quickest solution for cross lookups that would still scale well when the dataset grew to a size that couldn't be held in a pair of hashes in memory.
Obviously the database solution is not going to be as efficient as a pair of crossreferencing tandem hashes. But I started wondering just how bad things could get. ...so I set out to create a decent benchmark.
My benchmark script starts by creating 999 word pairs based on unique combinations of the top 999 most commonly used words in the English language. This happened to be a word list that I had sitting on my hard drive, and I find it kind of handy for this sort of tinkering. I've placed the word list at http://davido.perlmonk.org/monastery/t1000f.txt. The script uses LWP::Simple to grab this document and process it.
The script processes the word list by turning it into a bunch of word pairs. The word pairs are then inserted into a database (first-run only. If the DB already exists, this step is skipped). The database has both of its columns indexed for best lookup performance. It is a simple SQLite database.
Next, the script generates a forward and reverse lookup pair of hashes. It reports the size in bytes of these hashes of 999 word/word pairs, just for frame of reference.
Then, the benchmark script tests how fast it can choose at random a key from the %forward hash, and grab its value (repeat this for a few seconds). It then tests how fast it can do a database lookup based on a random key (repeat this test for a few seconds too).
And finally, the comparisons are reported. Here is the output:
As you can see, the hash lookup is 4579% faster than the database lookup. (That's significant.) But next we should test scalability. That's a lot more interesting than comparing a hash to a database.
For that test, I just cut the word list in half. The half-size version is at http://davido.perlmonk.org/monastery/t500f.txt. Commenting out a couple of lines, and uncommenting two others is all it takes to switch the benchmark to use the shorter word list. And here are the results:
We already know that hash lookups are essentially O(1) (no growth as dataset grows). By comparing the two database benchmarks to each other, however, we can learn a little about the order of growth of indexed database lookups as the dataset grows. As we cut the dataset from 999 elements to 499 elements, we see a 5.9% increase in lookup speed. That seems pretty small, but I don't know enough about the SQLite database engine to know what the big-O order of growth of lookup times is as datasets grow. ...perhaps someone could fill in the blank there.
Here is the code used for the benchmarking. Feel free to run it. On first run you'll see a warning as it tests for database existance. Don't worry about that, the warning is just a noisy part of table detection. The script requires DBD::SQLite, DBI, Devel::Size, LWP::Simple, and Benchmark. Enjoy!
In retrospect I wish I had written this in a way that made it easier to compare the small DB lookups to the large-DB lookups, but I didn't at first think about doing that comparison. Note: The script is set to test the 499-element word set. For the 999-word set, just change which declaration lines are commented out at the top of the script. This is a pretty quick-n-dirty script. It's far from elegant. ;) Have fun.