http://www.perlmonks.org?node_id=1075344


in reply to Small Hash a Gateway to Large Hash?

I am assuming that you consider that there is a near-100% chance of hash-cache “hits” ... that every one of the 100,000 keys that you plan to search for, probably will be found in the hash.   Therefore, a Bloom filter in this case probably will not tell you anything that you do not already know.

What concerns / interests me with regard to this particular scenario is that there appear to be something like 37.5 million keys in this hash, of which you plan to retrieve only about 100,000.   Therefore, it seems to me that you will spend far more time and resources loading this hash into memory, than you will spend retrieving values from it.   If you are dealing with a time-sensitive problem on a 64-bit(!) computer with copious RAM, such that a 3 5-or-6 GB working-set size can be built and sustained for your process by the operating system for an indefinite period of time, then you will pay a human-noticeable cost (in terms of virtual-memory page faults) while this very large data structure is constructed in memory, but you will find that lookups are instantaneous whether or not collisions occur, because you will be incurring no additional page-fault price if and when they do.   (You already fully-paid the piper while loading it.)

However, to my mind, this is a fairly expensive scenario that, I suggest, ought to be empirically tested on the hardware involved.   What if, say, an SQLite database file (opened in read-only mode, also using transactions, etc.) would, in fact, perform almost as well, or better(?!), in terms of “total runtime start-to-finish?” It very much depends on which part of the process is where a price can be paid, and whether or not, on your particular hardware, you can in fact pay it.   A 32-bit system, for example, usually provides only a 2GB total user-land theoretical address-space to any process that it runs . . .

I suggest that you “don’t fall completely in love with this hash, or hash-of-hashes, idea, until you actually and systematically measure it.   (And, by saying that, I am not “insultingly suggesting” that you did not!)