Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot

Re: Small Hash a Gateway to Large Hash?

by sundialsvc4 (Abbot)
on Feb 18, 2014 at 14:36 UTC ( #1075344=note: print w/replies, xml ) Need Help??

in reply to Small Hash a Gateway to Large Hash?

I am assuming that you consider that there is a near-100% chance of hash-cache “hits” ... that every one of the 100,000 keys that you plan to search for, probably will be found in the hash.   Therefore, a Bloom filter in this case probably will not tell you anything that you do not already know.

What concerns / interests me with regard to this particular scenario is that there appear to be something like 37.5 million keys in this hash, of which you plan to retrieve only about 100,000.   Therefore, it seems to me that you will spend far more time and resources loading this hash into memory, than you will spend retrieving values from it.   If you are dealing with a time-sensitive problem on a 64-bit(!) computer with copious RAM, such that a 3 5-or-6 GB working-set size can be built and sustained for your process by the operating system for an indefinite period of time, then you will pay a human-noticeable cost (in terms of virtual-memory page faults) while this very large data structure is constructed in memory, but you will find that lookups are instantaneous whether or not collisions occur, because you will be incurring no additional page-fault price if and when they do.   (You already fully-paid the piper while loading it.)

However, to my mind, this is a fairly expensive scenario that, I suggest, ought to be empirically tested on the hardware involved.   What if, say, an SQLite database file (opened in read-only mode, also using transactions, etc.) would, in fact, perform almost as well, or better(?!), in terms of “total runtime start-to-finish?” It very much depends on which part of the process is where a price can be paid, and whether or not, on your particular hardware, you can in fact pay it.   A 32-bit system, for example, usually provides only a 2GB total user-land theoretical address-space to any process that it runs . . .

I suggest that you “don’t fall completely in love with this hash, or hash-of-hashes, idea, until you actually and systematically measure it.   (And, by saying that, I am not “insultingly suggesting” that you did not!)

  • Comment on Re: Small Hash a Gateway to Large Hash?

Replies are listed 'Best First'.
Re^2: Small Hash a Gateway to Large Hash?
by lsherwood (Sexton) on Feb 18, 2014 at 21:46 UTC
    Thanks for some interesting insight.

    I'm not insulted- merely impressed that you figured out I haven't systematically measured a thing, except by staring at my watch. Hey, I inherited this system!

    You are correct also in that loading this hash is a pig: but that happens only once, generally when users are not waiting, so I'm not greatly concerned about it. However, then various files are read and checked against the hash, and that would be nice to speed up.

    I am intrigued by your statement that I will incur no lookup penalty whether or not hash collisions occur. That's contrary to my understanding, and I suppose that is something I should understand.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1075344]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (10)
As of 2018-06-22 08:07 GMT
Find Nodes?
    Voting Booth?
    Should cpanminus be part of the standard Perl release?

    Results (122 votes). Check out past polls.