
Small Hash a Gateway to Large Hash?

by lsherwood (Beadle)
on Feb 18, 2014 at 00:03 UTC

lsherwood has asked for the wisdom of the Perl Monks concerning the following question:

Can one or more wise monks advise on this issue?

I have a very large hash (over 3 GB) that may be accessed 100,000 times or more during the life of a script (once for each record input to the script). Instead of accessing the large hash so many times, each time I run the script I could build a much smaller hash, with at most a few thousand keys, based on the initial response to the search of the large hash. I could then see if a key exists in the small hash before even referencing the large hash.

Since hash access is normally O(1), this may be a non-starter of an idea. But the large hash does have collisions, and I would like to find a way to speed up my script (no, haven't done any profiling). For what it's worth, the values in the large hash are strings of perhaps 80 characters.

Now if I created a small hash, it would contain only a dozen or so values, so I am concerned about hash collisions. One way to avoid collisions in my small hash would be to make each value an array (probably a reference to an anonymous array) and make the second element of the array a unique value. That seems rather kludgy to me, but I should think it would eliminate hash collisions at the price of significant overhead in creating the smaller hash.

So, to my questions. Is building a small hash based on the results of accessing a large hash likely to help speed up my script? If so, what would be the best way to ensure that the smaller hash has unique values, given that the values of interest are just short ASCII strings that are not at all unique?

A blessing on all monks with insight.

Replies are listed 'Best First'.
Re: Small Hash a Gateway to Large Hash?
by LanX (Archbishop) on Feb 18, 2014 at 00:32 UTC
    You shouldn't need to care about collisions if the keys are random and not pathologically designed to force collisions.

    A hash normally just grows if there are too many entries, so access is still ~ O(1)!

    IMHO your problem is not really speed but size. I once had the problem of a giant hash which was constantly swapping. So each access was limited by the speed of my hard-disk (ugly bottleneck).

    I was able to solve that by transforming it into a HoH and splitting the old key into two halves.

    i.e.  $new{long}{_key} = $old{long_key}

    this worked because I was able to make the algorithm run through sorted keys, i.e. the "lower" hash needed to be loaded into RAM only once, when the corresponding upper key "long" was processed.

    This way Perl only needed to keep two hashes in memory. This is quite scalable...the only limitation is the size of your hard disk then.

    So if you can manage your accesses (reads and writes) so that "long_keys" with the same "start" are bundled, this is my recommendation (within the limited info you gave).

    If order doesn't matter this should be easy!
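
    For illustration, a minimal sketch of that split-key scheme, assuming sorted input and per-prefix bucket files written earlier with Storable; the 4-character prefix rule and the file names are made up:

        use strict;
        use warnings;
        use Storable qw(retrieve);

        my ( $current_prefix, $bucket );

        while ( my $key = <STDIN> ) {          # keys arrive sorted by prefix
            chomp $key;
            my ( $prefix, $suffix ) = $key =~ /^(.{4})(.*)/s;

            if ( !defined $current_prefix or $prefix ne $current_prefix ) {
                # only one "lower" hash lives in RAM at a time
                $bucket         = retrieve("bucket_$prefix.sto");
                $current_prefix = $prefix;
            }
            print "$key => $bucket->{$suffix}\n" if exists $bucket->{$suffix};
        }

    Because the input is sorted, each bucket file is loaded from disk exactly once.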

    HTH! :)

    Cheers Rolf

    ( addicted to the Perl Programming Language)

      Rolf-

      Your solution is intriguing. We don't see evidence of swapping to disk, but it's certainly something to keep in mind.

        • No swapping means your hash is kept in RAM.

        • Noticeable collisions are very unlikely, so access to this hash can't be made much faster.

        • You said the hash is only built once and start-up time is no problem, so writing and rehashing doesn't bite you.

        • You said you are reading and processing many files... that's the most likely place for optimization.

        But without profiling, this is only shooting in the dark.

        Could it be that you are parsing the files and checking if entries match against the hash?

        Then trie-optimized regexes could be very performant... (you might need runs with several regexes to avoid buffer overruns, but 10000 short strings should fit)

        But w/o much info that's again shooting in the dark.
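
        A hedged sketch of that trie-regex idea (the keys, values, and input here are hypothetical): perl 5.10+ automatically trie-optimizes an alternation of literal strings, so one compiled regex can test each input line against thousands of keys at once.

            use strict;
            use warnings;

            my %big = ( '1234567' => 'value A', '7654321' => 'value B' );

            # one alternation of all literal keys; perl builds a trie from it
            my $alternation = join '|', map quotemeta, keys %big;
            my $re          = qr/($alternation)/;

            while ( my $line = <STDIN> ) {
                while ( $line =~ /$re/g ) {
                    print "matched key $1 => $big{$1}\n";
                }
            }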

        Cheers Rolf

        ( addicted to the Perl Programming Language)

Re: Small Hash a Gateway to Large Hash?
by davido (Cardinal) on Feb 18, 2014 at 02:17 UTC

    What is your goal for moving away from accessing the 3GB hash? Is it the obvious, of not holding so much memory hostage, or is it something less obvious such as speed performance? Why is it useful to you to hold smaller hashes in memory? Is it the case that you don't really need immediate random access to the entire 3GB hash at any given moment? What does the script actually do?

    Would it be reasonable to build an SQLite table so that you're not holding it all in memory at once? SQLite is pretty fast. Or how about each hash element being represented, instead, by a MongoDB document?
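
    A minimal sketch of that SQLite route, assuming DBD::SQLite is installed; the database file, table, and column names here are invented:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:SQLite:dbname=bighash.db', '', '',
                                { RaiseError => 1 } );
        $dbh->do('CREATE TABLE IF NOT EXISTS bighash (k TEXT PRIMARY KEY, v TEXT)');

        my $get = $dbh->prepare('SELECT v FROM bighash WHERE k = ?');

        # lookup in place of $big_hash{$key}
        sub lookup {
            my ($key) = @_;
            $get->execute($key);
            my ($val) = $get->fetchrow_array;   # undef when the key is absent
            $get->finish;
            return $val;
        }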


    Dave

      Thanks Dave-

      Interesting suggestion- I've asked before about putting a relational db like Postgres on the machine, and was told "nyet". But this is simply a perl package, and nobody needs to know it's almost an rdb.

      That said, the reason I'm looking into this at all is to make the script run faster. Without any hard data, I'm fairly confident this gigantic hash is a bottleneck.

        You already know that hash lookups have an O(1) order of growth in computational complexity as the hash grows. Collisions aren't that big of a deal; the chains, even for gigantic hashes, are kept pretty small. Rehashing is a more significant problem, but that only takes place as the hash grows, not once it has arrived at its finished state.
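
        (As an aside, if build-time rehashing ever did become a concern, a hash can be presized up front with the lvalue form of keys(); the count below is a made-up example:

            my %h;
            keys(%h) = 3_000_000;   # preallocate buckets so the build never rehashes

        See perldoc -f keys for the details.)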

        Maybe if we knew more about the nature of your "big hash" data set we might find ways of pre-processing or re-structuring it for faster searches. But in the "general" case, it's hard to beat what you've got already.

        One generalized solution would be to use workers: share your hash by creating it first, and then forking worker processes -- possibly one or two per processor core. As long as they never write to the original data, the "copy on write" mechanism won't be engaged, and they will share the data set. Let them keep track of their own tallies, and when they're done, let them report back to the parent script who can reduce the work each worker provided back down to a single final answer.
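
        A minimal sketch of that worker pattern, under stated assumptions: build_big_hash() and records_for() are hypothetical stand-ins, and a real script would report results back over pipes rather than stdout.

            use strict;
            use warnings;

            my %big = build_big_hash();          # build BEFORE forking

            my @pids;
            for my $worker ( 1 .. 4 ) {          # e.g. one or two per core
                my $pid = fork() // die "fork: $!";
                if ( $pid == 0 ) {               # child: read-only access keeps pages shared
                    my $hits = grep { exists $big{$_} } records_for($worker);
                    print "worker $worker: $hits hits\n";
                    exit 0;
                }
                push @pids, $pid;
            }
            waitpid $_, 0 for @pids;             # parent reduces the workers' tallies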

        There may exist less general, but more optimal solutions for your specific data set, but we would need to have a better understanding of that data set before we could include or eliminate such strategies.


        Dave

Re: Small Hash a Gateway to Large Hash?
by robby_dobby (Friar) on Feb 18, 2014 at 06:57 UTC
    Hello lsherwood,
    I have a very large hash (over 3 GB) that may be accessed 100,000 times or more during the life of a script (once for each record input to the script). Instead of accessing the large hash so many times, each time I run the script I could build a much smaller hash, with at most a few thousand keys, based on the initial response to the search of the large hash. I could then see if a key exists in the small hash before even referencing the large hash.

    This is exactly what a Bloom filter is for. Once you have built a sufficiently large hash, performance suffers because every probe has to go to the large hash. A Bloom filter offsets some of that cost by introducing probabilistic lookups. IOW, the cost of checking a key for existence is pretty small compared to performing the actual lookup. Have a look at this old perl.com article, and there are a few CPAN modules available too - namely, Bloom::Filter and Bloom::Faster.
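
    A sketch of fronting the big hash with one of those modules (Bloom::Filter here); the capacity and error rate are illustrative, and load_big_hash() is hypothetical:

        use strict;
        use warnings;
        use Bloom::Filter;

        my %big = load_big_hash();

        my $filter = Bloom::Filter->new( capacity   => scalar keys %big,
                                         error_rate => 0.001 );
        $filter->add( keys %big );

        sub lookup {
            my ($key) = @_;
            return undef unless $filter->check($key);   # a "no" here is definitive
            return $big{$key};    # a "maybe" still requires the real lookup
        }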

    HTH, good luck!

Re: Small Hash a Gateway to Large Hash?
by dsheroh (Prior) on Feb 18, 2014 at 09:36 UTC
    (no, haven't done any profiling)
    Well, there's your problem right there.

    If a fronting hash would make a difference, you're talking about something that's pretty far into the realm of micro-optimizations. And you should never bother with micro-optimizations unless you've profiled first to see that there's a benefit to be gained by micro-optimizing.

    In this particular case, my intuition is that the cost of the preliminary hash lookup will exceed the cost of walking the linked list in the bucket where a collision happened anyhow. So if your idea had any noticeable effect, I suspect it would be to make things slower (and more memory-intensive) rather than faster.

    But, again, profile before spending any more time contemplating it. 100,000 hash lookups probably take less time than you'll spend reading this sentence, even if they are lookups into a 3GB hash. Even if you could magically optimize that down to zero time, it won't noticeably impact the overall program's performance.
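
    For what it's worth, getting that profile is a two-command job with Devel::NYTProf (named here as the usual choice, not something the poster specified; the script name is a placeholder):

        perl -d:NYTProf yourscript.pl    # run once under the profiler
        nytprofhtml --open               # render nytprof.out as an HTML report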

Re: Small Hash a Gateway to Large Hash?
by oiskuu (Hermit) on Feb 18, 2014 at 08:24 UTC

    What you are thinking of is commonly called a cache. :-)

    Caches make (a lot of) sense, whenever the data has temporal and/or spatial locality. Which is the typical case.

    Now the thing is, your computer already employs caching at multiple levels. An additional caching scheme may still bring benefits if it makes your working set more compact. But there may be opportunities to arrange your large data in a way that improves locality. It depends on the problem. Also, many algorithms have their "cache oblivious" versions.

      Yes, it is a cache.

      And you guys are, to use a trite, overworked word

      AWESOME!

Re: Small Hash a Gateway to Large Hash?
by Anonymous Monk on Feb 18, 2014 at 08:19 UTC
    Memory footprint of your hash doesn't matter so long as it fits in memory. Once it fits in memory it is unlikely that a fronting hash will make any difference except to slow things down. It isn't clear what you mean by collisions, but if you mean two keys ending up in the same bucket of the hash, then you should just let Perl deal with it. The buckets of a hash are a linked list, likely short, and perl will be relatively efficient in traversing that list.
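
    (To see that for yourself: on Perls of this era, a hash in scalar context reports used/allocated buckets, so you can check how evenly your keys spread out.

        my %h = map { $_ => 1 } 1 .. 10_000;
        print scalar(%h), "\n";   # prints something like "7481/16384"

    Short chains show up as a used-bucket count close to the key count.)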
      Thanks, 'doc,

      Yes, that is what I meant (multiple keys in one "bucket"). I'm inclined to think you are correct- most likely I'd harm, not enhance, performance.

Re: Small Hash a Gateway to Large Hash?
by BrowserUk (Pope) on Feb 18, 2014 at 08:25 UTC

    Can you give a couple of examples of typical keys and values; and the total number of keys in the hash?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      A lab version of the hash has nearly 150,000 keys; a full production version may have three million or so.

      Keys are simply short strings of integers. Values also are short strings (50 or so bytes) containing timezone and some geoloc information.

        Keys are simply short strings of integers.

        Hm. Does that mean each key is a single short integer, or each key contains several integers?

        Here's why I ask. The following build two hashes with 1 million key/value pairs.

        1. In this one the keys are strings containing 3 x 7-digit integers, and the values are 10 x 7-digit integers delimited by spaces:
          C:\test>p1
          Perl> # 5.6MB ;;
          Perl> $h{ "$_ " x 3 } = "$_ " x 10 for 1e6 .. 2e6;;
          Perl> # 299MB ;;

          1 million key/value pairs: 300MB

        2. Same thing, but packing the integers to 4-byte binary values:
          C:\test>p1
          Perl> # 5.6MB ;;
          Perl> $h{ pack 'V3', ($_) x 3 } = pack 'V10', ($_) x 10 for 1e6 .. 2e6;;
          Perl> # 237MB ;;

          Voila! You've stored the same information and saved 1/5th of the space.

        If the numbers can be stored as shorts or bytes, you can save even more.
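
        For instance, hypothetical variants of the same one-liner, assuming every integer fits the narrower type ('v' packs 16-bit shorts, 'C' packs single bytes):

            $h{ pack 'v3', ($_) x 3 } = pack 'v10', ($_) x 10 for 1 .. 65535;  # shorts
            $h{ pack 'C3', ($_) x 3 } = pack 'C10', ($_) x 10 for 1 .. 255;    # bytes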

        I'm also having trouble reconciling your OP's mention of a 3GB hash with the key/value sizes and total numbers you describe above.

        This creates a 3 million key/value pair hash with 21-byte keys and 80-byte values and the total memory requirement is <1GB:

        Perl> $h{ "$_ " x 3 } = "$_ " x 10 for 1e6 .. 4e6;;
        Perl> # 912MB ;;

Re: Small Hash a Gateway to Large Hash?
by sundialsvc4 (Abbot) on Feb 18, 2014 at 14:36 UTC

    I am assuming that you consider there to be a near-100% chance of hash-cache “hits”... that every one of the 100,000 keys you plan to search for probably will be found in the hash. Therefore, a Bloom filter in this case probably will not tell you anything that you do not already know.

    What concerns / interests me with regard to this particular scenario is that there appear to be something like 37.5 million keys in this hash, of which you plan to retrieve only about 100,000. Therefore, it seems to me that you will spend far more time and resources loading this hash into memory than you will spend retrieving values from it. If you are dealing with a time-sensitive problem on a 64-bit(!) computer with copious RAM, such that a 5-or-6 GB working-set size can be built and sustained for your process by the operating system for an indefinite period of time, then you will pay a human-noticeable cost (in terms of virtual-memory page faults) while this very large data structure is constructed in memory, but you will find that lookups are instantaneous whether or not collisions occur, because you will incur no additional page-fault price if and when they do. (You already paid the piper in full while loading it.)

    However, to my mind, this is a fairly expensive scenario that, I suggest, ought to be empirically tested on the hardware involved. What if, say, an SQLite database file (opened in read-only mode, also using transactions, etc.) would in fact perform almost as well, or better(?!), in terms of “total runtime start-to-finish”? It very much depends on which part of the process is where a price can be paid, and whether or not, on your particular hardware, you can in fact pay it. A 32-bit system, for example, usually provides only a 2GB theoretical user-land address-space to any process that it runs . . .

    I suggest that you don’t fall completely in love with this hash, or hash-of-hashes, idea until you actually and systematically measure it. (And, by saying that, I am not “insultingly suggesting” that you did not!)

      Thanks for some interesting insight.

      I'm not insulted- merely impressed that you figured out I haven't systematically measured a thing, except by staring at my watch. Hey, I inherited this system!

      You are correct also in that loading this hash is a pig, but that happens only once, generally when users are not waiting, so I'm not greatly concerned about it. However, various files are then read and checked against the hash, and that is what would be nice to speed up.

      I am intrigued by your statement that I will incur no lookup penalty whether or not hash collisions occur. That's contrary to my understanding, and I suppose that is something I should understand.

Re: Small Hash a Gateway to Large Hash?
by scorpio17 (Canon) on Feb 18, 2014 at 14:33 UTC
    A hash is like a simple in-memory database. If your data becomes too large to fit into memory, you should probably consider switching to a real database (sqlite, mysql, postgres, etc.).
