Scaling Hash Limits

by Endless (Beadle)
on Sep 19, 2013 at 10:11 UTC ( [id://1054822] )

Endless has asked for the wisdom of the Perl Monks concerning the following question:

I'm wondering how big I can make a hash before it ceases to be effective; are there any tips or guides on this point? Obviously, the answer may be machine-specific with regard to physical resources, but general guidance would still be useful. In my application, my simple hash of scalars (used for duplicate detection) reaches a size of over 55 million keys, at which point my system seems to be dying of asphyxiation. Are there any heuristic watermarks for when a hash may stop being effective in Perl?
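
For concreteness, a check of that shape is presumably something along the lines of the classic %seen idiom sketched below; the one-id-per-line input handling is an assumption, not the actual code:

    use strict;
    use warnings;

    my %seen;            # one hash key per id ever seen
    my $dups = 0;

    while ( my $line = <STDIN> ) {
        my ($id) = $line =~ /(\d+)/ or next;   # assumed: one id per input line
        $dups++ if $seen{$id}++;               # key already present => duplicate
    }

    print "duplicates: $dups, unique ids: ", scalar keys %seen, "\n";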

Replies are listed 'Best First'.
Re: Scaling Hash Limits
by Laurent_R (Canon) on Sep 19, 2013 at 11:00 UTC

    The short answer is that access to a hash element does not (or almost does not) depend on the number of elements in the hash (it is said to be O(1)), so long as you don't exhaust your available memory.

    One qualification, however: there can be cases where a hash becomes pathological, but these are extremely rare (if they happen at all) with the current implementation of hashes in Perl. For practical purposes, the limit really is the quantity of available RAM, taking into account the fact that the hash structure takes quite a bit more memory than the sheer amount of data you store in it.
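
    To get a feel for that overhead, something like the sketch below can be used. It relies on Devel::Size from CPAN (not a core module), builds a one-million-key sample hash, and compares its footprint with the raw size of the keys; the exact numbers depend on your perl version and build, so treat it as a measurement tool rather than a prediction.

        use strict;
        use warnings;
        use Devel::Size qw(total_size);    # CPAN module, not in core

        # Build a sample hash of one million 12-digit keys and compare its
        # memory footprint with the raw size of the key data alone.
        my %seen;
        $seen{ 100_000_000_000 + $_ } = 1 for 1 .. 1_000_000;

        my $hash_bytes = total_size( \%seen );
        my $raw_bytes  = 12 * 1_000_000;   # 12 characters per id

        printf "hash: %.1f MB, raw key data: %.1f MB\n",
            $hash_bytes / 2**20, $raw_bytes / 2**20;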

Re: Scaling Hash Limits
by CountZero (Bishop) on Sep 19, 2013 at 12:05 UTC
    55 million items in your hash? One wonders where all this data is coming from. It will take quite a while just to read in that many items.

    If you have to redo the duplicate check a few times (say, every time new data is added to the set), perhaps now is the moment to store the data in a database and use the database engine as your duplicate detector.
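
    A hedged sketch of that approach, assuming DBI with DBD::SQLite (any engine with a unique index would do): make the id column the primary key and let INSERT OR IGNORE do the duplicate detection on disk.

        use strict;
        use warnings;
        use DBI;

        # Sketch only: DBD::SQLite is an assumption; the point is that the
        # unique index, not an in-memory hash, rejects the duplicates.
        my $dbh = DBI->connect( 'dbi:SQLite:dbname=tweets.db', '', '',
            { RaiseError => 1, AutoCommit => 0 } );

        $dbh->do('CREATE TABLE IF NOT EXISTS seen ( id INTEGER PRIMARY KEY )');

        my $sth = $dbh->prepare('INSERT OR IGNORE INTO seen (id) VALUES (?)');

        while ( my $line = <STDIN> ) {
            my ($id) = $line =~ /(\d{12})/ or next;   # assumed input format
            $sth->execute($id);
        }
        $dbh->commit;

        my ($unique) = $dbh->selectrow_array('SELECT COUNT(*) FROM seen');
        print "unique ids: $unique\n";
        $dbh->disconnect;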

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
      I'm going through a stock of twitter messages. There are around 180 million total, and I want to make sure I don't have duplicates (based on 12-digit id numbers).
        12-digit id numbers

        Do you know the approximate lowest and highest numbers?

        If so, you can reduce the memory requirement for de-duping your 180 million ids from ~ 4 gigabytes to ~ 8 megabytes.

        And make it much faster also.
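
        (The saving presumably comes from a bit vector: with a known range you spend one bit per possible id instead of a full hash entry per seen id. A minimal sketch with vec() follows; the range boundaries are placeholders, not values from this thread.)

            use strict;
            use warnings;

            # Placeholders -- the real boundaries have to come from the data.
            my $MIN_ID = 100_000_000_000;
            my $MAX_ID = 100_067_108_863;     # a span of 2**26 ids => 8 MB of bits

            my $bits = '';
            vec( $bits, $MAX_ID - $MIN_ID, 1 ) = 0;   # pre-extend to (range/8) bytes

            my $dups = 0;
            while ( my $line = <STDIN> ) {
                my ($id) = $line =~ /(\d{12})/ or next;
                next if $id < $MIN_ID || $id > $MAX_ID;
                my $offset = $id - $MIN_ID;
                $dups++ if vec( $bits, $offset, 1 );   # bit already set => duplicate
                vec( $bits, $offset, 1 ) = 1;
            }
            print "duplicates: $dups\n";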


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        Database.

Re: Scaling Hash Limits
by sundialsvc4 (Abbot) on Sep 19, 2013 at 13:42 UTC

    Leveraging CountZero’s comment especially here: “55 million records” definitely need to be indexed to be useful, but an in-memory hash table might not be the best way to do it, if only because it obliges the entire data structure to be “in memory,” with all of the potential (and rather unpredictable) impact of virtual-memory paging should the system become memory-constrained. (Also, and mostly for this reason, it does not scale well.)

    Yes, a hash table, or tables, probably is the best “in-memory” choice. The design question is ... is “in-memory” the best choice? I suggest that it might not be.

    If you put these data into any sort of database, the records are, first of all, stored on disk, which has unlimited capacity. And yet, the data are indexed ... also on disk. The database can locate the records of interest and then bring those into memory for further processing as needed. If the database grows to 100 or 1,000 times this number of records ... well, it will just take a little longer, but it won’t fail. You want to design systems that do not degrade severely (and unexpectedly) as volume increases. “In-memory” approaches are notorious for “hit the wall and go splat” when memory runs short and paging kicks in. So you either need to establish confidence that this won’t happen to you in production at worst-case loads, or design in some other way to mitigate that business risk.
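
    One way to get "indexed, but living on disk" without a full RDBMS is a hash tied to a Berkeley DB file. The sketch below is only an illustration of that idea, not something proposed in the thread; DB_File ships with perl but needs the Berkeley DB library installed on the system.

        use strict;
        use warnings;
        use Fcntl qw(O_CREAT O_RDWR);
        use DB_File;                # needs the Berkeley DB library on the system

        # The hash lives in an on-disk B-tree; only hot pages stay in memory,
        # so it keeps working (just more slowly) when the data outgrows RAM.
        tie my %seen, 'DB_File', 'seen.db', O_CREAT | O_RDWR, 0666, $DB_BTREE
            or die "Cannot tie seen.db: $!";

        my $dups = 0;
        while ( my $line = <STDIN> ) {
            my ($id) = $line =~ /(\d{12})/ or next;   # assumed input format
            $dups++ if exists $seen{$id};
            $seen{$id} = 1;
        }
        untie %seen;
        print "duplicates: $dups\n";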

      Where can I buy these unlimited capacity disks?

        Where can I buy these unlimited capacity disks?

        Well, not unlimited, but what about 50 petabytes? Apart from some custom cases and standard PSUs with custom wiring, you can get all of that at your nearest computer parts store. Don't forget to buy some racks, too. ;-)

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
