Re: Store large hashes more efficiently

by BrowserUk (Patriarch)
on Feb 12, 2013 at 04:32 UTC ( [id://1018289] )


in reply to Store large hashes more efficiently

How many do you need to store?


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: Store large hashes more efficiently
by puterboy (Scribe) on Feb 12, 2013 at 04:43 UTC
    O(10 million)

      Using a standard hash and pack like so, it requires 1.12GB to store the 10e6 key/value pairs:

      $h{ pack 'V', $uintKey } = pack 'H*', $hexValue32byte;

      Workable on most modern systems. How much smaller are you looking for?
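
      A minimal sketch (not from the original post) of how a figure like that might be reproduced: build the packed hash and then read the process's memory use externally (top, Task Manager, /proc/self/status). The loop bound and the synthetic hex values here are assumptions for illustration.

      use strict;
      use warnings;

      my %h;
      for my $uintKey ( 1 .. 10_000_000 ) {
          # 32 hex chars per value, packed down to 16 bytes
          my $hexValue32byte = sprintf '%08x' x 4, ($uintKey) x 4;
          $h{ pack 'V', $uintKey } = pack 'H*', $hexValue32byte;
      }
      print scalar( keys %h ), " entries built; inspect process memory now\n";
      <STDIN>;    # pause so memory usage can be read off externally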


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        I probably need to use 'Q' rather than 'V', in case the unsigned integer index exceeds 4 billion (or so) -- remember the entries are sparse.
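
        A quick sketch (not from the original reply) of the practical difference between the two formats: 'V' is a 32-bit little-endian unsigned integer (4 bytes) that keeps only the low 32 bits of larger values, while 'Q' is a 64-bit unsigned integer (8 bytes) and requires a Perl built with 64-bit integer support.

        use strict;
        use warnings;

        printf "V packs to %d bytes\n", length pack 'V', 4_000_000_000;   # 4
        printf "Q packs to %d bytes\n", length pack 'Q', 5_000_000_000;   # 8

        # 'V' keeps only the low 32 bits, so a key above ~4.29e9 no longer round-trips:
        printf "5e9 via V: %d\n", unpack 'V', pack 'V', 5_000_000_000;    # 705032704
        printf "5e9 via Q: %d\n", unpack 'Q', pack 'Q', 5_000_000_000;    # 5000000000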

        Now for 1 million entries, I still get ~134 bytes per entry using 'V' and ~138 bytes/entry using 'Q' (which makes sense, since going from 'V' to 'Q' adds 4 bytes per key). Without any packing of the hash value, I get 186 bytes/entry; the extra 48 bytes/entry again makes sense, since packing the 32-char hex string takes its storage from 64 bytes down to 16.
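
        A sketch of how those per-entry figures could be compared head-to-head, assuming (it is not stated in the original reply) that Devel::Size is doing the measuring; exact numbers will vary with the Perl build.

        use strict;
        use warnings;
        use Devel::Size qw(total_size);

        my $n = 1_000_000;
        my ( %v_packed, %q_packed, %unpacked );
        for my $i ( 1 .. $n ) {
            my $hex = sprintf '%08x' x 4, ($i) x 4;          # 32 hex chars
            $v_packed{ pack 'V', $i } = pack 'H*', $hex;     # 4-byte key, 16-byte value
            $q_packed{ pack 'Q', $i } = pack 'H*', $hex;     # 8-byte key, 16-byte value
            $unpacked{ pack 'V', $i } = $hex;                # 4-byte key, 32-char string value
        }
        printf "'V' key, packed value : %.0f bytes/entry\n", total_size( \%v_packed ) / $n;
        printf "'Q' key, packed value : %.0f bytes/entry\n", total_size( \%q_packed ) / $n;
        printf "'V' key, hex value    : %.0f bytes/entry\n", total_size( \%unpacked ) / $n;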

        So, I guess that is as good as I can do with packing. What surprises me is the overhead of the hash itself: it takes 134 bytes/entry to store 4 bytes of index and 16 bytes of value, an overhead of 114 bytes/entry.

        However, looking at how Perl seems to store a plain scalar variable, it appears to take 48 bytes of overhead plus the size of the data itself. Looking at 'size' vs. 'total_size' for the hash, 'size' alone (which doesn't include the space for the hash values) comes to 74 bytes/entry; after subtracting 4 bytes for the packed index and 48 bytes of scalar overhead, that implies Perl is using an additional 22 bytes per entry to perform the magic of hash storage and to point to the stored value. 'total_size' then adds another 64 bytes/entry, which is explained by 48 bytes of scalar overhead plus the 16 bytes required to pack the 32-char hex string.
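
        To make the 'size' vs. 'total_size' distinction concrete, a small probe along these lines (not from the original reply; a one-entry hash carries fixed bucket overhead, so the absolute numbers differ from the 1-million-entry case, but the breakdown is the same kind): 'size' covers the keys and the hash structure itself, while 'total_size' also follows the value scalars.

        use strict;
        use warnings;
        use Devel::Size qw(size total_size);

        my $value = pack 'H*', 'deadbeefdeadbeefdeadbeefdeadbeef';   # 16-byte packed value
        my %h     = ( pack( 'V', 42 ) => $value );

        printf "bare packed scalar : %d bytes\n", size( \$value );
        printf "hash, size()       : %d bytes\n", size( \%h );        # keys + hash structure
        printf "hash, total_size() : %d bytes\n", total_size( \%h );  # ... plus the value scalars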

        So, I guess the surprising thing to me is paying that 48-byte scalar overhead twice: once for the scalar index and again for the scalar value. Those 96 bytes (plus the 22 bytes for the hash magic and pointer) really dwarf the packed index + value itself (4 + 16 = 20 bytes). But if my understanding of the 48-byte overhead per scalar is correct, then presumably there is no way to store a large hash more efficiently (and since the indices are very sparse, I can't use an array or other contiguous storage). Am I correct?
