PerlMonks
Store large hashes more efficiently

by puterboy (Scribe)
on Feb 12, 2013 at 03:55 UTC ( #1018287=perlquestion )
puterboy has asked for the wisdom of the Perl Monks concerning the following question:

I am using a large hash (millions of entries) as a cache.

The keys are (sparsely spaced) unsigned integers.
The values are a 32-hex-character string followed by an optional unsigned integer.

Storing the hash normally as $hash{<u_integer>}=<32 hex> takes about 190 bytes per entry (as determined by using Devel::Size).

However, knowing the format, it seems like I should be able to pack the 32 hex characters into 16 bytes. Similarly, the optional unsigned-integer string should pack more compactly than a character string.

Perhaps, I could even save on the key storage, knowing that they are unsigned integers.

I am looking for better "packing", not any fancy compression scheme. Note I tried various combinations of pack, such as pack("H32l", <32hex>, <uint>), but that only got me about a 25% saving. There must be a better way of packing (assuming I am willing to sacrifice a little speed). I mean, if the key is on the order of 4 bytes and the value on the order of 16-20 bytes, I would think I could do better than taking 150-200 bytes, which is almost 90% overhead. Or maybe hashes are by necessity that inefficient...

Any suggestions?
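For reference, a minimal sketch of the measurement being discussed, assuming Devel::Size is installed; the key spacing and values below are invented for illustration:

```perl
use strict;
use warnings;
use Devel::Size qw(total_size);

# Illustrative data: sparse unsigned-integer keys, 32-hex-char values.
my (%plain, %packed);
for my $i (1 .. 10_000) {
    my $key = $i * 997;                    # sparse keys
    my $hex = sprintf '%032x', $i * 13;    # 32 hex chars
    $plain{$key}              = $hex;                 # plain storage
    $packed{ pack 'V', $key } = pack 'H32', $hex;     # 4-byte key, 16-byte value
}
printf "plain:  %.0f bytes/entry\n", total_size(\%plain)  / 10_000;
printf "packed: %.0f bytes/entry\n", total_size(\%packed) / 10_000;
```

pack 'H32' halves the value to 16 bytes and pack 'V' shrinks the key to 4, but the fixed per-entry overhead of the hash itself (buckets, SV headers) is unaffected, which is why the overall saving stays modest.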

Re: Store large hashes more efficiently
by BrowserUk (Pope) on Feb 12, 2013 at 04:32 UTC

    How many do you need to store?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      o(10 million)

        Using a standard hash and pack like so, it requires 1.12GB to store the 10e6 key/value pairs:

        $h{ pack 'V', $uintKey } = pack 'H*', $32byteHexValue;

        Workable on most modern systems. How much smaller are you looking for?
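For anyone following along, the snippet round-trips like this ($32byteHexValue above is illustrative rather than a legal Perl identifier, so a plain name is used here; this is just a sketch):

```perl
use strict;
use warnings;

my %h;
my $uintKey  = 123_456_789;
my $hexValue = 'deadbeef' x 4;          # 32 hex chars

# Store: 4-byte packed key, 16-byte packed value.
$h{ pack 'V', $uintKey } = pack 'H*', $hexValue;

# Fetch: pack the key the same way, unpack the value back to hex.
my $got = unpack 'H*', $h{ pack 'V', $uintKey };
print "$got\n";   # deadbeef repeated four times
```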

Re: Store large hashes more efficiently
by Tux (Monsignor) on Feb 12, 2013 at 07:27 UTC

    With 10 million entries, the size is getting in your way, and it may even cause swapping (I have no idea about your process size limits). If that happens, speed will be your first problem.

    As you didn't mention other resource limits or process requirements, I'll just note that I created Tie::Hash::DBD to "fix" a similar problem. In my case, my hash ran to a few hundred thousand entries, and tieing the hash with DB_File was not a solution, as it could not cope. As I was using a database anyway, I thought I might as well use it. The hash got a lot slower at first, but the overall process time was halved, and with the option to "keep" the hash in the database, subsequent processes gained a lot.

    Given how you describe your problem, Tie::Hash::DBD will probably not solve it, but it might be something to look at in cases where it does apply.
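    For the curious, tying looks roughly like this, assuming Tie::Hash::DBD and DBD::SQLite are installed; the DSN, table name, and data are illustrative:

```perl
use strict;
use warnings;
use Tie::Hash::DBD;

# Back the hash with an SQLite database instead of RAM; the named
# table ("inode_md5" is made up) lets the data be kept around for
# subsequent processes.
tie my %cache, 'Tie::Hash::DBD', 'dbi:SQLite:dbname=cache.db',
    { tbl => 'inode_md5' };

$cache{42} = 'deadbeef' x 4;          # stored in the database
print $cache{42}, "\n";

untie %cache;
```

Lookups become database round trips, so individual accesses are slower, which matches the trade-off described above.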


    Enjoy, Have FUN! H.Merijn
Re: Store large hashes more efficiently
by tmharish (Friar) on Feb 12, 2013 at 14:28 UTC

    I came across a similar problem and my performance hit was very similar to the one described by Tux.

    I got around this by using File::Cache. Considering you seem to be fine with trading run time for memory this might be an option.

    A slightly faster but more CPU-intensive approach (the one I finally settled on) was to selectively keep elements in the hash. Essentially, the solution was a multi-level cache: the first level is the hash, whose elements expire based on either time or frequency of access; when a lookup misses the hash, you fall back to File::Cache. The additional CPU usage comes in when handling the removal of expired hash-cache elements.
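    A rough sketch of that two-level arrangement, with a simple size-based eviction standing in for time/frequency expiry, and assuming File::Cache is installed; all names and the cap are illustrative:

```perl
use strict;
use warnings;
use File::Cache;

my $MAX_HOT = 100_000;                                # level-1 size cap
my $disk    = File::Cache->new({ namespace => 'demo' });
my %hot;                                              # level-1: plain hash

sub cache_set {
    my ($key, $val) = @_;
    %hot = () if keys(%hot) >= $MAX_HOT;              # crude eviction
    $hot{$key} = $val;
    $disk->set($key, $val);                           # level-2: files
}

sub cache_get {
    my ($key) = @_;
    return $hot{$key} if exists $hot{$key};           # level-1 hit
    my $val = $disk->get($key);                       # level-2 lookup
    if (defined $val) {                               # promote on a hit
        %hot = () if keys(%hot) >= $MAX_HOT;
        $hot{$key} = $val;
    }
    return $val;
}
```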

      Interestingly, my previous iteration of the program did something very similar to File::Cache. I created a multi-level decimal directory tree indexed by the most-significant digits of the index, and stored the values in the leaves of the lowest branches. A value could then be looked up by reading the corresponding file.

      I also implemented some simple caching where a hash stored the most recently referenced values to avoid file accesses where possible.

      However, I found the file accesses to be exceedingly slow (10x penalty) and the cache didn't help me much since there was little correlation between neighboring accesses. In fact, this is what led me to just store everything in a hash since the caching-via-filesystem-tree was too slow. And this is what has led me to try to optimize the storage size of the hash...

      Still, it's fascinating that I essentially created my own File::Cache-like approach before discarding it...
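      The directory-tree scheme described above can be sketched like this; the three-digits-per-level layout is an assumption, not the original code:

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Path qw(make_path);

# Map an index to a path by splitting its decimal digits into
# directory levels, e.g. 1234567 -> cache/001/234/567.
sub index_to_path {
    my ($root, $idx) = @_;
    my @parts = sprintf('%09d', $idx) =~ /(\d{3})/g;
    return join '/', $root, @parts;
}

sub store_value {
    my ($root, $idx, $value) = @_;
    my $path = index_to_path($root, $idx);
    make_path(dirname($path));
    open my $fh, '>', $path or die "open $path: $!";
    print {$fh} $value;
    close $fh;
}

sub fetch_value {
    my ($root, $idx) = @_;
    my $path = index_to_path($root, $idx);
    open my $fh, '<', $path or return undef;
    local $/;                                  # slurp the whole file
    my $value = <$fh>;
    close $fh;
    return $value;
}
```

Every lookup is a file open and read, which is consistent with the roughly 10x penalty reported above.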
Re: Store large hashes more efficiently (10e6 md5s in 260MB at 4µs per lookup)
by BrowserUk (Pope) on Feb 13, 2013 at 05:52 UTC
    Any suggestions?

    This code implements a kind of sparse array. It indexes 10 million MD5s by 32-bit integer key in 261MB.

    (Wrapping it in a nice API is left as an exercise :)

    #! perl -slw
    use strict;
    use Digest::MD5 qw[ md5 ];
    use Devel::Size qw[ total_size ];
    use Time::HiRes qw[ time ];
    $|++;

    our $N //= 10e6;
    my $inc = int( 2**32 / $N );

    my @lookup;
    my $c = 0;
    my $start = time;
    for( my $i = 0; $i < 2**32; $i += $inc ) {
        my $key = $i & 0xfffff;
        my $md5 = md5 $i;
        $lookup[ $key ] .= pack 'Va16', $i, $md5;
        ++$c;
    }
    printf "Insertion took: %f seconds\n", time() - $start;
    print "$c md5s indexed";

    $start = time;
    my( $hits, $misses ) = ( 0, 0 );
    for( my $i = 0; $i < 2**32; $i += $inc ) {
        my $key = $i & 0xfffff;
        my $md5 = md5 $i;
        my $p = 0;
        while( $p = 1 + index $lookup[ $key ], pack( 'V', $i ), $p ) {
            next if ( $p - 1 ) % 20;
            $md5 eq substr( $lookup[ $key ], $p + 3, 16 ) ? ++$hits : ++$misses;
            last;
        }
    }
    printf "\nLookup took: %f seconds\n", time() - $start;
    print "hits:$hits misses:$misses";
    printf "Memory: %f MB\n", total_size( \@lookup ) / 1024**2;

    __END__
    C:\test>1018287 -N=10e6
    Insertion took: 30.009396 seconds
    10011579 md5s indexed

    Lookup took: 39.267018 seconds
    hits:10011579 misses:0
    Memory: 261.147011 MB

    C:\test>1018287 -N=20e6
    Insertion took: 59.107169 seconds
    20069941 md5s indexed

    Lookup took: 88.747249 seconds
    hits:20069941 misses:0
    Memory: 426.243149 MB

    Should blow away any disc-based DB mechanism.

      Very clever!

      If I am understanding this correctly, you are using the 20 LSBs as the array index, so every 32-bit integer with the same 20 LSBs gets packed into the same lookup string, meaning that there could be up to 2^12 = 4096 entries per slot.

      However, since I am packing inodes, and since inodes are more-or-less sequentially assigned, it would seem that until 2^20 inodes have been assigned there is no packing at all. So wouldn't it be better to keep the MSBs rather than the LSBs, since the LSBs would be relatively uniformly distributed in most disk-usage cases? I.e., wouldn't it be better to do something like:
      my $key = $i & 0xfffff000;
      or even maybe:
      my $key = ($i & 0xfffff000) >> 12;
      Indeed, your masking may partially explain why for 10e6 indexes, it takes 260MB, while doubling to 20e6 only increases to 426MB.

      Also, since there are at most 2^12 entries per slot, couldn't you save 2 bytes by packing just the bits that are not part of the index, using a 'v' rather than a 'V'? I.e., in my masking scheme:
      $lookup[ $key ] .= pack 'va16', $i & 0xfff, $md5;

      I am no perl monk, so I may be missing something of course...

      Finally, it might be interesting to play with masking different amount of bits to see the space-saving vs. lookup time tradeoffs for different degrees of sparseness.
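      As a starting point for that experiment, a tiny sketch (not from the thread) that counts how many slots each LSB mask width actually uses; the sequential-ish key stream is made up:

```perl
use strict;
use warnings;

# Sequentially assigned "inodes" (illustrative): multiples of 3.
my @keys = map { $_ * 3 } 1 .. 100_000;

for my $bits (16, 20, 24) {
    my $mask = (1 << $bits) - 1;
    my %slots;
    $slots{ $_ & $mask }++ for @keys;
    printf "%2d-bit LSB mask: %6d slots used\n", $bits, scalar keys %slots;
}
# prints:
#   16-bit LSB mask:  65536 slots used
#   20-bit LSB mask: 100000 slots used
#   24-bit LSB mask: 100000 slots used
```

With all keys below 2**20, a 20-bit (or wider) LSB mask gives one entry per slot, i.e. no packing, which is exactly the concern raised above about sequential inodes.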
        Finally, it might be interesting to play with masking different amount of bits to see the space-saving vs. lookup time tradeoffs for different degrees of sparseness.

        By all means; play.


Node Type: perlquestion [id://1018287]
Approved by BrowserUk