Store large hashes more efficiently

by puterboy (Scribe)
on Feb 12, 2013 at 03:55 UTC

puterboy has asked for the wisdom of the Perl Monks concerning the following question:

I am using a large hash (millions of entries) as a cache.

The keys are (sparsely spaced) unsigned integers.
The values are a 32-hex-character string followed by an optional unsigned integer.

Storing the hash normally as $hash{<u_integer>}=<32 hex> takes about 190 bytes per entry (as determined by using Devel::Size).

However, knowing the format, it seems like I should be able to pack the 32 hex characters into 16 bytes. Similarly, the optional unsigned integer should pack more compactly than a character string.

Perhaps I could even save on the key storage, knowing that the keys are unsigned integers.

I am looking for better "packing", not any fancy compression scheme. Note that I tried various combinations of pack, such as pack("H32l", <32hex>, <uint>), but that got me only about a 25% saving. There must be a better way of packing (assuming I am willing to sacrifice a little speed). I mean, if the key is o(4) bytes and the value is o(16-20) bytes, I would think I could do better than o(150-200) bytes per entry, which is almost 90% overhead. Or maybe hashes are by necessity that inefficient...
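For concreteness, a minimal sketch of the kind of per-value packing I mean (sample values only; the measured saving is much smaller than the value shrinkage because of Perl's per-entry overhead):

use strict;
use warnings;
use Devel::Size qw( total_size );

my $key = 123_456_789;                         # sparse unsigned integer key
my $hex = 'd41d8cd98f00b204e9800998ecf8427e';  # 32 hex chars
my $opt = 42;                                  # optional unsigned integer

my %plain;
$plain{ $key } = $hex . $opt;                  # stored as a ~35 char string

my %packed;
$packed{ $key } = pack 'H32 V', $hex, $opt;    # 16 bytes + 4 bytes

printf "plain:  %d bytes\n", total_size( \%plain );
printf "packed: %d bytes\n", total_size( \%packed );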

Any suggestions?

Replies are listed 'Best First'.
Re: Store large hashes more efficiently
by Tux (Canon) on Feb 12, 2013 at 07:27 UTC

    With 10 million, the size is getting in your way, and maybe even causes swapping (I have no idea about your process size limits). If that happens, speed will be your first problem.

    As you didn't mention other resource limits or process requirements, I just wanted to note that I created Tie::Hash::DBD to "fix" a similar problem. In my case, my hash ran into a couple of 100_000 entries, and tying the hash with DB_File was not a solution, as it could not cope. As I was using a database anyway, I thought I might use it. The hash got a lot slower in the beginning, but the overall process time was halved, and with the option to "keep" the hash in the database, subsequent processes gained a lot.

    Given how you described your problem, Tie::Hash::DBD will probably not solve the problem at hand, but it might be something to look at when it does fit.
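    A minimal sketch of what the tie looks like (the SQLite DSN and file name here are placeholders, and keeping the backing table around between runs needs the options described in the module's documentation):

    use strict;
    use warnings;
    use Tie::Hash::DBD;

    # Back the hash with an SQLite database instead of keeping it all in RAM.
    # Any DBI DSN (or an already connected $dbh) can be passed here.
    tie my %cache, "Tie::Hash::DBD", "dbi:SQLite:dbname=cache.db";

    $cache{12345} = "d41d8cd98f00b204e9800998ecf8427e";   # write goes to the DB
    print $cache{12345}, "\n";                            # read comes back from it

    untie %cache;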


    Enjoy, Have FUN! H.Merijn
Re: Store large hashes more efficiently (10e6 md5s in 260MB at 4µs per lookup)
by BrowserUk (Patriarch) on Feb 13, 2013 at 05:52 UTC
    Any suggestions?

    This code implements a kind of sparse array. It indexes 10 million MD5s, keyed by 32-bit integers, in 261MB.

    (Wrapping it in a nice API is left as an exercise :)

    #! perl -slw
    use strict;
    use Digest::MD5 qw[ md5 ];
    use Devel::Size qw[ total_size ];
    use Time::HiRes qw[ time ];

    $|++;

    our $N //= 10e6;
    my $inc = int( 2**32 / $N );

    my @lookup;
    my $c = 0;

    my $start = time;
    for( my $i = 0; $i < 2**32; $i += $inc ) {
        my $key = $i & 0xfffff;
        my $md5 = md5 $i;
        $lookup[ $key ] .= pack 'Va16', $i, $md5;
        ++$c;
    }
    printf "Insertion took: %f seconds\n", time() - $start;
    print "$c md5s indexed";

    $start = time;
    my( $hits, $misses ) = ( 0, 0 );
    for( my $i = 0; $i < 2**32; $i += $inc ) {
        my $key = $i & 0xfffff;
        my $md5 = md5 $i;
        my $p = 0;
        while( $p = 1 + index $lookup[ $key ], pack( 'V', $i ), $p ) {
            next if ( $p - 1 ) % 20;
            $md5 eq substr( $lookup[ $key ], $p+3, 16 ) ? ++$hits : ++$misses;
            last;
        }
    }
    printf "\nLookup took: %f seconds\n", time() - $start;
    print "hits:$hits misses:$misses";
    printf "Memory: %f MB\n", total_size( \@lookup ) / 1024**2;

    __END__
    C:\test>1018287 -N=10e6
    Insertion took: 30.009396 seconds
    10011579 md5s indexed

    Lookup took: 39.267018 seconds
    hits:10011579 misses:0
    Memory: 261.147011 MB

    C:\test>1018287 -N=20e6
    Insertion took: 59.107169 seconds
    20069941 md5s indexed

    Lookup took: 88.747249 seconds
    hits:20069941 misses:0
    Memory: 426.243149 MB

    Should blow away any disc-based DB mechanism.


      Very clever!

      If I am understanding this correctly, basically you are keeping the 20 LSB as the index, so every 32-bit integer with the same 20 LSB gets packed into the same lookup string, meaning that there could be up to 2^12 = 4096 entries per slot.

      However, since I am indexing inodes, and inodes are more-or-less sequentially assigned, it would seem that until 2^20 inodes have been assigned there is no packing. So wouldn't it be better to key on the MSB rather than the LSB, since the LSB would be fairly uniformly distributed in most disk-usage cases? I.e., wouldn't it be better to do something like:
      my $key = $i & 0xfffff000;
      or even maybe:
      my $key = ($i & 0xfffff000) >> 12;
      Indeed, your masking may partially explain why 10e6 indexes take 260MB while doubling to 20e6 only increases that to 426MB.

      Also, since there are at most 2^12 entries per slot, couldn't you save 2 bytes by packing just the bits not covered by the key, so that you could pack with a 'v' rather than a 'V'? I.e., in my masking scheme:
      $lookup[ $key ] .= pack 'va16', $i & 0xfff, $md5;
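      Put together, and assuming a >> 12 shift so the array index stays below 2**20, the insert and lookup would become something like this (untested):

      # key on the high 20 bits; store only the low 12 bits with each digest
      my $key = ( $i & 0xfffff000 ) >> 12;
      $lookup[ $key ] .= pack 'va16', $i & 0xfff, $md5;

      # lookup: records are now 18 bytes (2 + 16) instead of 20
      my $p = 0;
      while( $p = 1 + index $lookup[ $key ], pack( 'v', $i & 0xfff ), $p ) {
          next if ( $p - 1 ) % 18;
          $md5 eq substr( $lookup[ $key ], $p + 1, 16 ) ? ++$hits : ++$misses;
          last;
      }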

      I am no perl monk, so I may be missing something of course...

      Finally, it might be interesting to play with masking different numbers of bits to see the space-saving vs. lookup-time tradeoffs for different degrees of sparseness.
        Finally, it might be interesting to play with masking different numbers of bits to see the space-saving vs. lookup-time tradeoffs for different degrees of sparseness.

        By all means; play.


Re: Store large hashes more efficiently
by BrowserUk (Patriarch) on Feb 12, 2013 at 04:32 UTC

    How many do you need to store?


      o(10 million)

        Using a standard hash and pack like so, it requires 1.12GB to store the 10e6 key/value pairs:

        $h{ pack 'V', $uintKey } = pack 'H*', $hexValue32;   # 16-byte packed value

        Workable on most modern systems. How much smaller are you looking for?


Re: Store large hashes more efficiently
by tmharish (Friar) on Feb 12, 2013 at 14:28 UTC

    I came across a similar problem and my performance hit was very similar to the one described by Tux.

    I got around this by using File::Cache. Considering you seem to be fine with trading run time for memory, this might be an option.

    A slightly faster but more CPU-intensive approach (the one I finally settled on) was to selectively keep elements in the hash. Essentially the solution was a multi-level cache: the first level is the hash, whose elements expire based on either time or frequency of access, and on a cache miss in the hash you look the key up in File::Cache. Of course, the additional CPU usage comes in when handling the removal of expired hash-cache elements.
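    A rough sketch of that two-level idea (the eviction policy here is deliberately crude, and the File::Cache constructor options are indicative rather than exact; my real code expired entries by time or access frequency):

    use strict;
    use warnings;
    use File::Cache;

    # Level 2: on-disk cache (constructor options are illustrative).
    my $disk = File::Cache->new( { namespace => 'md5cache' } );

    my %hot;                  # Level 1: in-memory hash
    my $MAX_HOT = 100_000;    # arbitrary cap for this sketch

    sub cache_get {
        my ($key) = @_;
        return $hot{$key} if exists $hot{$key};   # level-1 hit
        my $val = $disk->get($key);               # level-2 lookup on a miss
        $hot{$key} = $val if defined $val;
        return $val;
    }

    sub cache_set {
        my ($key, $val) = @_;
        $hot{$key} = $val;
        $disk->set($key, $val);
        %hot = () if keys %hot > $MAX_HOT;        # crude stand-in for real expiry
    }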

      Interestingly, my previous iteration of the program did something very similar to File::Cache. I created a multi-level directory decimal tree indexed by the most-significant digits of the index, and stored the values in the leaves of the lowest branches. The value could then be looked up by reading the corresponding file.
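      Something along these lines, although the split depth and directory layout shown here are only illustrative, not my exact scheme:

      use strict;
      use warnings;
      use File::Path qw( make_path );

      # Map an index to a path like "cache/1/2/3/123456789" using its
      # most-significant digits, then store the value in that leaf file.
      sub index_to_path {
          my ($idx) = @_;
          my @digits = split //, sprintf '%09u', $idx;
          return join '/', 'cache', @digits[ 0 .. 2 ], $idx;
      }

      sub store_value {
          my ( $idx, $value ) = @_;
          my $path = index_to_path($idx);
          ( my $dir = $path ) =~ s{/[^/]+\z}{};
          make_path($dir);
          open my $fh, '>', $path or die "open $path: $!";
          print {$fh} $value;
          close $fh;
      }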

      I also implemented some simple caching where a hash stored the most recently referenced values to avoid file accesses where possible.

      However, I found the file accesses to be exceedingly slow (a 10x penalty), and the cache didn't help me much since there was little correlation between neighboring accesses. In fact, this is what led me to just store everything in a hash, since the caching-via-filesystem-tree was too slow. And this is what has now led me to try to optimize the storage size of the hash...

      Still, it's fascinating that I essentially created my own File::Cache-like approach before discarding it...
