http://www.perlmonks.org?node_id=1018483


in reply to Store large hashes more efficiently

Any suggestions?

This code implements a kind of sparse array. It indexes 10 million MD5s using 32-bit integers using 261MB.

(Wrapping it over in a nice API is left as an exercise :):

#! perl -slw use strict; use Digest::MD5 qw[ md5 ]; use Devel::Size qw[ total_size ]; use Time::HiRes qw[ time ]; $|++; our $N //= 10e6; my $inc = int( 2**32 / $N ); my @lookup; my $c = 0; my $start = time; for( my $i = 0; $i < 2**32; $i += $inc ) { my $key = $i & 0xfffff; my $md5 = md5 $i; $lookup[ $key ] .= pack 'Va16', $i, $md5; ++$c; } printf "Insertion took: %f seconds\n", time() - $start; print "$c md5s indexed"; $start = time; my( $hits, $misses ) = ( 0,0 ); for( my $i = 0; $i < 2**32; $i += $inc ) { my $key = $i & 0xfffff; my $md5 = md5 $i; my $p = 0; while( $p = 1+index $lookup[ $key ], pack( 'V', $i ), $p ) { next if ( $p - 1 ) % 20; $md5 eq substr( $lookup[ $key ], $p+3, 16 ) ? ++$hits : ++$misses; last; } } printf "\nLookup took: %f seconds\n", time() - $start; print "hits:$hits misses:$misses"; printf "Memory: %f MB\n", total_size( \@lookup ) / 1024**2; __END__ C:\test>1018287 -N=10e6 Insertion took: 30.009396 seconds 10011579 md5s indexed Lookup took: 39.267018 seconds hits:10011579 misses:0 Memory: 261.147011 MB C:\test>1018287 -N=20e6 Insertion took: 59.107169 seconds 20069941 md5s indexed Lookup took: 88.747249 seconds hits:20069941 misses:0 Memory: 426.243149 MB

Should blow away any disc-based DB mechanism.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
  • Comment on Re: Store larg hashes more efficiently (10e6 md5s in 260MB at 4µs per lookup)
  • Download Code

Replies are listed 'Best First'.
Re^2: Store larg hashes more efficiently (10e6 md5s in 260MB at 4µs per lookup)
by puterboy (Scribe) on Feb 26, 2013 at 21:06 UTC
    Very clever!

    If I am understanding this correctly, basically you are masking the 20 LSB so that every 32 bit integer with the same 12 MSB will get packed into the same hash value string, meaning that there could be up to 2^12 = 4096 entries per hash key.

    However, since I am packing inodes and since inodes are more-or-less sequentially assigned, it would seem that until 2^20 inodes have been assigned, there is no packing. So, wouldn't it be better to mask the MSB, rather than the LSB since the LSB would be relatively randomly uniformly assigned in most disk usage cases. i.e., wouldn't it be better to do something like:
    my $key = $i & 0xfffff000;
    or even maybe:
    my $key = ($i & 0xfffff000) >> 3;
    Indeed, your masking may partially explain why for 10e6 indexes, it takes 260MB, while doubling to 20e6 only increases to 426MB.

    Also, since there are at most 2^12 duplicates, couldn't you save 2 bytes by packing just the parts that are not masked by fffff, so that you could pack it in a 'v', rather than a 'V'. i.e,. in my masking scheme:
    $lookup[ $key ] .= pack 'va16', $i & 0xfff, $md5;

    I am no perl monk, so I may be missing something of course...

    Finally, it might be interesting to play with masking different amount of bits to see the space-saving vs. lookup time tradeoffs for different degrees of sparseness.
      Finally, it might be interesting to play with masking different amount of bits to see the space-saving vs. lookup time tradeoffs for different degrees of sparseness.

      By all means; play.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        I would be happy to play... But before that I want to make sure that my two suggestions for improvement are valid since I am at best a Perl neophyte...