Re: Store larg hashes more efficiently (10e6 md5s in 260MB at 4盜 per lookup)

in reply to Store large hashes more efficiently

Any suggestions?

This code implements a kind of sparse array. It indexes 10 million MD5s using 32-bit integers using 261MB.

(Wrapping it over in a nice API is left as an exercise :):

#! perl -slw
use strict;
use Digest::MD5 qw[ md5 ];
use Devel::Size qw[ total_size ];
use Time::HiRes qw[ time ];

$|++;
our $N //= 10e6;
my $inc = int( 2**32 / $N );

my @lookup;
my $c = 0;
my $start = time;
for( my $i = 0; $i < 2**32; $i += $inc ) {
    my $key = $i & 0xfffff;
    my $md5 = md5 $i;
    $lookup[ $key ] .= pack 'Va16', $i, $md5;
    ++$c;
}
printf "Insertion took: %f seconds\n", time() - $start;
print "$c md5s indexed";

$start = time;
my( $hits, $misses ) = ( 0,0 );
for( my $i = 0; $i < 2**32; $i += $inc ) {
    my $key = $i & 0xfffff;
    my $md5 = md5 $i;
    my $p = 0;
    while( $p = 1+index $lookup[ $key ], pack( 'V', $i ), $p ) {
        next if ( $p - 1 ) % 20;
        $md5 eq substr( $lookup[ $key ], $p+3, 16 )
            ? ++$hits
            : ++$misses;
        last;
    }
}
printf "\nLookup took: %f seconds\n", time() - $start;
print "hits:$hits misses:$misses";
printf "Memory: %f MB\n", total_size( \@lookup ) / 1024**2;

__END__
C:\test>1018287 -N=10e6
Insertion took: 30.009396 seconds
10011579 md5s indexed

Lookup took: 39.267018 seconds
hits:10011579 misses:0
Memory: 261.147011 MB

C:\test>1018287 -N=20e6
Insertion took: 59.107169 seconds
20069941 md5s indexed

Lookup took: 88.747249 seconds
hits:20069941 misses:0
Memory: 426.243149 MB
[download]

Should blow away any disc-based DB mechanism.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

In Section Seekers of Perl Wisdom