http://www.perlmonks.org?node_id=1020758


in reply to Re: Store larg hashes more efficiently (10e6 md5s in 260MB at 4µs per lookup)
in thread Store large hashes more efficiently

Very clever!

If I am understanding this correctly, basically you are masking the 20 LSB so that every 32 bit integer with the same 12 MSB will get packed into the same hash value string, meaning that there could be up to 2^12 = 4096 entries per hash key.

However, since I am packing inodes and since inodes are more-or-less sequentially assigned, it would seem that until 2^20 inodes have been assigned, there is no packing. So, wouldn't it be better to mask the MSB, rather than the LSB since the LSB would be relatively randomly uniformly assigned in most disk usage cases. i.e., wouldn't it be better to do something like:
my $key = $i & 0xfffff000;
or even maybe:
my $key = ($i & 0xfffff000) >> 3;
Indeed, your masking may partially explain why for 10e6 indexes, it takes 260MB, while doubling to 20e6 only increases to 426MB.

Also, since there are at most 2^12 duplicates, couldn't you save 2 bytes by packing just the parts that are not masked by fffff, so that you could pack it in a 'v', rather than a 'V'. i.e,. in my masking scheme:
$lookup[ $key ] .= pack 'va16', $i & 0xfff, $md5;

I am no perl monk, so I may be missing something of course...

Finally, it might be interesting to play with masking different amount of bits to see the space-saving vs. lookup time tradeoffs for different degrees of sparseness.

Replies are listed 'Best First'.
Re^3: Store larg hashes more efficiently (10e6 md5s in 260MB at 4µs per lookup)
by BrowserUk (Patriarch) on Feb 27, 2013 at 01:00 UTC
    Finally, it might be interesting to play with masking different amount of bits to see the space-saving vs. lookup time tradeoffs for different degrees of sparseness.

    By all means; play.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I would be happy to play... But before that I want to make sure that my two suggestions for improvement are valid since I am at best a Perl neophyte...

        I have my own projects to take care of. I've supplied you with code that measures both; and you have the problem.

        The only way for either of us to determine whether your suggestions are improvements; is to try them: do they make lookups slower or faster; or use more or less memory?

        If I do it, you remain a neophyte. If you do it, you become less of a neophyte.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.