I forgot to add...
The 'best' way I've found to store info is to key each file by a base32-encoded digest of the file.
Why? Well, most filesystems start to have issues at some point after 1000 items per directory. Using a base32 representation, you can hit the sweet spot.
With base32 and 2 chars per bin, you get 1024 bins per depth (32*32). With the standard hex-based md5 representation, your options are either 256 buckets per depth (2 chars, 16*16) or 4096 (3 chars, 16*16*16).
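A minimal sketch of that bucketing in Python, assuming the digest is md5 over the file contents, lowercased with the base32 padding stripped. The function names, the lowercase choice, and the default of 3 levels are illustrative assumptions, not something the post specifies.

```python
import base64
import hashlib


def base32_digest(data: bytes) -> str:
    """Return the md5 digest of data as lowercase base32 without '=' padding."""
    raw = hashlib.md5(data).digest()  # 16 raw bytes
    encoded = base64.b32encode(raw).decode("ascii")
    return encoded.rstrip("=").lower()  # 26 characters for a 16-byte digest


def bucket_path(digest: str, levels: int = 3, chars_per_level: int = 2) -> str:
    """Build a nested path such as 'ab/cd/ef/<digest>' from the digest prefix.

    With base32 and 2 chars per level, each directory holds at most
    32 * 32 = 1024 subdirectories.
    """
    parts = [
        digest[i * chars_per_level:(i + 1) * chars_per_level]
        for i in range(levels)
    ]
    return "/".join(parts + [digest])


if __name__ == "__main__":
    d = base32_digest(b"example file contents")
    print(bucket_path(d))
```

Stripping the padding and lowercasing are only there to keep the directory names tidy; whatever convention you pick has to be the same in every language that touches the store.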
In my personal use, I haven't seen all that much of a difference between 1024 and 4096 buckets, though I've seen a slight one. It's nowhere near as drastic as the performance gap between either of those and 10k, though.
Since I'm lazy and don't have tight performance constraints, I just go with 2 levels deep of 4096 hashing. But if I had more time, I'd definitely go with 3 levels of 1024 hashing.
(The time / laziness is a factor because every language supports md5 as base16 or base64 by default, but none has a base32 default, which is a PITA if you're managing an architecture accessed by multiple languages at once.)
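For comparison, the lazy option (2 levels of 4096 buckets) needs nothing beyond the stock hex digest. A rough sketch in Python; the function name and the path layout are assumptions made for illustration.

```python
import hashlib


def hex_bucket_path(data: bytes, levels: int = 2, chars_per_level: int = 3) -> str:
    """Build a path such as 'abc/def/<hexdigest>' from the hex md5 of data.

    With hex and 3 chars per level, each directory holds at most
    16 * 16 * 16 = 4096 subdirectories.
    """
    digest = hashlib.md5(data).hexdigest()  # 32 hex characters
    parts = [
        digest[i * chars_per_level:(i + 1) * chars_per_level]
        for i in range(levels)
    ]
    return "/".join(parts + [digest])


if __name__ == "__main__":
    print(hex_bucket_path(b"example file contents"))
```

Since a hex md5 string comes out of pretty much every language's standard library, each client can compute the same path without agreeing on a base32 alphabet or padding rule first.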