|laziness, impatience, and hubris|
Re^3: Storing/Retrieving images as blobsby MidLifeXis (Monsignor)
|on Sep 26, 2011 at 13:13 UTC||Need Help??|
The OS killer issue alluded to above is when you have a large number of entries in a single directory. In order for the OS to resolve the file name it needs to look up each portion of the file name in the directory. If there are a significant number of entries in a directory, multiple reads on disk may be necessary to find the particular entry.
As an example: assume that you can store 20 directory entries in a block that can be read in a single shot. If you have 100 files, then on average, you will need three disk reads to find the location of a particular file (N/2 comparisons gets you to the third block of five necessary to store 100 entries). This increases to 25 reads for 1000 entries (1000 / 2 => 500, 500 / 20 => 25).
If you can break the directory structure into N parts so that no directory contains more than 20 entries, you can find the file location, on average, in two reads (100 / 20 => 5 entries in each subdirectory). With 2 levels of no more than 20 entries each, you can store and retrieve 400 files with a guaranteed maximum number of reads of 2. 8000 in three, 160K in 4, 3.2M in 5, and so on.
The value of 20 depends on the file system and how directory lookups are done in the OS or the application. A full readdir scan at the application level is less able to be optimized by the OS than having the OS open the file for you. Basically, you need to have a depth of X and number of entries of Y where YX is the number of files you can store (N) before you are willing to reorganize your files.
If the directory entries are actually stored in a database (I think that BeOS had that feature), then you may not need to worry about the read sizes and how many directory entries can fit into a single read.
As an interesting aside, the classical Unix File System used this technique to minimize the number of lookups it had to do to find the data blocks for given files. The inode pointed to by the directory entry had pointers to data blocks for small files, then some pointers to blocks of pointers to data blocks for larger files, and then some pointers to blocks of pointers to blocks of pointers for even larger files. The larger the number of