Re: Re (tilly) 1: I'm falling asleep here

by Fletch (Bishop)
on Oct 21, 2001 at 05:31 UTC


in reply to Re (tilly) 1: I'm falling asleep here
in thread -s takes too long on 15,000 files

No, ls does indeed make stat (and/or lstat) calls. Here's a chunk of the output from running truss ls -l . on FreeBSD:

    $ truss ls -l |& grep stat
    ...
    lstat("Makefile",0x809d24c)  = 0 (0x0)
    lstat("cmp.c",0x809d348)     = 0 (0x0)
    lstat("extern.h",0x809d44c)  = 0 (0x0)
    lstat("ls.1",0x809d548)      = 0 (0x0)
    ...

The output will look very similar on other OSen as well. Depending on the implementation of ls it's either calling opendir and stat itself, or it's using a library (fts(3) on FreeBSD) to traverse things. But underneath they're all doing the same thing and making the same system calls.
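For comparison, a plain Perl scan makes the same system calls underneath; here's a minimal sketch (the target directory `.' and the output format are my own illustration):

    use strict;
    use warnings;

    # Roughly what ls -l does underneath: opendir/readdir the directory,
    # then lstat(2) each entry to get its metadata.
    opendir my $dh, '.' or die "opendir: $!";
    while (defined(my $entry = readdir $dh)) {
        my @info = lstat $entry;                       # one lstat per entry, like ls
        printf "%8d %s\n", $info[7], $entry if @info;  # size, name
    }
    closedir $dh;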

You are correct that many Unix filesystems don't cope well with large numbers of entries, since locating a name within a given directory is a linear scan (more than a few hundred entries may show performance problems, and anything over 1k will more than likely do so).

A solution to this (if you can change the way things are structured on disk but can't switch to a fancier filesystem) is to use the initial characters of each entry as a key and add a second level of subdirs, one per key. For example, if your files are named with hexadecimal numbers, you might have a top-level directory containing subdirs `00' through `ff'. A file named `ace0beef' would then live at `toplevel/ac/ace0beef'. If those first-level directories start getting crowded, just add another layer of subdirs underneath each of them. As long as you abstract out the mapping, you can change the underlying storage hierarchy without having to alter your programs.
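A minimal sketch of that mapping in Perl (the names bucketed_path and toplevel are made up for illustration):

    use strict;
    use warnings;
    use File::Path qw(mkpath);

    # Map a hex-named file into a two-level hierarchy keyed on its first
    # two characters: "ace0beef" -> "toplevel/ac/ace0beef".
    sub bucketed_path {
        my ($top, $name) = @_;
        my $key = substr($name, 0, 2);
        return "$top/$key/$name";
    }

    my $path = bucketed_path('toplevel', 'ace0beef');
    mkpath('toplevel/ac');        # create the bucket dir if it doesn't exist
    print "$path\n";              # toplevel/ac/ace0beef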

Re (tilly) 3: I'm falling asleep here
by tilly (Archbishop) on Oct 21, 2001 at 06:34 UTC
    The example code was running on Windows, not Unix.

    However, your other points are true. I admit that I am guessing as to how the Windows dir function is running so much faster than the simple Perl shown.

    But one note: it may be that your filenames don't divide well on the first few characters (so one subdirectory gets a ton of files while the rest stay nearly empty). In that case the above scheme can be improved by first taking an MD5 hash of the filename, and then placing files into directory locations based on the characters of the MD5 hash (sketched below).

    (-:At which point your on-disk data storage is starting to be a frozen form of efficient data structures you might learn about in an algorithms course...:-)
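    A minimal sketch of that MD5 variant, assuming the standard Digest::MD5 module (md5_bucketed_path is a made-up name):

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    # Bucket on the first two hex digits of the name's MD5 so files spread
    # evenly even when the raw names cluster on a common prefix.
    sub md5_bucketed_path {
        my ($top, $name) = @_;
        my $key = substr(md5_hex($name), 0, 2);
        return "$top/$key/$name";
    }

    print md5_bucketed_path('toplevel', 'ace0beef'), "\n";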

      If only it were that simple... the files just need to be all in one directory for another program (one I didn't write) that parses them.

      Like I said, ls on Linux takes about 10 seconds for all those files, so it still leaves me guessing.

      I'm also thinking MD5 is a bit of overkill for 15,000 rather short filenames; a simple rotating hash would be quicker, and more efficient (one formulation is sketched below).

      =)
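      A minimal sketch of one simple rotating hash (the poster doesn't name a specific variant; this shift-xor formulation is a common textbook one, and rotating_hash is a made-up name):

    # Rotate the accumulator and fold in each character, then take the
    # result modulo the number of buckets.
    sub rotating_hash {
        my ($name, $buckets) = @_;
        my $h = 0;
        for my $ch (split //, $name) {
            $h = (($h << 4) ^ ($h >> 28) ^ ord($ch)) & 0xFFFFFFFF;
        }
        return $h % $buckets;
    }

    printf "%02x\n", rotating_hash('ace0beef', 256);   # two-hex-char bucket key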
