Re: Re (tilly) 1: I'm falling asleep here

by Fletch (Bishop)
on Oct 21, 2001 at 05:31 UTC


in reply to Re (tilly) 1: I'm falling asleep here
in thread -s takes too long on 15,000 files

No, ls does indeed make stat (and/or lstat) calls. Here's a chunk of the output from running truss ls -l . on FreeBSD:

    $ truss ls -l |& grep stat
    ...
    lstat("Makefile",0x809d24c)  = 0 (0x0)
    lstat("cmp.c",0x809d348)     = 0 (0x0)
    lstat("extern.h",0x809d44c)  = 0 (0x0)
    lstat("ls.1",0x809d548)      = 0 (0x0)
    ...

The output will look very similar on other OSen as well. Depending on the implementation of ls it's either calling opendir and stat itself, or it's using a library (fts(3) on FreeBSD) to traverse things. But underneath they're all doing the same thing and making the same system calls.
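For comparison, a plain Perl scan makes the same system calls underneath; here's a minimal sketch (the target directory `.' and the output format are my own illustration):

    use strict;
    use warnings;

    # Roughly what ls -l does underneath: opendir/readdir the directory,
    # then lstat(2) each entry to get its metadata.
    opendir my $dh, '.' or die "opendir: $!";
    while (defined(my $entry = readdir $dh)) {
        my @info = lstat $entry;                       # one lstat per entry, like ls
        printf "%8d %s\n", $info[7], $entry if @info;  # size, name
    }
    closedir $dh;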

You are correct that many Unix filesystems don't cope well with large numbers of entries, since locating a name within a given directory is a linear scan (more than a few hundred entries may show performance problems, and anything over 1k will more than likely do so).

A solution to this (if you can change the way things are structured on disk but can't switch to a fancier filesystem) is to use the initial characters of each entry as a key and add a second level of subdirs, one per key. For example, if your files are named with hexadecimal numbers, you might have a top-level directory containing subdirs `00' through `ff'. A file named `ace0beef' would then live at `toplevel/ac/ace0beef'. If those first-level directories start getting crowded, just add another layer of subdirs underneath each of them. As long as you abstract out the mapping, you can change the underlying storage hierarchy without having to alter your programs.
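A minimal sketch of that mapping in Perl (the names bucketed_path and toplevel are made up for illustration):

    use strict;
    use warnings;
    use File::Path qw(mkpath);

    # Map a hex-named file into a two-level hierarchy keyed on its first
    # two characters: "ace0beef" -> "toplevel/ac/ace0beef".
    sub bucketed_path {
        my ($top, $name) = @_;
        my $key = substr($name, 0, 2);
        return "$top/$key/$name";
    }

    my $path = bucketed_path('toplevel', 'ace0beef');
    mkpath('toplevel/ac');        # create the bucket dir if it doesn't exist
    print "$path\n";              # toplevel/ac/ace0beef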

Re (tilly) 3: I'm falling asleep here
by tilly (Archbishop) on Oct 21, 2001 at 06:34 UTC
    The example code was running on Windows, not Unix.

    However, your other points are true. I admit that I am guessing as to how the Windows dir function is running so much faster than the simple Perl shown.

    But one note: it may be that your filenames don't divide well on the first few characters (so one subdirectory gets a ton of files while the rest stay nearly empty). In that case the above scheme can be improved by first taking an MD5 hash of the filename, and then placing files into directory locations based on the characters of the MD5 hash (sketched below).

    (-:At which point your on-disk data storage is starting to be a frozen form of efficient data structures you might learn about in an algorithms course...:-)
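    A minimal sketch of that MD5 variant, assuming the standard Digest::MD5 module (md5_bucketed_path is a made-up name):

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    # Bucket on the first two hex digits of the name's MD5 so files spread
    # evenly even when the raw names cluster on a common prefix.
    sub md5_bucketed_path {
        my ($top, $name) = @_;
        my $key = substr(md5_hex($name), 0, 2);
        return "$top/$key/$name";
    }

    print md5_bucketed_path('toplevel', 'ace0beef'), "\n";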

      If only it were that simple... the files just need to be all in one directory for another program (one I didn't write) that parses them.

      Like I said, ls on Linux takes about 10 seconds for all those files, so it still leaves me guessing.

      I'm also thinking MD5 is a bit of overkill for 15,000 rather short filenames; a simple rotating hash would be quicker, and more efficient (one formulation is sketched below).

      =)
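      A minimal sketch of one simple rotating hash (the poster doesn't name a specific variant; this shift-xor formulation is a common textbook one, and rotating_hash is a made-up name):

    # Rotate the accumulator and fold in each character, then take the
    # result modulo the number of buckets.
    sub rotating_hash {
        my ($name, $buckets) = @_;
        my $h = 0;
        for my $ch (split //, $name) {
            $h = (($h << 4) ^ ($h >> 28) ^ ord($ch)) & 0xFFFFFFFF;
        }
        return $h % $buckets;
    }

    printf "%02x\n", rotating_hash('ace0beef', 256);   # two-hex-char bucket key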
