PerlMonks  

is it possible to have too many files

by Anonymous Monk
on Jan 21, 2011 at 14:58 UTC ( [id://883543]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I was asked to break a file into 10,000 files based on a key column to speed up access. Is there any problem with having 10,000 files in a directory in Linux? Only one file is open at a time!

Replies are listed 'Best First'.
Re: is it possible to have too many files
by MidLifeXis (Monsignor) on Jan 21, 2011 at 15:18 UTC

    It depends on the file system. If possible, you may want to break it down into a couple of directory levels.

    • Some file systems choke on large directories (including classical unix directories).
    • If you have to scan the directory to find the file that you want (depends, again, on the structure used on the file system), you will end up, on average, with N/2 comparisons to find the file you want. If you break this down into d additional levels of subdirectories, it can become an N^(1/2^d) function instead.

    Let's assume that your key column is a 4-digit number (0000-9999). If you store 100 files per directory, you would store the files for keys 1200-1299 in 12/$key.
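    As a sketch of that key-to-directory mapping (bucket_path is a hypothetical helper name; it assumes a 4-digit numeric key bucketed by its first two digits):

    ```perl
    use strict;
    use warnings;

    # Map a 4-digit key to a bucket directory named after its
    # first two digits, so keys 1200-1299 all land in 12/.
    sub bucket_path {
        my ($key) = @_;
        my $dir = substr sprintf('%04d', $key), 0, 2;
        return "$dir/$key";
    }

    print bucket_path(1234), "\n";   # 12/1234
    ```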

    Update: It is possible that the file system is already a database (IIRC, BeOS had something like this), in which case, you may not have a problem with a large directory. I think that this is typically not the case.

    Update 2: See also Re: Efficient processing of large directory, another node that references qmail's use of multi-level directories for its queuing system, and the reasoning behind it.

    Update 3: Fixed big-oh of calculations. Thanks JavaFan.

    --MidLifeXis

      If you break this down into one or two additional levels of subdirectories, it can become a log(N) function instead.
      With one layer of subdirectories, it'll become the square root of N. With two layers, it'll become the fourth root of N.

      You'd need Ω(log N) layers of directories to bring it down to O(log N) search time.

        Fixed. Thanks.

        --MidLifeXis

Re: is it possible to have too many files (history)
by tye (Sage) on Jan 21, 2011 at 18:16 UTC

    It used to be that Unix file systems almost always implemented directories as just a linear array of entries. In such a system, having 10,000 files in a (single) directory could easily make a simple "ls" take many minutes to run and make even stat of a single file take quite a long time.

    If you deployed a Unix system today, it is very likely that you'd get file systems that implement directories that include a tree-based index so that finding an entry for a named file is O(log $N) when there are $N files in the directory rather than O($N). So having 10,000 files in a directory would have roughly (some small multiple of) the performance of having 13 files in an old-style directory.

    So, if you have a file system that was created two decades ago (and not recreated since), then you should be very worried about putting 10,000 files into a single directory. If you have a more modern file system, then it is likely that 10,000 files in a single directory is not a huge problem. Even if you are sure that you have a modern file system, you should still test the performance impact of your 10,000-file solution before committing to it.

    - tye        

Re: is it possible to have too many files (block size)
by Anonyrnous Monk (Hermit) on Jan 21, 2011 at 19:46 UTC

    You haven't said what size your file(s) are, but another consideration might be that using lots of small files often wastes considerable disk space, because of the granularity (block size) the filesystem uses for storing data. Files smaller than the block size still need one full block to be stored.

    Say your 10,000 files are 100 bytes each: with a moderate block size of 4096 bytes, you'd be wasting around 97% of the disk space compared to storing the data in one file, and even block sizes of 8K or 16K aren't unusual these days. In other words, there are always trade-offs...
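    That arithmetic can be checked directly (a sketch using the numbers from the post):

    ```perl
    use strict;
    use warnings;
    use POSIX qw(ceil);

    # 10_000 files of 100 bytes each, on a filesystem with
    # 4096-byte allocation blocks.
    my ($files, $size, $block) = (10_000, 100, 4096);

    my $data      = $files * $size;                          # bytes of actual data
    my $allocated = $files * $block * ceil($size / $block);  # bytes consumed on disk

    printf "wasted: %.1f%%\n", 100 * (1 - $data / $allocated);
    ```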

Re: is it possible to have too many files
by eff_i_g (Curate) on Jan 21, 2011 at 15:10 UTC
    1. Why aren't you using a database?
    2. If a database is out of the picture, can't you divvy these into subdirectories instead of one large one?
    3. Unix systems have inode limits; google to find specific information on your distro.
      Unix systems have inode limits; google to find specific information on your distro
      You'd need a pretty small filesystem for 10,000 inodes to become a problem.

      Also, the number of inodes is typically dependent on the size of your file system, although it can be set manually when creating the filesystem. To find out how many inodes your filesystem has, and how many are available, instead of googling, it may be a lot easier to run df -i. (My ext3 filesystems seem to have about a quarter of a million inodes per GB.)

Re: is it possible to have too many files
by Corion (Patriarch) on Jan 21, 2011 at 15:46 UTC
Re: is it possible to have too many files
by ahmad (Hermit) on Jan 21, 2011 at 15:35 UTC

    Well, yesterday I tried storing around 3,000 files in a directory and it worked.

    The simplest answer to your question is: try it.

    You can do something like this:

    use strict;
    use warnings;

    for my $i (1 .. 10_000) {
        open my $fh, '>', "dir/$i.txt" or die "Can't create dir/$i.txt: $!";
        close $fh;
    }
    If you end up with 10,000 files without any problem, then it works; otherwise it doesn't.
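    A sketch extending that test to also time lookups once the files are in place (using the core modules File::Temp for a throwaway directory and Time::HiRes for timing):

    ```perl
    use strict;
    use warnings;
    use File::Temp qw(tempdir);
    use Time::HiRes qw(gettimeofday tv_interval);

    # Populate a temporary directory with 10_000 small files.
    my $dir = tempdir(CLEANUP => 1);
    for my $i (1 .. 10_000) {
        open my $fh, '>', "$dir/$i.txt" or die "Can't create $dir/$i.txt: $!";
        close $fh;
    }

    # Time 10_000 stat() calls against the populated directory.
    my $t0 = [gettimeofday];
    stat "$dir/$_.txt" for 1 .. 10_000;
    printf "10_000 stats took %.3fs\n", tv_interval($t0);
    ```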

Re: is it possible to have too many files
by sundialsvc4 (Abbot) on Jan 21, 2011 at 18:10 UTC

    What I usually do is, first, perform all lookup of information through a database, not the file system.   Then, for storage of those items, I use a nested-subdirectory structure.

    So, for example, “problem report #1234567” might be stored in /probs/123/456/prob1234567.xml.   And the database-entry would contain this full pathname to this file.

    This (arbitrary example) filing arrangement makes it easy to locate things (whether you are using the database or doing something manually), and it entirely avoids any issues associated with putting an impractical number of filenames in a single directory.

    Since the database contains a full pathname, it also leaves open the possibility of rearranging the information in the future, or of making some impromptu change or exception to it.   The item’s location is set when the item is stored, and the item is properly (and arbitrarily) cataloged.   Much like we did when libraries actually had card catalogs (and paper books).
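    A sketch of how such a nested path might be derived (prob_path is a hypothetical helper; it assumes a numeric id of up to seven digits, zero-padded and split into two 3-digit directory levels as in the example above):

    ```perl
    use strict;
    use warnings;

    # Build /probs/123/456/prob1234567.xml from id 1234567:
    # zero-pad the id, then take two 3-digit prefixes as directory levels.
    sub prob_path {
        my ($id) = @_;
        my ($d1, $d2) = sprintf('%07d', $id) =~ /^(\d{3})(\d{3})/;
        return "/probs/$d1/$d2/prob$id.xml";
    }

    print prob_path(1234567), "\n";  # /probs/123/456/prob1234567.xml
    ```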

Node Type: perlquestion [id://883543]
Approved by marto