Efficient processing of large directory

by Elliott (Pilgrim)
on Oct 02, 2003 at 16:31 UTC ( id://295965 )

Elliott has asked for the wisdom of the Perl Monks concerning the following question:

Brethren,

I need to process a directory full of text files. It worked fine up to its planned limit of around 3,000 files, but it has been too successful and the client now has 17,000 files in there!

The code currently looks like this:

foreach $file (<directory/*.txt>) { # do stuff }

Even if "do stuff" merely counts the files, or just prints the name of the first one and then exits, the process takes forever and often times out. (This is under Apache/Linux.) I assume the problem is that Perl (or the OS?) has to read the entire directory listing into memory in order to build the list for the loop.

My thought is to restructure the directory into, say, 100 subdirectories and then process thus:

foreach $subdir (<directory/*>) { foreach $file (<$subdir/*.txt>) { # do stuff } }

Will that help? My thought is that only two lists would be in memory at any one time: one of 100 subdirectories and one of roughly 170 files in the current subdirectory.

Or is there a much better way to do this?

BTW, the client tells me they hope to have 50,000 files in there soon...

Replies are listed 'Best First'.
Re: Efficient processing of large directory (n-tier directories and the Grubby Pages Effect)
by grinder (Bishop) on Oct 02, 2003 at 17:16 UTC

    By definition, one cannot process a large directory efficiently :)

    Find a way to key the file names so that you can move to a multi-level directory structure. This might be the first two characters or last two characters of the filename:

    filename    key  result
    dabhds.txt  d    d/dabhds.txt
    xyzzy.txt   x    x/xyzzy.txt
    43816.txt   16   16/43816.txt
    73813.txt   13   13/73813.txt
    The main point to remember is that given the filename, you can derive the directory it should be in. And if it gets moved to the wrong directory, you can check for it programmatically.
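    A minimal sketch of that idea, assuming a one-character key taken from the front of the name (the directory names and the mkpath/move reorganisation pass are my own illustration, not grinder's exact code):

        use strict;
        use warnings;
        use File::Copy qw(move);
        use File::Path qw(mkpath);

        # Hypothetical keying rule: first character of the filename,
        # as in the first two rows of the table above.
        sub dir_for {
            my ($name) = @_;
            return substr( $name, 0, 1 );
        }

        # One-off reorganisation of the existing flat directory.
        for my $path (<directory/*.txt>) {
            my ($name) = $path =~ m{([^/]+)$};
            my $subdir = "directory/" . dir_for($name);
            mkpath($subdir) unless -d $subdir;
            move( $path, "$subdir/$name" ) or warn "could not move $path: $!";
        }

        # Later, given just a filename, the path is derivable:
        # my $path = "directory/" . dir_for($name) . "/$name";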

    I worked on a web site like yours once. When I was called in, there were more than 600 000 files sitting in one directory. This was on a Linux 2.0 kernel on an ext2 filesystem. The directory entry itself was over 3 megabytes. Needless to say performance suffered...

    We managed to get it into a three-level directory structure (7/78/78123.txt), but it took hours of time at the kernel level because the directory traversal was so slow.

    Bite the bullet and reorganise your files, before it's too late!

    Note that if the filenames are numeric (1232.txt, etc.) you will want to key on the last digits, not the first, otherwise you'll skew the number of files per directory toward the low-numbered ones because of the Grubby Pages Effect.

Re: Efficient processing of large directory
by dws (Chancellor) on Oct 02, 2003 at 17:19 UTC
    It worked fine up to its planned limit of around 3000 files, but it's been too successful and the client now has 17,000 files in there!

    A side note first: Some operating systems are really, really bad with directories that large. If your client is interested in performance, they (or you, on their behalf) may want to do a bit of performance prototyping. Recent work on FreeBSD has greatly improved its large directory performance, for whatever that's worth.

    For your problem, I see two options. The first is to use File::Find to locate all *.txt files and process them one by one. The second is to use opendir()/readdir()/closedir() to read the directory directly, filename by filename. Either one avoids holding a large temporary list in memory.

    You can find plenty of examples of each by using Super Search to look for "File::Find" or "opendir".
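    A minimal sketch of the opendir route, assuming the files live in a directory called "directory" and only *.txt files are wanted:

        use strict;
        use warnings;

        opendir( my $dh, 'directory' ) or die "Cannot open directory: $!";
        while ( defined( my $name = readdir $dh ) ) {
            next unless $name =~ /\.txt$/;    # skips '.', '..' and non-text entries
            my $path = "directory/$name";
            # do stuff with $path, one file at a time
        }
        closedir $dh;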

      It's worth noting that if you're trying to find a subset of the files contained in a subdir, rather than processing them all, then using <*.txt> is considerably faster than using either File::Find or opendir/readdir/closedir. At least that is the case under Win32, where the wildcard matching is done by the OS and only the matching files are passed back.

      In the examples below, the first comparison shows selecting all 17576 files in a subdirectory. In this case, glob and File::Find come out pretty much even.

      In the second comparison, a subset of 676 files is selected from the 17000 using a wildcard. In this case, the glob runs 650% faster as it is only processing the 676, rather than looping over the whole 17000+.

      Of course, if any real processing was being done rather than just counting the files, the difference would rapidly disappear.

      In this case, the OP's use of the word "efficient" most likely referred to the memory used by slurping all 17,000 names into memory rather than to speed, but if not all of those 17,000 files are .txt files, the time saved might be worth having.
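      The comparison described above might look something like the following sketch, using the Benchmark module (the directory name "datadir" and the "AA*.txt" wildcard are placeholders of mine):

          use strict;
          use warnings;
          use Benchmark qw(cmpthese);
          use File::Find;

          cmpthese( -5, {
              # glob: the globbing layer returns only the matching names
              glob_txt => sub {
                  my @files = glob 'datadir/AA*.txt';
              },
              # File::Find: walk every entry, filter in Perl
              find_txt => sub {
                  my @files;
                  find( sub { push @files, $File::Find::name if /^AA.*\.txt$/ },
                        'datadir' );
              },
          } );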



Re: Efficient processing of large directory
by MidLifeXis (Monsignor) on Oct 02, 2003 at 17:25 UTC

    In addition to the while suggestion above, depending on the filesystem you use, your suggestion to hash the directories can be a good one, especially if the system has to read multiple disk blocks to find the file you are looking for.

    The same concept has been applied to mail spools (qmail), and suggested to help speed up access to home directories on hosts with large numbers of "users".

    With the number of files you are considering, you probably want something along the lines of NUMFILES < SPLIT ** DEPTH, where SPLIT is the number of subdirectory entries that can fit in one disk block and DEPTH is how deep your directory structure goes. Once NUMFILES grows beyond that, you start needing multiple directory reads to find the file you want to open.
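    As a rough illustration (my numbers, not a measurement): if a block holds about 100 entries, two levels cover 100 ** 2 = 10,000 files, so 17,000 files, let alone 50,000, would already call for three levels under this rule of thumb. The required depth can be computed directly:

        use strict;
        use warnings;
        use POSIX qw(ceil);

        my $numfiles = 50_000;   # the client's projected file count
        my $split    = 100;      # assumed directory entries per disk block
        my $depth    = ceil( log($numfiles) / log($split) );
        print "directory tree should be $depth levels deep\n";   # 3 for these numbers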

    Add this to the while suggestion above, and you should be able to access each individual file (such as with an open() call) as quickly as the OS can handle it.

    Of course, this is all IIRC, and it has been a while since I have applied this in my studies.

    Now this is all based on older file systems (inode, UFS style, chained directory block style, etc). The newer filesystems (btree, reiser?) may not have this "problem" anymore.

    Update: Fixed spelling mistakes
Re: Efficient processing of large directory
by BrowserUk (Patriarch) on Oct 02, 2003 at 16:43 UTC

    Use while instead.

    while( my $file = <directory/*> ) { # do stuff }

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.

      Thanks for the tip - but why? (Most of all I want it to work - but I also want to understand)
        It has to do with how foreach (and for, which is an exact synonym) works. foreach constructs the entire list first, then iterates through it. This can be very memory-intensive, which slows processing (due to cache misses and virtual memory pressure). A nearly exact rewrite of foreach in terms of while would look something like:
        foreach my $n (<*.*>) {
            # do stuff
        }

        # ...is roughly equivalent to:

        my @list = <*.*>;
        my $i = 0;
        while ($i <= $#list) {
            my $n = $list[$i];
            # do stuff
        } continue {
            $i++;
        }

        ------
        We are the carpenters and bricklayers of the Information Age.

        The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

        Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

      I've tried it now with while ... and it timed out :-(

      Looks like I'd better try subdirectories too.

        You should use readdir instead. Also, if this is running from a CGI script (I guess that's what the timeout refers to), make sure to send the client a few bytes of data every now and then so it doesn't give up waiting.
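        A minimal sketch combining both ideas (the directory name and the once-per-500-files progress line are my own placeholders):

            use strict;
            use warnings;

            $| = 1;    # unbuffer STDOUT so output reaches the client right away
            print "Content-type: text/plain\n\n";

            opendir( my $dh, 'directory' ) or die "Cannot open directory: $!";
            my $count = 0;
            while ( defined( my $name = readdir $dh ) ) {
                next unless $name =~ /\.txt$/;
                # ... do stuff with "directory/$name" ...
                print "processed $count files so far\n" if ++$count % 500 == 0;
            }
            closedir $dh;
            print "done: $count files\n";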

        Makeshifts last the longest.

Re: Efficient processing of large directory
by jdtoronto (Prior) on Oct 02, 2003 at 17:28 UTC
    I understand your problem; I routinely have directories of around 45,000 text files.

    The problem with timeouts can also be Apache itself. I am not sure of the exact mechanism, but when I process large directories I send some output regularly to 'keep it awake', and thus the CGI process does not time out. That way I have single CGI scripts that sometimes run for 12 or 13 hours and work nicely.

    The other alternative is to have the CGI script launch a separate script that does all the work; that way the timeout is no longer an issue and the script can do its work quietly in the background.
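    A sketch of that approach, assuming a hypothetical worker script called process_files.pl (details vary with the server setup):

        use strict;
        use warnings;

        print "Content-type: text/html\n\n";
        print "Processing has started; check back later for the results.\n";

        defined( my $pid = fork() ) or die "fork failed: $!";
        if ( $pid == 0 ) {
            # child: detach from the web server so Apache does not wait on it
            close STDIN;
            close STDOUT;
            close STDERR;
            exec 'perl', 'process_files.pl';    # hypothetical long-running worker
            die "exec failed: $!";
        }
        # parent falls through and exits normally, so there is no CGI timeout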

    Hope that helps...

    jdtoronto

Re: Efficient processing of large directory
by liz (Monsignor) on Oct 02, 2003 at 18:50 UTC
    ... Or is there a much better way to do this?...

    If you have the possibility of reformatting a partition, have a look at ReiserFS. It will do the "hashing" of filenames for you and be much more efficient in handling disk space.

    Liz

Re: Efficient processing of large directory
by Elliott (Pilgrim) on Oct 03, 2003 at 13:36 UTC
    Thank you all for your advice. My reading of your answers is that two approaches would help:
    • Reorganise into subdirectories as I originally thought
    • Use while instead of foreach
    I have already converted to using while but I haven't had the chance to check for improvement yet. Can I have opinions on whether using both solutions together is worthwhile (pun not intended!)?

    BTW, the file names are email addresses (opt-in list, no spam here I promise!!) with \W characters removed. I was planning to pick 2nd and 4th chars to name the subdirectories in order to avoid grubbiness. Any thoughts on that?
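    For what it's worth, a short sketch of that naming scheme as I understand it (the sample address and the .txt extension are placeholders):

        my $email = 'someone@example.com';
        (my $name = $email) =~ s/\W//g;         # 'someoneexamplecom'
        my $subdir = substr($name, 1, 1)
                   . substr($name, 3, 1);       # 2nd and 4th chars: 'oe'
        my $path = "directory/$subdir/$name.txt";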

      Switch to a dbm such as DB_File instead of lots of small files. Particularly if you use a BTree, you will get much better-organized use of the disk.

      But do keep text backups in case an upgrade breaks DB_File.
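      A minimal sketch of the DB_File/BTree idea, with one record per address replacing one file per address (the database name and key layout are my own placeholders):

          use strict;
          use warnings;
          use DB_File;
          use Fcntl qw(O_RDWR O_CREAT);

          tie my %records, 'DB_File', 'optin.db', O_RDWR|O_CREAT, 0644, $DB_BTREE
              or die "Cannot open optin.db: $!";

          my $text = "contents that used to live in one small file";
          $records{'someoneexamplecom'} = $text;          # store

          while ( my ($addr, $body) = each %records ) {   # iterate in key order
              # do stuff
          }

          untie %records;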
