Re: Efficient processing of large directory

It worked fine up to its planned limit of around 3000 files, but it's been too successful and the client now has 17,000 files in there!

A side note first: Some operating systems are really, really bad with directories that large. If your client is interested in performance, they (or you, on their behalf) may want to do a bit of performance prototyping. Recent work on FreeBSD has greatly improved its large directory performance, for whatever that's worth.

For your problem, I see two options. The first is to use File::Find to locate all *.txt files and process them one by one. The second is to use opendir()/readdir()/closedir() to read the directory directly, filename by filename. Either one saves you from having to hold a large temporary array of filenames in memory.
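As a rough illustration of the second option, here is a minimal sketch of a readdir loop that handles one filename at a time. The helper name process_txt_files and the callback interface are made up for this example, not something from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper (name and interface are assumptions for illustration):
# calls $callback on every *.txt file in $dir, reading one directory
# entry at a time instead of building a list of all names first.
sub process_txt_files {
    my ($dir, $callback) = @_;

    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";

    my $count = 0;
    # defined() guards against a file literally named "0"
    while (defined(my $name = readdir $dh)) {
        next unless $name =~ /\.txt\z/i;   # keep only *.txt names
        my $path = "$dir/$name";
        next unless -f $path;              # skip directories and the like
        $callback->($path);
        $count++;
    }

    closedir $dh;
    return $count;                         # number of files processed
}
```

You would call it along the lines of process_txt_files('/some/dir', sub { print "$_[0]\n" }), with the real per-file work in the callback. Only one filename is held at a time, however many files the directory contains.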

You can find plenty of examples of each by using Super Search to look for "File::Find" or "opendir".



Re: Re: Efficient processing of large directory
by BrowserUk (Patriarch) on Oct 03, 2003 at 00:10 UTC

It's worth noting that if you're trying to find a subset of the files contained in a subdir, rather than processing them all, then using <*.txt> is considerably faster than using either File::Find or opendir/readdir/closedir. At least that is the case under Win32, as the wildcard matching is done by the OS and only those files matching are passed back.
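For reference, a minimal sketch of selecting a subset with glob (the function form of the <*.txt> operator). The helper name matching_files is hypothetical, and whether this beats a readdir loop will depend on the platform and Perl build, so treat it as something to measure rather than a guarantee.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption for illustration):
# returns the plain files in $dir whose names match a shell-style
# wildcard pattern such as 'a*.txt'.
sub matching_files {
    my ($dir, $pattern) = @_;

    # Caveat: the core glob splits its argument on whitespace, so this
    # sketch assumes $dir and $pattern contain no spaces.
    return grep { -f } glob("$dir/$pattern");
}
```

Note that in list context glob still returns every match at once, so this helps when the pattern selects a small subset, not when it matches nearly everything.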

In the examples below, the first comparison shows selecting all 17576 files in a subdirectory. In this case, glob and File::Find come out pretty much even.

In the second comparison, a subset of 676 files is selected from the 17000 using a wildcard. In this case, the glob runs 650% faster as it is only processing the 676, rather than looping over the whole 17000+.

Of course, if any real processing were being done rather than just counting the files, the difference would rapidly disappear.

In this case, the OP's use of the word "efficient" most likely referred to the memory used by slurping all 17000 names into memory rather than to speed, but if not all of those 17000 files are .txt files, the time saved might be worth having.


good chemistry is complicated, and a little bit messy -LW