Re: Managing a directory with millions of files

by KurtSchwind (Chaplain)
on Jan 28, 2008 at 14:22 UTC


in reply to Managing a directory with millions of files

I can see some real usefulness in the 2nd script, but not so much in the first.

I don't think Perl is the right tool just to get a file count in a dir. A shell one-liner like

ls /dir/path | wc -l

will accomplish that, and I also use find a lot in those situations. For the 2nd application you talk about, though, you get a far more robust regex handler than find can manage by itself, so that's nice.
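
Just for illustration, here's a rough sketch of the kind of regex-driven counting I have in mind; the directory and pattern arguments are placeholders, not anything from jsiren's actual scripts:

#!/usr/bin/perl
use strict;
use warnings;

# Count directory entries whose names match an arbitrary Perl regex,
# reading one entry at a time instead of building a full listing.
my ($path, $pattern) = @ARGV;
die "usage: $0 <dir> <regex>\n" unless defined $path and defined $pattern;

my $re = qr/$pattern/;
opendir my $dh, $path or die "opendir $path: $!";
my $count = 0;
while (defined(my $entry = readdir $dh)) {
    next if $entry eq '.' or $entry eq '..';
    ++$count if $entry =~ $re;
}
closedir $dh;
print "$count entries match /$pattern/\n";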

--
I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.

Re^2: Managing a directory with millions of files
by mpeg4codec (Pilgrim) on Jan 29, 2008 at 05:45 UTC
    The trouble with your approach is in the title: millions of files. On most Linux systems I've used, ls will allocate some amount of memory for each file (presumably so it can sort the listing). Perl could be used as a lightweight wrapper for the opendir library function.

    However, as ruzam pointed out, the first script isn't much better than ls in that regard. His approach uses O(1) memory in the number of files, and I've personally used a variation of it in situations similar to jsiren's.

      Hrm. That's an interesting hypothesis.

      Essentially you're saying that the overhead of invoking Perl and the opendir library is less than the overhead of ls reading and sorting the whole listing, for a large number of files.

      I'm not sure I'm buying that, and I'm not sure how to test the memory usage either. Give me a day or two and I might benchmark the memory usage of the two techniques. You could be right, but my gut says no; the inode table is pretty efficient at this kind of thing.

      --
      I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.
        It may have been just a hypothesis with anecdotal evidence, but now I have some real data to back it up :).

        First, I'm not saying the overhead of calling perl/opendir is less than that of ls + sort. I'm saying that the overhead of allocating tons of memory (which ls does, but Perl doesn't) is what makes the difference. Each tool should take approximately O(n) time to count the files. The hypothesis I tested is that, since I believed Perl to be O(1) in the amount of memory allocated, it would be faster. The data at least confirms that Perl is faster.

        I created three directories with 1e4, 5e4, and 1e5 files respectively. I then ran time ls | wc -l and opendir_count.pl (code below) six times on each, throwing away the time of the first run. I figured that would allow the data from disk to be cached more consistently for each run. The mean of the results, and a convenient graph thereof, are below:

        Files | ls (s) | perl (s)
        1e4   | 0.187  | 0.109
        5e4   | 0.574  | 0.204
        1e5   | 1.217  | 0.300

        graph: http://lacklustre.net/images/perl_vs_ls.png
        While both show an obvious linear relationship in the number of files, ls has a much higher constant factor than Perl in this experiment.
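
        If you want to reproduce the timing side, here's a rough sketch of a driver along the lines of what I described; it isn't the exact harness I used, and it times the Perl count in-process rather than spawning a fresh perl, so it slightly understates Perl's startup cost:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Time::HiRes qw(gettimeofday tv_interval);

        # Runs `ls | wc -l` and an in-process opendir/readdir count several
        # times, discarding the first (cold-cache) run and reporting the mean
        # of the rest.
        my $dir  = shift || die "usage: $0 <dir>\n";
        my $runs = 6;    # as above: 6 runs, the first one thrown away

        sub timed {
            my ($code) = @_;
            my $t0 = [gettimeofday];
            $code->();
            return tv_interval($t0);
        }

        my (@ls, @perl);
        for my $run (1 .. $runs) {
            my $t_ls = timed(sub { system("ls $dir | wc -l > /dev/null") });
            my $t_pl = timed(sub {
                opendir my $dh, $dir or die "opendir $dir: $!";
                1 while defined readdir $dh;
                closedir $dh;
            });
            next if $run == 1;
            push @ls,   $t_ls;
            push @perl, $t_pl;
        }

        my $mean = sub { my $s = 0; $s += $_ for @_; $s / @_ };
        printf "ls: %.3f s   perl: %.3f s (mean of %d warm runs each)\n",
            $mean->(@ls), $mean->(@perl), scalar @ls;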

        To test memory usage, I used the indispensable valgrind software. It has a built-in tool called massif that graphs a program's memory usage over time. The results were as I suspected: perl uses just over 250k regardless of the number of files, while ls took over 1.5 MB for 1e4 files, over 7 MB for 5e4, and almost 14 MB for 1e5. I have compiled the graphs produced by massif as well.
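
        If you don't have valgrind handy, a quick-and-dirty sanity check for the Perl side only is to have the counting script report its own memory from /proc once it's done. This is a minimal, Linux-specific sketch; the VmRSS/VmHWM fields are what my /proc/self/status exposes, so adjust as needed:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Count directory entries, then print this process's current and peak
        # resident memory from /proc/self/status (Linux-specific fields).
        my $path = shift || die "usage: $0 <dir>\n";

        opendir my $dh, $path or die "opendir $path: $!";
        my $count = 0;
        $count++ while defined readdir $dh;
        closedir $dh;
        print "$count entries\n";

        open my $status, '<', '/proc/self/status' or die "/proc/self/status: $!";
        while (<$status>) {
            print if /^Vm(RSS|HWM):/;    # resident set size and its high-water mark
        }
        close $status;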

        I believe my hypothesis was at least somewhat confirmed: Perl is faster than ls as the number of files increases, and testing on relatively small numbers of files (<= 100,000) was enough to demonstrate it. For more anecdotal evidence: I have seen ls make a machine thrash on millions of files, while Perl held up just fine (although it took a little while to count/remove/etc. the files).

        The only way to truly know where the majority of the time is spent by the code is to run it through a profiler. Unfortunately, I don't have debugging symbols in my coreutils binaries, so I can't do that on this system. If you're still unconvinced, I'll leave that to you to test.

        Here is the code I used for opendir_count.pl:

        #!/usr/bin/perl -w
        use strict;

        # Count directory entries one at a time without building a list in
        # memory. Note that readdir also returns '.' and '..', so the count
        # is two higher than a plain `ls` would report.
        my $path  = shift || die "usage: $0 <dir>\n";
        my $count = 0;

        opendir my $dir, $path or die "opendir $path: $!";
        while (defined(my $entry = readdir $dir)) {    # defined() guards against a file named "0"
            ++$count;
        }
        closedir $dir;    # closedir, not close, for a directory handle

        print "$count\n";
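
        It takes the directory as its only argument, e.g. perl opendir_count.pl /some/dir, and unlike ls it never holds more than one entry name in memory at a time.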
