Re^3: Managing a directory with millions of files

Re^4: Managing a directory with millions of files by mpeg4codec (Pilgrim) on Jan 29, 2008 at 18:06 UTC
It may have been just a hypothesis with anecdotal evidence, but now I have some real data to back it up :). First, I'm not saying the overhead of calling perl/opendir is less than ls + sort. I'm saying that the overhead of allocating tons of memory (done by ls, but not Perl) is what makes the difference. Each tool should take approximately O(n) time to count the files. The hypothesis I tested is that Perl, which I believed to be O(1) in the amount of memory allocated, would therefore be faster. The data at least confirms that Perl is faster.

I created three directories with 1e4, 5e4, and 1e5 files respectively. I then ran `time ls | wc -l` and opendir_count.pl (code below) six times on each, throwing away the time of the first run. I figured that would allow the data from disk to be cached more consistently for each run. The means of the remaining runs, and a convenient graph thereof, are below:

    Files | ls (s) | perl (s)
    1e4   | 0.187  | 0.109
    5e4   | 0.574  | 0.204
    1e5   | 1.217  | 0.300

    graph: http://lacklustre.net/images/perl_vs_ls.png

Both tools scale roughly linearly with the number of files, but ls has a much higher constant factor than Perl in this experiment.

To test memory usage, I used the indispensable valgrind software. It has a built-in tool called massif that will graph the memory usage of a program over time. The results were as I suspected: perl uses just over 250k regardless of the number of files, while ls took over 1.5 MB for 1e4 files, over 7 MB for 5e4, and almost 14 MB for 1e5. I have compiled the graphs produced by massif, as well.

I believe my hypothesis was at least somewhat confirmed: Perl is faster than ls as the number of files increases. In this case, testing on relatively small numbers of files (<= 100,000) was enough to demonstrate the fact. For more anecdotal evidence: I have seen ls cause the machine to thrash on millions of files, while Perl held up just fine (although it took a little while to count/remove/etc. the files).
The only way to truly know where the code spends most of its time is to run it through a profiler. Unfortunately, I don't have debugging symbols in my coreutils binaries, so I can't do that on this system. If you're still unconvinced, I'll leave that to you to test.

Here is the code I used for opendir_count.pl:

    #!/usr/bin/perl -w
    use strict;

    my $path = shift || die "usage: $0 DIR\n";
    my $count = 0;

    opendir my $dir, $path or die "Cannot open $path: $!\n";
    # defined() so that a file named "0" does not end the loop early
    while (defined(my $entry = readdir $dir)) {
        ++$count;    # counts every entry, including . and ..
    }
    closedir $dir;   # directory handles need closedir, not close

    print "$count\n";
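For comparison, here is a minimal sketch of the two ways readdir can be used. The sub names `count_streaming` and `count_slurped` are made up for illustration and are not part of the original script. In scalar context readdir hands back one entry per call, so memory stays flat, which is the behaviour the massif numbers above point to. In list context it returns every remaining entry at once, which is closer in spirit to a tool that has to hold the whole listing before it can sort it. Unlike the script above, both subs skip `.` and `..`, so their counts will be lower by two.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scalar-context readdir: one entry in memory at a time.
sub count_streaming {
    my ($path) = @_;
    opendir my $dh, $path or die "Cannot open $path: $!\n";
    my $count = 0;
    while (defined(my $entry = readdir $dh)) {
        next if $entry eq '.' or $entry eq '..';   # skip self and parent
        ++$count;
    }
    closedir $dh;
    return $count;
}

# List-context readdir: the whole listing is built in memory first.
sub count_slurped {
    my ($path) = @_;
    opendir my $dh, $path or die "Cannot open $path: $!\n";
    my @entries = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;
    return scalar @entries;
}

# Only runs when a directory is given on the command line.
if (@ARGV) {
    my $path = shift @ARGV;
    print count_streaming($path), "\n";
}
```

Running either sub under massif on the same directory should show whether the memory difference comes from holding the listing, though I have not measured that here.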
Re^5: Managing a directory with millions of files by KurtSchwind (Chaplain) on Jan 29, 2008 at 18:31 UTC
Nice job. I like the analysis done. You beat me to it. I also like the graph. Nicely done. -- I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.
Re^5: Managing a directory with millions of files by mr_mischief (Monsignor) on Jan 30, 2008 at 21:11 UTC
GNU ls has an option to disable sorting; IIRC, it's -U. Still, with Perl you have one process with parsing and compilation overhead vs. the pipe to a second process and the same linear count of files. With wc, the count is done over text output. With Perl, you're just incrementing on anything that's defined, with no string comparisons. I'm curious how that changes things.
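One way to check this would be a small harness along the following lines. It is only a sketch: the sub names are hypothetical, the six runs with the first one discarded mirror the method described earlier in the thread, and `-U` assumes GNU ls. No timings are claimed here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes ();

# Run a code ref $runs times; return the wall-clock seconds of each run.
sub time_runs {
    my ($code, $runs) = @_;
    my @elapsed;
    for (1 .. $runs) {
        my $start = Time::HiRes::time();
        $code->();
        push @elapsed, Time::HiRes::time() - $start;
    }
    return @elapsed;
}

# Mean of all runs except the first, which is treated as a cache warm-up.
sub mean_after_warmup {
    my @t = @_;
    shift @t;
    die "need at least two runs\n" unless @t;
    my $sum = 0;
    $sum += $_ for @t;
    return $sum / @t;
}

# Same counting loop as opendir_count.pl (includes . and ..).
sub count_readdir {
    my ($path) = @_;
    opendir my $dh, $path or die "Cannot open $path: $!\n";
    my $count = 0;
    ++$count while defined(readdir($dh));
    closedir $dh;
    return $count;
}

# Only runs when a directory is given on the command line.
if (@ARGV) {
    my $dir = shift @ARGV;
    # "$1" is expanded by sh, not Perl; the directory is passed as an argument.
    my %candidates = (
        'ls | wc -l'    => sub { system('sh', '-c', 'ls "$1" | wc -l >/dev/null', 'sh', $dir) },
        'ls -U | wc -l' => sub { system('sh', '-c', 'ls -U "$1" | wc -l >/dev/null', 'sh', $dir) },
        'perl readdir'  => sub { count_readdir($dir) },
    );
    for my $name (sort keys %candidates) {
        printf "%-15s %.3f s\n", $name,
            mean_after_warmup(time_runs($candidates{$name}, 6));
    }
}
```

Note that the readdir candidate runs in-process, so it leaves out perl's startup and compilation cost. To include that overhead, shell out to the script itself instead of calling `count_readdir` directly.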

