|Just another Perl shrine|
Re^4: Managing a directory with millions of filesby mpeg4codec (Pilgrim)
|on Jan 29, 2008 at 18:06 UTC||Need Help??|
It may have been just a hypothesis with anecdotal evidence, but now I have some real data to back it up :).
First, I'm not saying the overhead of calling perl/opendir is less than ls + sort. I'm saying that the overhead of allocating tons of memory (done by ls, but not Perl) is what makes the difference. Each tool should be approximately O(n) to count the number of files. The hypothesis which I tested is that since I believed Perl to be O(1) in the amount of memory allocated, it would be faster. The data at least confirms that Perl is faster.
I created three directories with 1e4, 5e4, and 1e5 files respectively. I then ran time ls | wc -l and opendir_count.pl (code below) six times on each, throwing away the time of the first run. I figured that would allow the data from disk to be cached more consistently for each run. The mean of the results, and a convenient graph thereof, are below:
While there is an obvious linear relationship in the number of files, ls has a much higher constant than Perl in this experiment.
To test memory usage, I used the indispensible valgrind software. It has a built-in tool called massif that will graph the memory usage of a program over time. The results were as I suspected: perl uses just over 250k regardless of the number of files, while ls took over 1.5 MB for 1e4 files, over 7 MB for 5e4, and almost 14 MB for 1e5. I have compiled the graphs produced by massif, as well.
I believe the results of my hypothesis were at least somewhat confirmed: Perl is faster than ls as the number of files increases. In this case, testing on relatively small numbers of files (<= 100,000) was enough to demonstrate the fact. For more anecdotal evidence: I have seen ls cause the machine to thrash on millions of files, while Perl held up just fine (although it took a little while to count/remove/etc the files).
The only way to truly know where the majority of the time is spent by the code is to run it through a profiler. Unfortunately, I don't have debugging symbols in my coreutils binaries, so I can't do that on this system. If you're still unconvinced, I'll leave that to you to test.
Here is the code I used for opendir_count.pl: