PerlMonks  

Re^2: Managing a directory with millions of files

by mpeg4codec (Pilgrim)
on Jan 29, 2008 at 05:45 UTC (#664849)


in reply to Re: Managing a directory with millions of files
in thread Managing a directory with millions of files

The trouble with your approach is in the title: millions of files. On most Linux systems I've used, ls will allocate some amount of memory for each file (presumably so it can sort the listing). Perl could be used as a lightweight wrapper for the opendir library function.

However, as ruzam pointed out, the first script isn't much better than ls in that regard. His approach uses O(1) memory regardless of the number of files, and I've personally used a variation of it in situations similar to jsiren's.
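
To make the memory distinction concrete, here is a minimal sketch (not ruzam's actual script, which isn't reproduced in this node). It contrasts list-context readdir, which builds an array of every name, with a scalar-context loop that holds one name at a time. The sub names count_slurp and count_stream are hypothetical, made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: list-context readdir returns every name at once,
# so the array (and memory use) grows with the number of entries.
sub count_slurp {
    my ($path) = @_;
    opendir my $dh, $path or die "cannot open $path: $!";
    my @names = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;
    return scalar @names;
}

# Hypothetical helper: scalar-context readdir returns one name per call,
# so the script only ever holds the current entry.
sub count_stream {
    my ($path) = @_;
    my $count = 0;
    opendir my $dh, $path or die "cannot open $path: $!";
    while (defined(my $name = readdir $dh)) {
        next if $name eq '.' or $name eq '..';
        ++$count;    # swap in unlink, rename, etc. as needed
    }
    closedir $dh;
    return $count;
}

my $path = shift // '.';
print count_stream($path), "\n";
```

Both subs return the same number; only the second keeps the script's own memory use flat as the directory grows.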


Re^3: Managing a directory with millions of files
by KurtSchwind (Hermit) on Jan 29, 2008 at 13:21 UTC

    Hrm. That's an interesting hypothesis.

    Essentially you are saying that the overhead of calling perl and the opendir library is less than the overhead of ls+sort for a large number of files.

    I'm not sure I'm buying that. I'm not sure how to test the memory usage on it though. Give me a day or two and I might just benchmark the memory usage between the two techniques. You could be right, but my gut says no. The inode table is pretty efficient at this kind of thing.

    --
    I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.
      It may have been just a hypothesis with anecdotal evidence, but now I have some real data to back it up :).

      First, I'm not saying the overhead of calling perl/opendir is less than ls + sort. I'm saying that the overhead of allocating tons of memory (done by ls, but not Perl) is what makes the difference. Each tool should take approximately O(n) time to count the files. The hypothesis I tested is that Perl, which I believed to allocate O(1) memory regardless of the number of files, would therefore be faster. The data at least confirms that Perl is faster.

      I created three directories with 1e4, 5e4, and 1e5 files respectively. I then ran time ls | wc -l and opendir_count.pl (code below) six times on each, throwing away the time of the first run. I figured that would allow the data from disk to be cached more consistently for each run. The mean of the results, and a convenient graph thereof, are below:

      Files | ls (s) | perl (s)
      1e4   | 0.187  | 0.109
      5e4   | 0.574  | 0.204
      1e5   | 1.217  | 0.300

      graph: http://lacklustre.net/images/perl_vs_ls.png

      While both tools scale linearly with the number of files, ls has a much higher constant factor (cost per file) than Perl in this experiment.
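
      As a rough check on that constant, the sketch below estimates the marginal cost per file as the slope between the smallest and largest directories in the table above. It is only arithmetic on the posted means, not a proper regression, and per_file_cost is a hypothetical helper name.

      ```perl
      use strict;
      use warnings;

      # Mean timings from the table in the post: [files, ls seconds, perl seconds].
      my @rows = (
          [ 1e4, 0.187, 0.109 ],
          [ 5e4, 0.574, 0.204 ],
          [ 1e5, 1.217, 0.300 ],
      );

      # Hypothetical helper: slope between the first and last rows for the
      # given column, i.e. seconds per additional file.
      sub per_file_cost {
          my ($col) = @_;
          return ( $rows[-1][$col] - $rows[0][$col] )
               / ( $rows[-1][0]    - $rows[0][0] );
      }

      printf "ls:   %.2f us per file\n", per_file_cost(1) * 1e6;
      printf "perl: %.2f us per file\n", per_file_cost(2) * 1e6;
      ```

      On these numbers, ls comes out at about 11 microseconds per additional file and Perl at about 2.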

      To test memory usage, I used the indispensable valgrind software. It has a built-in tool called massif that will graph the memory usage of a program over time. The results were as I suspected: perl uses just over 250k regardless of the number of files, while ls took over 1.5 MB for 1e4 files, over 7 MB for 5e4, and almost 14 MB for 1e5. I have compiled the graphs produced by massif, as well.

      I believe my hypothesis was at least somewhat confirmed: Perl is faster than ls, and the gap widens as the number of files increases. In this case, testing on relatively small numbers of files (<= 100,000) was enough to demonstrate the fact. For more anecdotal evidence: I have seen ls cause the machine to thrash on millions of files, while Perl held up just fine (although it took a little while to count/remove/etc. the files).

      The only way to truly know where the majority of the time is spent by the code is to run it through a profiler. Unfortunately, I don't have debugging symbols in my coreutils binaries, so I can't do that on this system. If you're still unconvinced, I'll leave that to you to test.

      Here is the code I used for opendir_count.pl:

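      #!/usr/bin/perl -w
      use strict;

      # Directory to count, passed as the first argument.
      my $path = shift || die "usage: $0 <directory>\n";
      my $count = 0;

      opendir my $dir, $path or die "cannot open $path: $!";
      # Test definedness so that an entry named "0" does not end the loop early.
      # Note that the count includes "." and "..".
      while (defined(my $entry = readdir $dir)) {
          ++$count;
      }
      closedir $dir;    # directory handles are closed with closedir, not close

      print "$count\n";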
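
      For anyone who wants to repeat the experiment, here is a sketch of how a test directory could be populated and the count timed from inside Perl. The measurements above used the shell's time, so the Time::HiRes timing here is a substitute, not the method behind the posted numbers. make_files and count_entries are hypothetical names, and the default of 1,000 files is an assumption chosen to keep the run short.

      ```perl
      use strict;
      use warnings;
      use File::Temp qw(tempdir);
      use List::Util qw(sum);
      use Time::HiRes qw(gettimeofday tv_interval);

      # Hypothetical helper: create $n empty files in a fresh temporary
      # directory, which is removed when the program exits.
      sub make_files {
          my ($n) = @_;
          my $dir = tempdir(CLEANUP => 1);
          for my $i (1 .. $n) {
              open my $fh, '>', "$dir/file_$i" or die "cannot create file: $!";
              close $fh;
          }
          return $dir;
      }

      # Hypothetical helper: count entries the way opendir_count.pl does
      # (the total includes "." and "..").
      sub count_entries {
          my ($path) = @_;
          my $count = 0;
          opendir my $dh, $path or die "cannot open $path: $!";
          while (defined(my $entry = readdir $dh)) {
              ++$count;
          }
          closedir $dh;
          return $count;
      }

      my $n   = shift // 1_000;    # assumed default; the post used 1e4, 5e4 and 1e5
      my $dir = make_files($n);

      # Six runs, discarding the first, as described in the post.
      my @times;
      for my $run (1 .. 6) {
          my $t0 = [gettimeofday];
          count_entries($dir);
          push @times, tv_interval($t0);
      }
      shift @times;
      my $mean = sum(@times) / @times;
      printf "%d files: mean %.4f s over %d runs\n", $n, $mean, scalar @times;
      ```

      Timing inside the process leaves out perl's startup and compilation cost, which the shell's time would include, so the figures are not directly comparable with the table.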

        Nice job. I like the analysis done. You beat me to it.

        I also like the graph. Nicely done.

        --
        I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.
        GNU ls has an option to disable sorting. IIRC, it's -U to do that.

        Still, with Perl you have one process with parsing and compilation overhead vs. the pipe to a second process and the same linear count of files. With wc, the count is as text. With Perl, you're just incrementing on anything that's defined with no string comparisons. I'm curious how that changes things.
