Re: Managing a directory with millions of files

by KurtSchwind (Chaplain)
on Jan 28, 2008 at 14:22 UTC


in reply to Managing a directory with millions of files

I can see some real usefulness in the 2nd script, but not so much in the first.

I don't think Perl is the right tool just to get a file count in a dir. A shell one-liner like

ls /dir/path | wc -l

will accomplish that, and I also use find a lot in those situations. For the 2nd application you talk about, though, you get a far more robust regex handler than find can manage by itself, so that's nice.
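
Just for illustration, here's a rough sketch of the kind of regex-driven counting I have in mind; the directory and pattern arguments are placeholders, not anything from jsiren's actual scripts:

#!/usr/bin/perl
use strict;
use warnings;

# Count directory entries whose names match an arbitrary Perl regex,
# reading one entry at a time instead of building a full listing.
my ($path, $pattern) = @ARGV;
die "usage: $0 <dir> <regex>\n" unless defined $path and defined $pattern;

my $re = qr/$pattern/;
opendir my $dh, $path or die "opendir $path: $!";
my $count = 0;
while (defined(my $entry = readdir $dh)) {
    next if $entry eq '.' or $entry eq '..';
    ++$count if $entry =~ $re;
}
closedir $dh;
print "$count entries match /$pattern/\n";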

--
I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.

Re^2: Managing a directory with millions of files
by mpeg4codec (Pilgrim) on Jan 29, 2008 at 05:45 UTC
    The trouble with your approach is in the title: millions of files. On most Linux systems I've used, ls will allocate some amount of memory for each file (presumably so it can sort the listing). Perl could be used as a lightweight wrapper for the opendir library function.

    However, as ruzam pointed out, the first script isn't much better than ls in that regard. His approach uses O(1) memory in the number of files, and I've personally used a variation of it in situations similar to jsiren's.

      Hrm. That's an interesting hypothesis.

      Essentially you're saying that the overhead of invoking Perl and the opendir library is less than the overhead of ls reading and sorting the whole listing, for a large number of files.

      I'm not sure I'm buying that, and I'm not sure how to test the memory usage either. Give me a day or two and I might benchmark the memory usage of the two techniques. You could be right, but my gut says no; the inode table is pretty efficient at this kind of thing.

      --
      I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.
        It may have been just a hypothesis with anecdotal evidence, but now I have some real data to back it up :).

        First, I'm not saying the overhead of calling perl/opendir is less than that of ls + sort. I'm saying that the overhead of allocating tons of memory (which ls does, but Perl doesn't) is what makes the difference. Each tool should take approximately O(n) time to count the files. The hypothesis I tested is that, since I believed Perl to be O(1) in the amount of memory allocated, it would be faster. The data at least confirms that Perl is faster.

        I created three directories with 1e4, 5e4, and 1e5 files respectively. I then ran time ls | wc -l and opendir_count.pl (code below) six times on each, throwing away the time of the first run. I figured that would allow the data from disk to be cached more consistently for each run. The mean of the results, and a convenient graph thereof, are below:

        Files | ls (s) | perl (s)
        1e4   | 0.187  | 0.109
        5e4   | 0.574  | 0.204
        1e5   | 1.217  | 0.300

        graph: http://lacklustre.net/images/perl_vs_ls.png
        While both show an obvious linear relationship in the number of files, ls has a much higher constant factor than Perl in this experiment.
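
        If you want to reproduce the timing side, here's a rough sketch of a driver along the lines of what I described; it isn't the exact harness I used, and it times the Perl count in-process rather than spawning a fresh perl, so it slightly understates Perl's startup cost:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Time::HiRes qw(gettimeofday tv_interval);

        # Runs `ls | wc -l` and an in-process opendir/readdir count several
        # times, discarding the first (cold-cache) run and reporting the mean
        # of the rest.
        my $dir  = shift || die "usage: $0 <dir>\n";
        my $runs = 6;    # as above: 6 runs, the first one thrown away

        sub timed {
            my ($code) = @_;
            my $t0 = [gettimeofday];
            $code->();
            return tv_interval($t0);
        }

        my (@ls, @perl);
        for my $run (1 .. $runs) {
            my $t_ls = timed(sub { system("ls $dir | wc -l > /dev/null") });
            my $t_pl = timed(sub {
                opendir my $dh, $dir or die "opendir $dir: $!";
                1 while defined readdir $dh;
                closedir $dh;
            });
            next if $run == 1;
            push @ls,   $t_ls;
            push @perl, $t_pl;
        }

        my $mean = sub { my $s = 0; $s += $_ for @_; $s / @_ };
        printf "ls: %.3f s   perl: %.3f s (mean of %d warm runs each)\n",
            $mean->(@ls), $mean->(@perl), scalar @ls;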

        To test memory usage, I used the indispensable valgrind software. It has a built-in tool called massif that graphs a program's memory usage over time. The results were as I suspected: perl uses just over 250k regardless of the number of files, while ls took over 1.5 MB for 1e4 files, over 7 MB for 5e4, and almost 14 MB for 1e5. I have compiled the graphs produced by massif as well.
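
        If you don't have valgrind handy, a quick-and-dirty sanity check for the Perl side only is to have the counting script report its own memory from /proc once it's done. This is a minimal, Linux-specific sketch; the VmRSS/VmHWM fields are what my /proc/self/status exposes, so adjust as needed:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Count directory entries, then print this process's current and peak
        # resident memory from /proc/self/status (Linux-specific fields).
        my $path = shift || die "usage: $0 <dir>\n";

        opendir my $dh, $path or die "opendir $path: $!";
        my $count = 0;
        $count++ while defined readdir $dh;
        closedir $dh;
        print "$count entries\n";

        open my $status, '<', '/proc/self/status' or die "/proc/self/status: $!";
        while (<$status>) {
            print if /^Vm(RSS|HWM):/;    # resident set size and its high-water mark
        }
        close $status;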

        I believe my hypothesis was at least somewhat confirmed: Perl is faster than ls as the number of files increases, and testing on relatively small numbers of files (<= 100,000) was enough to demonstrate it. For more anecdotal evidence: I have seen ls make a machine thrash on millions of files, while Perl held up just fine (although it took a little while to count/remove/etc. the files).

        The only way to truly know where the majority of the time is spent by the code is to run it through a profiler. Unfortunately, I don't have debugging symbols in my coreutils binaries, so I can't do that on this system. If you're still unconvinced, I'll leave that to you to test.

        Here is the code I used for opendir_count.pl:

        #!/usr/bin/perl -w
        use strict;

        # Count directory entries one at a time without building a list in
        # memory. Note that readdir also returns '.' and '..', so the count
        # is two higher than a plain `ls` would report.
        my $path  = shift || die "usage: $0 <dir>\n";
        my $count = 0;

        opendir my $dir, $path or die "opendir $path: $!";
        while (defined(my $entry = readdir $dir)) {    # defined() guards against a file named "0"
            ++$count;
        }
        closedir $dir;    # closedir, not close, for a directory handle

        print "$count\n";
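
        It takes the directory as its only argument, e.g. perl opendir_count.pl /some/dir, and unlike ls it never holds more than one entry name in memory at a time.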
