PerlMonks  

RE: RE: Descending through directories

by Corion (Pope)
on May 30, 2000 at 19:54 UTC (#15425)


in reply to RE: Descending through directories
in thread Getting a List of Files Via Glob

I've never used rewinddir() :), and my post wasn't meant as an offence; sorry if I came across that way ...


RE: RE: RE: Descending through directories
by t0mas (Priest) on May 31, 2000 at 10:22 UTC
    No offence taken. I think it's a good thing to discuss and show different ways to solve the same problem, and I guess we all have our own toolkits of code snippets that we throw into every program we write.
    Maybe I'll try to benchmark some of the ways when I find some time.

    /brother t0mas
RE: RE: RE: Descending through directories
by t0mas (Priest) on Jun 02, 2000 at 00:20 UTC
    Did some benchmarking today. I really like knowing the most effective way to solve a certain problem, and when Corion posted his code above, I got curious. To Corion, I would like to say that this is no "I'm right - you're wrong" kind of thing. I've enjoyed your code for a long time (since I love and use eConsole), and I really didn't know which of the ways was most effective, so please don't take this the wrong way.
    If someone else has ideas about this, please give it a shot with your own code.
    I use directory traversing quite often, so I would really be glad to be able to use the most effective code in my programs.
    Here we go:
    use Benchmark;
    use File::Spec;
    use File::Find;

    $t0 = new Benchmark;
    &t1('C:\\Program');
    $t1 = new Benchmark;
    &t2('C:\\Program');
    $t2 = new Benchmark;
    &t3('C:\\Program');
    $t3 = new Benchmark;
    &t4('C:\\Program');
    $t4 = new Benchmark;
    &t5('C:\\Program');
    $t5 = new Benchmark;

    print "t1: ",timestr(timediff($t1, $t0)),"\n";
    print "t2: ",timestr(timediff($t2, $t1)),"\n";
    print "t3: ",timestr(timediff($t3, $t2)),"\n";
    print "t4: ",timestr(timediff($t4, $t3)),"\n";
    print "t5: ",timestr(timediff($t5, $t4)),"\n";

    # Opens a dirhandle to read files, another to read sub-dirs and
    # recursive calls itself foreach subdir it finds
    sub t1 {
        my $Dir = shift;
        opendir(DIR, $Dir) || die "Can't opendir $Dir: $!";
        my @Files = grep { /.txt/ && -f "$Dir/$_" } readdir(DIR);
        closedir DIR;
        opendir(DIR, $Dir) || die "Can't opendir $Dir: $!";
        my @Dirs = grep { /^[^.].*/ && -d "$Dir/$_" } readdir(DIR);
        closedir DIR;
        foreach $file (@Files) { print $Dir."-".$file."\n"; }
        foreach $SubDir (@Dirs) { &t1(join("\\",$Dir,$SubDir)); }
    };

    # Opens a dirhandle to read files, rewinds to read sub-dirs and
    # recursive calls itself foreach subdir it finds
    sub t2 {
        my $Dir = shift;
        opendir(DIR, $Dir) || die "Can't opendir $Dir: $!";
        my @Files = grep { /.txt/ && -f "$Dir/$_" } readdir(DIR);
        rewinddir(DIR);
        my @Dirs = grep { /^[^.].*/ && -d "$Dir/$_" } readdir(DIR);
        closedir DIR;
        foreach $file (@Files) { print $Dir."-".$file."\n"; }
        foreach $SubDir (@Dirs) { &t2(join("\\",$Dir,$SubDir)); }
    };

    # Opens a dirhandle to read all directory contents and
    # recursive calls itself foreach subdir it finds
    # Uses File::Spec, which makes it portable
    sub t3 {
        my ($Dir) = shift;
        my ($entry,@direntries,$fullpath);
        opendir( DIR, $Dir ) or die "Can't opendir $Dir: $!";
        @direntries = readdir( DIR ) or die "Error reading $Dir : $!\n";
        closedir DIR;
        foreach $entry (@direntries) {
            next if $entry =~ /^\.\.?$/;
            $fullpath = File::Spec->catfile( $Dir, $entry );
            if (-d $fullpath ) {
                &t3($fullpath);
            } elsif ( -f $fullpath && $entry =~ /.txt/) {
                print $Dir."-".$entry."\n";
            }
        }
    };

    # Opens a dirhandle to read all directory contents and
    # recursive calls itself foreach subdir it finds
    sub t4 {
        my ($Dir) = shift;
        my ($entry,@direntries,$fullpath);
        opendir( DIR, $Dir ) or die "Can't opendir $Dir: $!";
        @direntries = readdir( DIR ) or die "Error reading $Dir : $!\n";
        closedir DIR;
        foreach $entry (@direntries) {
            next if $entry =~ /^\.\.?$/;
            $fullpath = join("\\",$Dir,$entry);
            if (-d $fullpath ) {
                &t4($fullpath);
            } elsif ( -f $fullpath && $entry =~ /.txt/) {
                print $Dir."-".$entry."\n";
            }
        }
    };

    # Uses File::Find (whatever it does...)
    sub t5 {
        my ($Dir) = shift;
        find(\&found, $Dir);
    }

    sub found {
        /.txt/ && print $File::Find::dir."-".$_."\n";
    }
    This test was run on a Pentium 233 with 128 MB RAM, Windows 2000, FAT32 filesystem.
    C:\Program holds 13477 files in 1206 folders, of which 137 match *.txt.

    t1: 27 wallclock secs ( 8.40 usr + 16.76 sys = 25.17 CPU)
    t2: 24 wallclock secs ( 7.69 usr + 15.57 sys = 23.26 CPU)
    t3: 47 wallclock secs (20.30 usr + 23.85 sys = 44.15 CPU)
    t4: 36 wallclock secs (11.04 usr + 23.33 sys = 34.37 CPU)
    t5: 30 wallclock secs (11.12 usr + 18.02 sys = 29.13 CPU)


    /brother t0mas
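
For anyone who wants to repeat this kind of comparison without a C:\Program tree, here is a minimal sketch using Benchmark's timethese() on a small throwaway directory. The names walk_readdir and walk_find and the tiny test tree are assumptions made for illustration, not part of the code above, and the patterns are anchored, so the timings are not comparable with the figures in this thread.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);
use File::Find;
use File::Spec;
use File::Temp qw(tempdir);

# Build a small throwaway tree so the sketch runs anywhere (hypothetical data).
my $root = tempdir(CLEANUP => 1);
for my $d (qw(a b)) {
    my $sub = File::Spec->catdir($root, $d);
    mkdir $sub or die "Can't mkdir $sub: $!";
    for my $f (qw(one.txt two.html)) {
        my $path = File::Spec->catfile($sub, $f);
        open my $fh, '>', $path or die "Can't write $path: $!";
        close $fh;
    }
}

# Hand-rolled recursion: one readdir per directory, lexical dirhandle.
sub walk_readdir {
    my ($dir, $hits) = @_;
    opendir my $dh, $dir or die "Can't opendir $dir: $!";
    my @entries = grep { !/^\.\.?$/ } readdir $dh;
    closedir $dh;
    for my $entry (@entries) {
        my $path = File::Spec->catfile($dir, $entry);
        if (-d $path) {
            walk_readdir($path, $hits);
        } elsif (-f $path && $entry =~ /\.txt$/) {
            push @$hits, $path;
        }
    }
    return $hits;
}

# The same search expressed with File::Find.
sub walk_find {
    my ($dir) = @_;
    my @hits;
    find(sub { push @hits, $File::Find::name if -f $_ && /\.txt$/ }, $dir);
    return \@hits;
}

# A negative count asks Benchmark to run each entry for at least that
# many CPU seconds instead of a fixed number of iterations.
timethese(-1, {
    readdir => sub { walk_readdir($root, []) },
    find    => sub { walk_find($root) },
});
```

Both walkers collect their hits instead of printing them, so console output does not get mixed into the timings.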
      Hello t0mas !

      It always amazes me where I find users of eConsole - I would never have thought to find one on PerlMonks :) !

      Thanks for doing these tests - I didn't even know there was a Benchmark module ! What amazes me is that the method of reading a directory twice (as done in t1 and t2) is faster than reading it once and checking for file/directory afterwards - you never stop learning, I guess ... I will run these tests on my machine (a lowly P-100 running NT 4) and maybe on a Linux machine as well to get a more complete view of the behaviour :)

        Hi Corion.

        The same thing amazes me. I guess that doing a regexp on all rows at once is faster than doing it on every $entry. I don't know how Perl handles this stuff internally.
        Maybe it recompiles the regexp every time it uses it or something.

        Please do run the test. I would like to see if the results you get are along the same lines as the ones I got.

        And about eConsole I would like to say - Transparency Rules...

        /brother t0mas
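
On the recompilation guess: as far as I know, Perl compiles a pattern that contains no interpolated variables only once, when the program itself is compiled, so recompiling is unlikely to be the explanation. One way to check the "grep on the whole list versus match per $entry" idea in isolation is a sketch like the following, which leaves the filesystem out entirely. The list of names and the sub names with_grep and with_loop are made up for illustration.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical directory listing, just to have something to filter.
my @names = map { ("file$_.txt", "page$_.html", "dir$_") } 1 .. 1000;

# Filter the whole list in one grep, in the style of t1 and t2.
sub with_grep {
    my $count = grep { /\.txt$/ } @names;
    return $count;
}

# Test each entry inside an explicit loop, in the style of t3 and t4.
sub with_loop {
    my $count = 0;
    for my $entry (@names) {
        $count++ if $entry =~ /\.txt$/;
    }
    return $count;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, { grep => \&with_grep, loop => \&with_loop });
```

If the two rates come out close, the difference measured in the thread more likely comes from the file tests and path handling than from the regexp itself.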
      Hello t0mas !

      I've just run your program (with slight modifications) under Linux on a dual-processor (SMP) P2-350 machine, on my home directory, whose subdirectories contain about 20 text files and quite a lot (about 500 MB) of html files in several directories. The results amazed me, so I ran the test four times in a row; the last three results were identical, but really amazing :

      t1:  7 wallclock secs ( 2.43 usr +  4.27 sys =  6.70 CPU)
      t2:  7 wallclock secs ( 2.43 usr +  4.32 sys =  6.75 CPU)
      t3: 14 wallclock secs ( 8.25 usr +  5.73 sys = 13.98 CPU)
      t4:  7 wallclock secs ( 1.62 usr +  4.77 sys =  6.39 CPU)
      t5:  1 wallclock secs ( 0.84 usr  0.21 sys +  0.00 cusr  0.01 csys =  0.00 CPU)
      

      The trend we can see is that everything is faster in general, by a factor of about 3 or 4, but what is really amazing is how little time &t5() takes: only 1 wallclock second. So I swapped &t4() and &t5() to see if that result was order dependent :

      ...
      t4:  1 wallclock secs ( 0.95 usr  0.18 sys +  0.00 cusr  0.01 csys =  0.00 CPU)
      t5:  7 wallclock secs ( 1.75 usr +  4.65 sys =  6.40 CPU)
      

      But it wasn't: the timing labels follow the call order, so the 1-second line is still File::Find. This is really strange and sheds some new light on File::Find, which I always considered clumsy and which is one of the slower routines under Win32. Wonders of Perl :).

      To see how the results would change, I then reran your test for files that match .html. While going through the source code, I noticed some things about your regular expressions: the ".txt" RE will match any name that contains "txt" preceded by at least one character, and the directory match leaves out directories whose names start with a "." (so Unix "hidden" directories will not be searched). I ran the test three times on about 500 MB of html files and threw away the first set of results.

      t1:  8 wallclock secs ( 2.59 usr +  4.65 sys =  7.24 CPU)
      t2:  8 wallclock secs ( 2.47 usr +  4.66 sys =  7.13 CPU)
      t3: 17 wallclock secs ( 8.65 usr +  5.90 sys = 14.55 CPU)
      t4:  9 wallclock secs ( 1.67 usr +  5.42 sys =  7.09 CPU)
      t5:  2 wallclock secs ( 1.04 usr  0.23 sys +  0.00 cusr  0.01 csys =  0.00 CPU)
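
The regular expression remarks above can be tried out with a short sketch like this one. The file and directory names are made up for illustration, and the anchored pattern is only one reasonable way to write "ends in .txt".

```perl
use strict;
use warnings;

# Made-up file names to show what each pattern accepts.
my @names = ('notes.txt', 'mytxtfile.html', 'atxt', 'txt', '.txt', 'README');

# Unanchored pattern from the benchmark: any one character followed by "txt".
my @loose = grep { /.txt/ } @names;

# Anchored pattern: a literal dot, then "txt", at the end of the name.
my @anchored = grep { /\.txt$/ } @names;

print "loose:    @loose\n";
print "anchored: @anchored\n";

# Made-up directory entries for the two ways of filtering sub-dirs.
my @dirs = ('.', '..', '.config', 'src');

# Pattern used in t1 and t2: drops every name that starts with a dot.
my @skip_hidden = grep { /^[^.].*/ } @dirs;

# Pattern used in t3 and t4: drops only "." and "..".
my @skip_dots = grep { !/^\.\.?$/ } @dirs;

print "skip hidden: @skip_hidden\n";
print "skip dots:   @skip_dots\n";
```

This also shows that t1/t2 and t3/t4 do not visit exactly the same set of directories on a tree that contains dot-directories, which is worth keeping in mind when comparing their timings.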
      

      And amazingly, the trend continues, with &t5() beating the rest by far, even though I had expected all the results to become console bound anyway; that wasn't so.

      I wonder what my tests under NT 4 will bring us :)

        It seems that File::Find is better implemented on *nix systems, or that it does a better job reading inodes than reading the FAT. I was quite amazed that the opendir stunt beat it on Win32.
        Good work. I eagerly await the NT 4 tests.

        /brother t0mas

      I finally got off my lazy back and ran the test on my home machine, a trusty P-100 with 80 MB RAM, and here are the results (with ActivePerl 5.005_03 build 517):

      FAT 16 drive (no HD activity during the second run)
      t1: 17 wallclock secs ( 6.66 usr +  9.89 sys = 16.55 CPU)
      t2: 16 wallclock secs ( 5.89 usr +  8.47 sys = 14.36 CPU)
      t3: 41 wallclock secs (16.67 usr + 18.16 sys = 34.83 CPU)
      t4: 27 wallclock secs ( 8.37 usr + 16.88 sys = 25.26 CPU)
      t5: 15 wallclock secs ( 7.75 usr +  7.07 sys = 14.82 CPU)
      NTFS drive (slight HD activity for the later parts of the HD)
      t1: 96 wallclock secs (30.07 usr + 59.09 sys = 89.17 CPU)
      t2: 87 wallclock secs (27.73 usr + 53.18 sys = 80.91 CPU)
      t3: 179 wallclock secs (72.02 usr + 96.92 sys = 168.94 CPU)
      t4: 142 wallclock secs (36.63 usr + 96.15 sys = 132.78 CPU)
      t5: 81 wallclock secs (35.33 usr + 43.25 sys = 78.58 CPU)
      

      So here File::Find is again on par with the solution reading every directory twice and the solution using rewinddir(), and my favourite method of doing stuff, &t4, doesn't look that good if you are going for peak performance. The fastest solution takes only a little over half the time of &t4, and scanning the whole NTFS HD did take some time, as you see :). So once again rule number one of optimizing holds: benchmark, benchmark, benchmark.

        Thanks Corion for the testing.
        As you say - benchmark, benchmark, benchmark. Speed is king in many circumstances, but maybe not all. It seems that t1, t2 and t5 are best for this simple kind of search, but in more complex cases with lots of heavy evaluations and file operations, t3 and t4 (or a more complex t5) are perhaps better.

        /brother t0mas
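
As a rough idea of what a "more complex t5" might look like, here is a sketch of a File::Find callback that checks for regular files and skips hidden directories the way t1 and t2 do. The sub name find_txt and the throwaway test tree are assumptions for illustration; $File::Find::prune and $File::Find::name are the module's documented variables.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Collect regular files under $dir whose names end in ".txt",
# without descending into directories whose names start with a dot.
sub find_txt {
    my ($dir) = @_;
    my @hits;
    find(sub {
        if (-d $_ && /^\../) {          # hidden directory, but not "." itself
            $File::Find::prune = 1;     # tell File::Find not to descend
            return;
        }
        push @hits, $File::Find::name if -f $_ && /\.txt$/;
    }, $dir);
    return @hits;
}

# Throwaway tree (hypothetical names) so the sketch runs anywhere.
my $root = tempdir(CLEANUP => 1);
mkdir "$root/$_" or die "Can't mkdir $_: $!" for qw(sub .hidden);
for my $rel (qw(a.txt sub/c.txt sub/d.html .hidden/b.txt)) {
    open my $fh, '>', "$root/$rel" or die "Can't write $rel: $!";
    close $fh;
}

my @found = find_txt($root);
print scalar(@found), " matching files\n";
print "$_\n" for sort @found;
```

Any heavier per-file work would go where the push is, so the traversal itself stays in File::Find.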
      This code is AWESOME!
