Re: Optimizing performance for script to traverse on filesystem

by graff (Chancellor)
on Feb 02, 2012 at 03:41 UTC (#951339)

in reply to Optimizing performance for script to traverse on filesystem

I'll be the devil's advocate and suggest that doing your own recursive solution for traversing a directory tree can save some run time. If you have directories with outrageous quantities of files (e.g. more than 100K files/directory), then a minimal opendir/readdir recursion can even save time over unix/linux "find".

The OP code might be a little more streamlined at run-time by using

while ( my $name = readdir( DIR )) { ... }
instead of loading all directory entries into an array.

In case it helps, here's a similar "hand-rolled" recursive traversal script: Get useful info about a directory tree -- it produces different results from what you want, but the basic recursion part is pretty much the same as yours. I even benchmarked it against a File::Find approach, which took noticeably longer to run, possibly due to the number of subroutine calls per directory entry that File::Find does.
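
For readers who want to see the shape of such a recursion without following the link, here is a minimal sketch. It is not the script from the cited node and not the OP's code: the name count_files and the choice to count plain files are assumptions made purely for illustration. It streams entries with a while loop, does one lstat per entry, and skips symlinks so the walk cannot loop.

```perl
use strict;
use warnings;

# Hypothetical helper (not the OP's code): count plain files below $dir
# using nothing but opendir/readdir and one lstat per entry.
sub count_files {
    my ($dir) = @_;
    my $dh;
    unless ( opendir( $dh, $dir ) ) {
        warn "Cannot open $dir: $!\n";
        return 0;
    }
    my $count = 0;

    # Stream entries one at a time instead of slurping them into an array.
    # The explicit defined() makes it obvious that a file named "0" is safe.
    while ( defined( my $name = readdir $dh ) ) {
        next if $name eq '.' or $name eq '..';
        my $path = "$dir/$name";
        lstat($path) or next;    # entry vanished or cannot be examined
        if    ( -d _ ) { $count += count_files($path) }    # a symlink is never -d after lstat
        elsif ( -f _ ) { $count++ }
    }
    closedir $dh;
    return $count;
}

# Usage: perl count_files.pl /some/dir
print count_files( $ARGV[0] ), "\n" if @ARGV;
```

Note that the handle stays open while the recursion descends, so the number of simultaneously open handles equals the current depth of the walk.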

Re^2: Optimizing performance for script to traverse on filesystem
by Marshall (Abbot) on Feb 02, 2012 at 07:46 UTC
    I guess that I'm the "devil's advocate" to the "devil's advocate"?

    re: File::Find - I think that we could cooperate and possibly increase internal performance (I'm game for that), but the interface is "spot on" - it works!

    My suggested modifications to the OP's code represent a massive simplification of program logic.

    There is only one file system operation that happens per $File::Find::name. Maybe File::Find does some more "under the covers"? I'm not sure what you are proposing... But basically, I see no problem with code that makes a single decision based upon a single input.
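
To make the "single decision per entry" idea concrete, here is a rough sketch of a File::Find callback that issues one explicit file test per entry. The sub name big_files and the size threshold are invented for illustration and are not the modifications suggested to the OP. Whatever File::Find does internally is not shown here and is not something this sketch makes claims about.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical example: list plain files larger than $min_bytes under $top.
sub big_files {
    my ( $top, $min_bytes ) = @_;
    my @found;
    find(
        sub {
            # Inside the callback, $_ is the entry's basename (File::Find has
            # chdir'ed into its directory) and $File::Find::name is the full path.
            # -f makes the one stat call; "-s _" reuses that cached result.
            push @found, $File::Find::name if -f $_ and ( -s _ ) > $min_bytes;
        },
        $top
    );
    return @found;
}

# Usage: perl big_files.pl /some/dir 1000000
print "$_\n" for @ARGV == 2 ? big_files(@ARGV) : ();
```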

    I'm game to increase the performance of File::Find - are you willing to help me do it?
    I think that will be a pretty hard undertaking.
    I'm not sure that it is even possible.
    But if it is, let's go for it!

      Thank you for the invitation. Actually, it might be a worthwhile first step just to make sure my assertion isn't based on faulty evidence. If you get a chance to check out the benchmark in the thread I cited above (specifically at this node: Re^2: Get useful info about a directory tree), it's entirely possible that the timing results there are reflecting something other than a difference between File::Find and straight recursion with opendir/readdir.

      (I've seen enough benchmark discussions here at the monastery to know that a proper benchmark can be an elusive creature.)

      If that benchmark happens to be a valid comparison of the two approaches, it would also be a good exercise for a debugger or profiler session, to see what's causing the difference.
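
A fresh comparison could start from something like the sketch below. It is not the benchmark from the cited node: both subs are illustrative stand-ins that count plain files with one lstat per entry, so that the two sides do comparable work. It uses cmpthese from the core Benchmark module and runs each walker once beforehand so that both are timed against a warm file-system cache, which is only one of several things that can skew this kind of measurement.

```perl
use strict;
use warnings;
use File::Find;
use Benchmark qw(cmpthese);

# Both subs are illustrative stand-ins, not the code from the cited thread.

# Count plain files with File::Find, one explicit lstat per entry.
sub find_count {
    my ($top) = @_;
    my $n = 0;
    find( sub { $n++ if lstat($_) && -f _ }, $top );
    return $n;
}

# Count plain files with bare opendir/readdir recursion, one lstat per entry.
sub walk_count {
    my ($dir) = @_;
    my $dh;
    opendir( $dh, $dir ) or return 0;
    my $n = 0;
    while ( defined( my $name = readdir $dh ) ) {
        next if $name eq '.' or $name eq '..';
        my $path = "$dir/$name";
        lstat($path) or next;
        if    ( -d _ ) { $n += walk_count($path) }
        elsif ( -f _ ) { $n++ }
    }
    closedir $dh;
    return $n;
}

# Usage: perl bench.pl /some/big/tree
if (@ARGV) {
    my $top = $ARGV[0];

    # Warm-up pass; it also confirms that both walkers agree on the count.
    printf "File::Find: %d files, readdir: %d files\n",
        find_count($top), walk_count($top);

    # Negative count: run each candidate for at least 3 CPU seconds.
    cmpthese(
        -3,
        {
            'File::Find' => sub { find_count($top) },
            'readdir'    => sub { walk_count($top) },
        }
    );
}
```

If the two counts in the warm-up line differ, the timings are comparing different amounts of work and should not be trusted.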

      In any case, I definitely don't want to dissuade people from using File::Find or its various derivatives and convenience wrappers -- they do make for much easier solutions to the basic problem, and in the vast majority of cases, a little extra run time is a complete non-issue. (It's just that I've had to face a few edge cases where improving run time when traversing insanely large directories made a big difference.)

Re^2: Optimizing performance for script to traverse on filesystem
by gdanenb (Acolyte) on Feb 02, 2012 at 06:37 UTC

    If I use

    while ( my $name = readdir( DIR )) { ... }
    I have to leave DIR open while walking deeper into the recursion.
    Only when all directories on that level have been scanned can I closedir(DIR).
    Isn't that problematic?

      Is the structure likely to be more than a few tens of directories deep? If not, no problem. If it is then you'll have to work really hard to fix the problem regardless of what tools you use because most simple solutions will keep directory handles open.

      True laziness is hard work
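
If open directory handles ever do become a concern, one possible way around it (a sketch only, not something proposed by anyone in this thread) is to replace the recursion with an explicit list of pending directories. Each directory is read to the end and closed before the next one is opened, so only one handle is open at a time. The sub name walk_one_handle and the callback argument are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: call $on_file->($path) for every plain file under $top
# while holding at most one directory handle open at any moment.
sub walk_one_handle {
    my ( $top, $on_file ) = @_;
    my @pending = ($top);    # directories still to be scanned
    while (@pending) {
        my $dir = pop @pending;
        my $dh;
        unless ( opendir( $dh, $dir ) ) {
            warn "Cannot open $dir: $!\n";
            next;
        }
        while ( defined( my $name = readdir $dh ) ) {
            next if $name eq '.' or $name eq '..';
            my $path = "$dir/$name";
            lstat($path) or next;    # entry vanished or cannot be examined
            if    ( -d _ ) { push @pending, $path }    # remember it, descend later
            elsif ( -f _ ) { $on_file->($path) }
        }
        closedir $dh;    # closed before any subdirectory is opened
    }
    return;
}

# Usage: perl walk.pl /some/dir
walk_one_handle( $ARGV[0], sub { print "$_[0]\n" } ) if @ARGV;
```

The trade-off is that the names of not-yet-visited subdirectories are held in memory, and files are no longer reported in the same order as a depth-first recursion would report them.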
