Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change

Fastest way to recurse through VERY LARGE directory tree

by puterboy (Scribe)
on Jan 21, 2011 at 01:53 UTC ( [id://883444]=perlquestion: print w/replies, xml ) Need Help??

puterboy has asked for the wisdom of the Perl Monks concerning the following question:

Seeking help from Perl Efficiency Monks...

I need to recursively perform an action on all files in a large directory tree (tens of millions of files).

Basically, on each file I need to know the (relative) path of the file and do a 'stat' to get the inode number, the size, and the number of links.

Typically I would use File::Find but I was wondering how much overhead there was and if so would I be better off just using manual recursion with opendir/readdir/closedir and perhaps both avoid overhead and potential duplicate calls to stat (that might be buried in the find algorithm).

If, recursion with opendir is reasonably faster, does anybody have some streamlined code to offer so I can avoid "dumb" things that would slow down the recursion?

If Find is just as fast, are there any 'gotchas' I should avoid that would slow things down?

  • Comment on Fastest way to recurse through VERY LARGE directory tree

Replies are listed 'Best First'.
Re: Fastest way to recurse through VERY LARGE directory tree
by JavaFan (Canon) on Jan 21, 2011 at 03:08 UTC

    It may very well be that whatever you come up with is only faster than File::Find on some setups of disks/volume managers/filesystem, and slower on others.

    You need to benchmark to find out.

    Now, in theory, carefully handcrafting something that does exactly what you need is going to be faster than a more general setup than File::Find. But whether that's actually be measurable is a different question.

    So, benchmark.

    Of course, as you describe the problem, the bottleneck might very well be your I/O.

    Hence, benchmark.

    Have I said you should benchmark? No? Well, benchmark!

      Currently the best way for lightweight scanning very big directory tree, is using library File::Find::Object::Rule

      Using this version, you can make secure iterator object, that do not load all scanned tree into memory before it start to work. Example use is very simple as iterator mode:

      $rule=File::Find::Object::Rule->new(); $rule->Some_filter_method_read_library_examples(parameters)->eventuall +y_next_filter(); $rule->start(path_or_array_of_paths); #here will be initialized iterat +or. don't panic, it will not load all big directory structure while (){ my $item=$rule->match(); #read one single item. I prefer do it here, + it prevents matching name as while loop break last unless defined $item; #stop looping after last element #here do anything with $item, it is path, example: printf "Fetched [%s]\n",$item; if (-l $item) {print "it is symbolic link\n"}; };
      you can leave this loop in any state, and for example start next scanning by calling next $rule->start(@new_searches). It will be reinitialized, for me it works. Of course, in that situation you'' use identical filters as previous. If you want do with different filters, call .....->new() and $rule->some_filters() again. warning, this is fork from library File::Find::Rule and File::Find, currently unmaintained for a long time. this notice I found on metacpan.
Re: Fastest way to recurse through VERY LARGE directory tree
by ahmad (Hermit) on Jan 21, 2011 at 03:02 UTC
    use File::Find;

    It's fast enough, and will do the job for you.

Re: Fastest way to recurse through VERY LARGE directory tree
by salva (Canon) on Jan 21, 2011 at 07:08 UTC
    I am sure the bottleneck is going to be the disk I/O, so there is little you can do from Perl to speed the operation.

    ...well, maybe the order used to traverse the file system (deep or breadth first) could have some influence.

Re: Fastest way to recurse through VERY LARGE directory tree
by graff (Chancellor) on Jan 22, 2011 at 18:25 UTC
    I've been a devotee of using the compiled unix/linux "find" utility in preference to File::Find or any of its variants/derivatives, because I found that for any directory tree of significant size, something like this:
    open( FIND, '-|', 'find', $path, @args ); while (<FIND>) { .... }
    was significantly faster. In fact, just for grins, I tried an old benchmark that I posted here several years ago, to see if the results were still true with reasonably modern versions (perl 5.10.0 on macosx, File::Find 1.12), and I found an order of magnitude difference on a reasonably large tree (~30K files, 90 sec using File::Find, 9 sec using "find".)

    But then I ran into a case where someone had created a really obscene quantity of files in a single directory on a freebsd file server, and freebsd's "find" utility choked. (Apparently, that version of "find" was building some sort of in-memory storage for each directory, and it hit a massive number of page faults on the path in question.)

    I reverted to a recursive opendir/readdir approach for that case, and it succeeded reasonably well. Under "normal" conditions, compiled "find" seems to run about 10% faster than using recursive opendir/readdir, but in that particular case of an "abnormal" directory, freebsd "find" became effectively unusable, while opendir/readdir performance was consistent with normal conditions.

    I just posted a utility I wrote for scanning directories, which uses recursive opendir/readdir: Get useful info about a directory tree -- I'm sure it includes a lot of baggage that you don't need, but perhaps it won't be too hard to pick out the useful bits...

Re: Fastest way to recurse through VERY LARGE directory tree
by eff_i_g (Curate) on Jan 21, 2011 at 16:03 UTC
      Unless the OP needs the entire list of "tens of millions of files", I'd suggest not using File::Find::Rule, and I'd instead suggest an iterator or callback based routine (like File::Find). If you can process files one at a time, there's no need to build such a huge list.

        I don't follow you. File::Find::Rule does not simply return every file (although it can). You can instruct it what to return based on type, size, name, and even make a determination via a custom sub. You can have the sub perform actions and ignore the larger return value, or iterate with the start and match methods.

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://883444]
Approved by roubi
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others goofing around in the Monastery: (7)
As of 2024-04-22 22:21 GMT
Find Nodes?
    Voting Booth?

    No recent polls found