Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical

Re: Beginners guide to File::Find

by Aristotle (Chancellor)
on Dec 04, 2002 at 00:10 UTC ( #217378=note: print w/replies, xml ) Need Help??

in reply to Beginners guide to File::Find

Some more nooks and crannies for the last example

'Cause I'm just itching to add a few bits. :)

#!/usr/bin/perl -w use strict; use Getopt::Std; use File::Find; my %opt; @ARGV > 0 and getopts('n:s:m:', \%opt) and not (keys %opt > 1) or die +<< "USAGE"; Shows the biggest files residing in one or several directory trees. usage: $0 [-n num] [-t size] [-m size] directory [directory ...] -n show <num> files -s show biggest files totalling <size> -m show all files bigger than <size> use only one option at a time default is 20 biggest files USAGE my ($switch, $param) = %opt; my %size; find(sub {$size{$File::Find::name}=-s if -f;}, @ARGV); my @sorted = sort {$size{$b} <=> $size{$a}} keys %size; my $maxidx = 0; if($switch eq 's') { ($param -= $size{$_}) >= 0 ? $maxidx++ : last for @sort; } elsif($switch eq 'm') { $size{$_} < $param ? $maxidx++ : last for @sort; } else { $maxidx = ($val || 20) - 1; } printf "%10d $_\n", $size{$_} for @sorted[0 .. $maxidx];

Even more advanced uses

The preprocess and postprocess predicates of File::Find let you do some really wild stuff. To make use of them, you have to use the extended syntax of calling find(). To specify extra options, you have to pass a hash as the first parameter, rather than just a subroutine reference. The simplest case is exactly equivalent to using the subref shorthand:
find( { wanted => \&print_if_dir, }, @dirs); # or find( { wanted => sub { print if -d }, }, @dirs);
Both of the new extra directives, preprocess and postprocess, take a subroutine references, just like the standard wanted one in the above examples. Having that out of the way, let's get to the juicy stuff:


find() passes this routine an array with the entire contents of a directory immediately upon entering the directory and expects it to return the list of interesting files. Any omitted files will not be passed to the wanted function and omitted directories will not even be descended into by find(). This predicate makes File::Find the most powerful tool for all your directory traversal needs. To warm up, here's a silly example that does the same as the previous examples, that is, print only the names of directories:
find( { preprocess => sub { return grep { -d } @_ }, wanted => sub { print }, }, @dirs);

As you (should) know, Perl stores the parameters passed to a subroutine in the special array @_. grep tests all elements of a list passed to it (here: the list of parameters, and thus filenames) against the expression and then returns a new list containing only the elements for which that expression is true. Here, the expression tests whether the entry is a directory, so the result is a list which does not contain any files, symlinks or anything else besides directories. We return this new list, causing find() to forget all the files, symlinks and everything else. It will not pass them to our wanted function, and so we can just print everything we get passed into there. Obviously, this is a contrived example.

So, what really interesting stuff can we do with the preprocess directive? Let's just try to implement the -mindepth and -maxdepth offered by GNU find. Of course, you don't need preprocess to do that. The naive way would be do check the depth of the current location in the directory tree within the wanted function and bail if we're too deep or not deep enough. However, this is wasteful: what if you are traversing a very deep tree with thousands of directories and several hundred thousand files? The wanted function will likely spend most of its time saying "no, not deep enough", "no, too deep", "no, no, too deep", "too deep, next one", throwing away files over and over. The biggest problem here is that even if you only want the files at depth 2-3, find() will happily descend down to level 15, giving wanted all the directories and files it encounters en route, oblivious to the fact that we are only throwing them all away, waiting for the directory traversal to back out up to level 3 again. The solutionn is to use a precprocess routine to cull all directories from the list once we reach the maxdepth, preventing find() from descending any further and getting lost in areas of the tree we aren't interested in anyway. So without further ado:

my ($min_depth, $max_depth) = (2,3); find( { preprocess => \&preprocess, wanted => \&wanted, }, @dirs); sub preprocess { my $depth = $File::Find::dir =~ tr[/][]; return @_ if $depth < $max_depth; return grep { not -d } @_ if $depth == $max_depth; return; } sub wanted { my $depth = $File::Find::dir =~ tr[/][]; return if $depth < $min_depth; print; }
Let's see what happens here. We find out how deep we currently are by counting the forward slashes in the full pathname of wherever we are, $File::Find::dir. If we are below the maximum depth, then we want to look at all files. If we are at the maximum depth, we ditch all directories, so find() will not descend any further. If we somehow got too deep, we return nothing, causing find() to back out of the directory immediately. Finally, in wanted we examine the depth again, in order to avoid processing files below the minimum depth. Because find() needs to descend into these directories we cannot avoid it passing names for directories that are too far up the tree to our wanted function.


This one is a lot less involved; mainly because it neither takes nor returns anything. It is simply called before find() backs out of a directory, which means the entire subtree below it has been processed. In other words, it is safe to mess with the directory without unintentionally confusing find().

The following utility script makes use of this to remove empty directories. It doesn't try to check whether they're empty, because that's relatively complicated (we have to pay attention to the special . and .. entries) and rmdir will not remove a non-empty directory anyway. So we just let it fail harmlessly.

#!/usr/bin/perl -w use strict; use Getopt::Std; use File::Find; @ARGV > 0 and getopts('a:', \my %opt) or die << "USAGE"; Deletes any old files from the directory tree(s) given and removes empty directories en passant. usage: $0 [-a maxage] directory [directory ...] -a maximum age in days, default is 120 USAGE my $max_age_days = $opt{a} || 120; find({ wanted => sub { unlink if -f $_ and -M _ > $max_age_days }, postprocess => sub { rmdir $File::Find::dir }, }, @ARGV);


As if File::Find was not already good enough, these two extra predicates give you the power to do literally anything. preprocess lets you control find()'s behaviour in any way conceivable, and postprocess makes it easy to do any cleanup tasks of all manner without requiring a second directory traversal. Combining these powers makes it very easy to write astonishingly powerful file handling scripts with very little effort.

Update: fixed a couple typos in the text, rearranged a few sentences for clarity. No changes to actual content.

Update: fixed code per reply below.

Makeshifts last the longest.

Replies are listed 'Best First'.
Re^2: Beginners guide to File::Find
by Anonymous Monk on May 26, 2009 at 17:11 UTC
    I'm not sure that anyone ever told you the Nooks and crannies does not work (2 curley braces out of place) but as I am sure you, an obvious guru, don't want erroneous code out here associated with your name. I may be wrong and if so let me know... I may be doing something wrong <blush> Dale Clarke
Re^2: Beginners guide to File::Find
by mrdvt92 (Acolyte) on Aug 31, 2010 at 21:26 UTC
    I just need sorting...
    find({ preprocess => sub { return sort @_ }, wanted => \&callback, }, @path);
    It took a bit of googling to find this thread but this adaptation works great. I cannot believe this is not in the pod for File::Find.
Re^2: Beginners guide to File::Find
by Anonymous Monk on Jan 11, 2012 at 04:10 UTC
    Instead of finding files between min and max depth, how do I find all sub directories (except . and ..) between min and max depth?

      See File::Find::Rule, the synopsis has an example

      use File::Find::Rule; # find all the subdirectories of a given directory my @subdirs = File::Find::Rule->directory->in( $directory );

      For future reference, new questions go in Seekers Of Perl Wisdom

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://217378]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (5)
As of 2020-10-22 00:52 GMT
Find Nodes?
    Voting Booth?
    My favourite web site is:

    Results (225 votes). Check out past polls.