PerlMonks  

Beginners guide to File::Find

by spurperl (Priest)
on Dec 03, 2002 at 09:41 UTC

Traversing the directory tree

It is often necessary to traverse all the files in a directory tree recursively, much as the Unix "find" command does, but from Perl. It is possible to do so the "hard way", using opendir, readdir and their friends. But in Perl, naturally, TMTOWTDI. I want to present not only an "other way to do it" but, IMHO, a "better way to do it", especially for beginners who only need to perform simple tasks.

File::Find basics

Just remember - if you have to traverse files recursively and do some processing on them, this is your friend:
use File::Find;
This module makes recursive file traversal as easy as you could imagine. The following is a naked template for working with this module:
    use File::Find;

    my $dir = '.';    # whatever you want the starting directory to be
    find(\&do_something_with_file, $dir);

    sub do_something_with_file {
        #.....
    }
First, a starting directory is initialized in $dir. If you imagine the directory structure as a tree, this is the root, from which the search starts.
Then, find (a function from the File::Find module) is called. It is given a reference to a subroutine and the starting directory. find will traverse the directory tree and call the supplied subroutine on each file (be it just a file, a directory, a link, etc).
Then we see the definition of the processing function. It is called without arguments; instead, find stores the name of the file it is currently looking at in $_. Consider the following simple example (it prints the names of all directories, starting with "." - the current directory):
    use File::Find;

    find(\&print_name_if_dir, ".");

    sub print_name_if_dir {
        print if -d;
    }
Here, the subroutine print_name_if_dir is given as an argument to find. It simply prints the name of the file if it's a directory. Note the peculiar notation... It's customary in Perl not to mention $_, so:
print if -d;
Is equivalent to:
print $_ if -d $_;
Both are quite cryptic (but hey, it's Perl), and for clarity the routine could be rewritten as:
    sub print_name_if_dir {
        my $file = $_;
        print $file if -d $file;
    }
Routines in Perl can be anonymous, which is more suitable for such simple tasks, so the whole program may be rewritten as:
    use File::Find;

    my $dir = '.';    # whatever you want the starting directory to be
    find(sub {print if -d}, $dir);
Just 3 lines of code, and we're already doing something useful !
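One practical note on the snippets above: print does not add a line break, so the directory names come out run together. Here is a minimal variation that collects the names first and then prints one per line. The sub name dir_names is made up for this illustration and is not part of File::Find.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper (not part of File::Find): return the short names
# of all directories under $dir, in the order find() visits them.
sub dir_names {
    my ($dir) = @_;
    my @names;
    find(sub { push @names, $_ if -d }, $dir);
    return @names;
}

# print adds no newline by itself, so one is appended explicitly.
print "$_\n" for dir_names(".");
```

Collecting into an array first also makes it easy to sort or count the results before printing them.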

For the more advanced

The internal variable $File::Find::name can be used at any time to report the full path to the file. Consider the following improved version of our little script:
    use File::Find;

    find(sub {print $File::Find::name if -d}, ".");
Try running it and compare the results to the previous version. You will notice that it prints the full path to the directory, beginning with the starting directory you passed to find. What happens is the following - File::Find chdirs into each directory it finds in its search, and $_ gets only the short name (w/o path) of the file, while $File::Find::name gets the full path. If, for some reason, you don't want it to chdir, you may specify no_chdir as a parameter. Parameters to find are passed as a hash reference:
    use File::Find;

    find({wanted => sub {print $File::Find::name if -d}, no_chdir => 1}, ".");
Note that "wanted" is the key for the file processing routine in this hash.
The results won't differ from the previous version. Here, however, $_ will also hold the full path to a file (the same value as $File::Find::name), because find still visits every directory but no longer chdirs into it.
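To make the effect of no_chdir visible, here is a small sketch that records $_ next to $File::Find::name for every entry, once with each setting. The helper name seen_pairs and the throwaway tree built with File::Temp are assumptions made for the illustration.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical helper: run find() over $top and return, for every entry,
# a "[$_] [$File::Find::name]" string so the two values can be compared.
sub seen_pairs {
    my ($top, $no_chdir) = @_;
    my @pairs;
    find({
        wanted   => sub { push @pairs, "[$_] [$File::Find::name]" },
        no_chdir => $no_chdir,
    }, $top);
    return @pairs;
}

# Build a tiny throwaway tree: <tmp>/sub/file.txt
my $top = tempdir(CLEANUP => 1);
mkdir "$top/sub" or die "mkdir failed: $!";
open my $fh, '>', "$top/sub/file.txt" or die "open failed: $!";
close $fh;

print "no_chdir => 0\n";
print "  $_\n" for seen_pairs($top, 0);
print "no_chdir => 1\n";
print "  $_\n" for seen_pairs($top, 1);
```

With no_chdir off, the first column shows short names such as file.txt; with it on, both columns show the same path.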
Other parameters may be specified (like 'bydepth', which makes find report a directory's contents before the directory itself), but these are advanced topics. If you're curious, you can look these issues up in the documentation of the module.
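As a quick illustration of what bydepth changes, the sketch below records the order in which paths are reported, with and without the option. The helper name visit_order and the temporary tree are invented for this example.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical helper: return the paths under $top in the order find()
# reports them, with or without the bydepth option.
sub visit_order {
    my ($top, $bydepth) = @_;
    my @order;
    find({
        wanted  => sub { push @order, $File::Find::name },
        bydepth => $bydepth,
    }, $top);
    return @order;
}

# Throwaway tree: <tmp>/sub/file.txt
my $top = tempdir(CLEANUP => 1);
mkdir "$top/sub" or die "mkdir failed: $!";
open my $fh, '>', "$top/sub/file.txt" or die "open failed: $!";
close $fh;

print "default:\n";
print "  $_\n" for visit_order($top, 0);
print "bydepth:\n";
print "  $_\n" for visit_order($top, 1);
```

By default a directory is reported before what it contains; with bydepth it is reported afterwards, which matters when the callback deletes or renames things.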

Bonus - a useful utility based on File::Find

Ever felt that your quota suffocates you, and couldn't find the unnecessarily large files to remove? Do you find "du" too tedious to use in these cases? File::Find comes to the rescue. Consider the following script... It takes a starting directory, and prints the 20 largest files found in the tree under this directory - specifying full paths, so you can just cut-n-paste them into "rm":
    #!/usr/local/bin/perl -w
    use strict;
    use File::Find;

    ($#ARGV == 0) or die "Usage: $0 directory\n";

    # remember the size of every plain file, keyed by its full path
    my %size;
    find(sub {$size{$File::Find::name} = -s if -f;}, @ARGV);

    # largest first, keep at most 20
    my @sorted = sort {$size{$b} <=> $size{$a}} keys %size;
    splice @sorted, 20 if @sorted > 20;

    foreach (@sorted) {
        printf "%10d %s\n", $size{$_}, $_;
    }
What goes on here? find traverses the given directory recursively, recording each file's size in the %size hash (-s if -f means "get the size if this is a plain file"). The script then sorts the hash keys by size, largest first, and prints the 20 largest files. That's it... I use this utility quite a lot to clean space, I hope you find it useful too (and also understand exactly how it works!)

Update:

Thanks to rinceWind for this:
File::Find is cross-platform. It's one of the really handy ways for iterating directory trees on Windows - something Microsoft don't encourage you to do, with their 'hidden files' (File::Find X-rays through Windows hidden files mechanism nicely :-).

With this in mind, though, you must be careful when working with Windows' paths, because slashes there have a different direction. There is a nice tutorial - Paths in Perl, that explains this.
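A minimal sketch of one way to deal with the slash direction, assuming the core File::Spec modules are an acceptable tool here. As far as I know, File::Find builds its names with forward slashes on every platform. File::Spec->canonpath tidies a path according to the rules of the system the script runs on. The example calls File::Spec::Win32 directly only so that the Windows result can be seen on any machine. The path and the sub name native_win32 are made up.

```perl
use strict;
use warnings;
use File::Spec;
use File::Spec::Win32;

# A made-up name of the kind File::Find would report on Windows.
my $found = 'C:/Users/me/notes.txt';

# Hypothetical helper: convert a forward-slash path to Windows notation.
# In a real script, File::Spec->canonpath($path) picks the rules of the
# current platform automatically.
sub native_win32 {
    my ($path) = @_;
    return File::Spec::Win32->canonpath($path);
}

print native_win32($found), "\n";
```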

Update 2:

There are some nice continuation replies written to this tutorial - special thanks to Aristotle, who supplied some info for the real advanced use of File::Find.

Conclusion

File::Find can turn tasks involving recursive file traversal from torture to pleasure, if you know how to use it. Modules like this make Perl the wonderful language it is - you can perform useful tasks without pain. Enjoy!

Re: Beginners guide to File::Find
by rinceWind (Monsignor) on Dec 03, 2002 at 10:57 UTC
    Excellent, spurperl, a good introduction to a very useful module. There's a couple of points I feel should be included though.

    File::Find is cross-platform. It's one of the really handy ways for iterating directory trees on Windows - something Microsoft don't encourage you to do, with their 'hidden files' (File::Find X-rays through Windows hidden files mechanism nicely :-).

    There's also a gotcha about the direction and number of the slashes. The value of $_ and $File::Find::name is not what you would expect in Windows.

      Thanks for the feedback rinceWind

      I will include a notice about the cross-platformness of Find.

      Also, the slashes thing - I assume it mostly applies to Windows ? If so, I've seen a nice explanation about these things in another tutorial - Paths in Perl, perhaps I can include a pointer to it

Re: Beginners guide to File::Find
by adrianh (Chancellor) on Dec 03, 2002 at 11:58 UTC

    ++. Nice introduction.

    Might be worth mentioning File::Find::Rule as an alternative way of going about things.

Re: Beginners guide to File::Find
by princepawn (Parson) on Dec 03, 2002 at 14:25 UTC
    Not only I want to present an "other way to do it", but IMHO a "better way to do it", especially for beginners who only need to perform simple tasks.
    Don't forget about glob, it is very good for very simple tasks.

    Carter's compass: I know I'm on the right track when by deleting something, I'm adding functionality

Re: Beginners guide to File::Find
by Aristotle (Chancellor) on Dec 04, 2002 at 00:10 UTC

    Some more nooks and crannies for the last example

    'Cause I'm just itching to add a few bits. :)

        #!/usr/bin/perl -w
        use strict;
        use Getopt::Std;
        use File::Find;

        my %opt;
        # parse the options first, then make sure a directory is left over
        getopts('n:s:m:', \%opt) and @ARGV > 0 and not (keys %opt > 1) or die <<"USAGE";
        Shows the biggest files residing in one or several directory trees.

        usage: $0 [-n num] [-s size] [-m size] directory [directory ...]

          -n  show <num> files
          -s  show biggest files totalling <size>
          -m  show all files bigger than <size>

        use only one option at a time
        default is 20 biggest files
        USAGE

        my ($switch, $param) = %opt;
        $switch = 'n' unless defined $switch;

        my %size;
        find(sub {$size{$File::Find::name} = -s if -f;}, @ARGV);
        my @sorted = sort {$size{$b} <=> $size{$a}} keys %size;

        my $count = 0;    # number of files to show
        if($switch eq 's') {
            # take files while their running total stays within <size>
            for (@sorted) {
                last if ($param -= $size{$_}) < 0;
                $count++;
            }
        }
        elsif($switch eq 'm') {
            # sorted largest first, so stop at the first file that is too small
            for (@sorted) {
                last if $size{$_} <= $param;
                $count++;
            }
        }
        else {
            $count = $param || 20;
            $count = @sorted if $count > @sorted;
        }

        printf "%10d %s\n", $size{$_}, $_ for @sorted[0 .. $count - 1];

    Even more advanced uses

    The preprocess and postprocess predicates of File::Find let you do some really wild stuff. To make use of them, you have to use the extended syntax of calling find(). To specify extra options, you have to pass a hash reference as the first parameter, rather than just a subroutine reference. The simplest case is exactly equivalent to using the subref shorthand:
        find( { wanted => \&print_if_dir, }, @dirs);
        # or
        find( { wanted => sub { print if -d }, }, @dirs);
    Both of the new extra directives, preprocess and postprocess, take a subroutine reference, just like the standard wanted one in the above examples. Having that out of the way, let's get to the juicy stuff:

    preprocess

    find() passes this routine an array with the entire contents of a directory immediately upon entering the directory and expects it to return the list of interesting files. Any omitted files will not be passed to the wanted function and omitted directories will not even be descended into by find(). This predicate makes File::Find the most powerful tool for all your directory traversal needs. To warm up, here's a silly example that does the same as the previous examples, that is, print only the names of directories:
        find( {
            preprocess => sub { return grep { -d } @_ },
            wanted     => sub { print },
        }, @dirs);

    As you (should) know, Perl stores the parameters passed to a subroutine in the special array @_. grep tests all elements of a list passed to it (here: the list of parameters, and thus filenames) against the expression and then returns a new list containing only the elements for which that expression is true. Here, the expression tests whether the entry is a directory, so the result is a list which does not contain any files, symlinks or anything else besides directories. We return this new list, causing find() to forget all the files, symlinks and everything else. It will not pass them to our wanted function, and so we can just print everything we get passed into there. Obviously, this is a contrived example.
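A slightly less contrived sketch of the same mechanism: a preprocess routine that drops unwanted directory names before find() ever descends into them, and sorts whatever remains. The helper name list_tree, the skip list and the temporary tree are assumptions for the illustration only.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical helper: list every path under $top, handing the entries of
# each directory to find() in sorted order and leaving out any name that
# appears in the %$skip hash (so such directories are never entered).
sub list_tree {
    my ($top, $skip) = @_;
    my @paths;
    find({
        preprocess => sub {
            # @_ holds the bare names found in the current directory
            my @keep = grep { !$skip->{$_} } @_;
            return sort @keep;
        },
        wanted => sub { push @paths, $File::Find::name },
    }, $top);
    return @paths;
}

# Throwaway tree with a directory we want to ignore.
my $top = tempdir(CLEANUP => 1);
mkdir "$top/$_" or die "mkdir $_: $!" for '.git', 'src';
for my $file ('b.txt', 'a.txt', '.git/config', 'src/main.pl') {
    open my $fh, '>', "$top/$file" or die "open $file: $!";
    close $fh;
}

print "$_\n" for list_tree($top, { '.git' => 1 });
```

Because the pruning happens in preprocess, the wanted routine never sees anything below the skipped directory and needs no checks of its own.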

    So, what really interesting stuff can we do with the preprocess directive? Let's just try to implement the -mindepth and -maxdepth offered by GNU find. Of course, you don't need preprocess to do that. The naive way would be to check the depth of the current location in the directory tree within the wanted function and bail if we're too deep or not deep enough. However, this is wasteful: what if you are traversing a very deep tree with thousands of directories and several hundred thousand files? The wanted function will likely spend most of its time saying "no, not deep enough", "no, too deep", "no, no, too deep", "too deep, next one", throwing away files over and over. The biggest problem here is that even if you only want the files at depth 2-3, find() will happily descend down to level 15, giving wanted all the directories and files it encounters en route, oblivious to the fact that we are only throwing them all away, waiting for the directory traversal to back out up to level 3 again. The solution is to use a preprocess routine to cull all directories from the list once we reach the maxdepth, preventing find() from descending any further and getting lost in areas of the tree we aren't interested in anyway. So without further ado:

        my ($min_depth, $max_depth) = (2,3);

        find( {
            preprocess => \&preprocess,
            wanted     => \&wanted,
        }, @dirs);

        sub preprocess {
            my $depth = $File::Find::dir =~ tr[/][];
            return @_ if $depth < $max_depth;
            return grep { not -d } @_ if $depth == $max_depth;
            return;
        }

        sub wanted {
            my $depth = $File::Find::dir =~ tr[/][];
            return if $depth < $min_depth;
            print;
        }
    Let's see what happens here. We find out how deep we currently are by counting the forward slashes in the full pathname of wherever we are, $File::Find::dir. If we are below the maximum depth, then we want to look at all files. If we are at the maximum depth, we ditch all directories, so find() will not descend any further. If we somehow got too deep, we return nothing, causing find() to back out of the directory immediately. Finally, in wanted we examine the depth again, in order to avoid processing files below the minimum depth. Because find() needs to descend into these directories we cannot avoid it passing names for directories that are too far up the tree to our wanted function.
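One thing to keep in mind: counting the slashes in $File::Find::dir gives the depth from the start of the path string, so a starting directory such as /usr/lib already counts as depth 2. The sketch below is a variation, not the code above, that subtracts the slashes of the starting directory so the depth is relative to it. The sub name find_between_depths is invented, and it assumes a single starting directory that is not the filesystem root.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return the paths the min/max depth scheme would
# print, with depth measured relative to $top.
# Assumes $top is a single directory and not the root directory "/".
sub find_between_depths {
    my ($top, $min_depth, $max_depth) = @_;
    $top =~ s{/+\z}{} if length $top > 1;    # treat "dir/" like "dir"
    my $base  = ($top =~ tr[/][]);           # slashes already in $top
    my $depth = sub { ($File::Find::dir =~ tr[/][]) - $base };

    my @found;
    find({
        preprocess => sub {
            my $d = $depth->();
            return @_ if $d < $max_depth;                    # keep everything
            return grep { not -d } @_ if $d == $max_depth;   # stop descending
            return;                                          # too deep
        },
        wanted => sub {
            return if $depth->() < $min_depth;               # not deep enough
            push @found, $File::Find::name;
        },
    }, $top);
    return @found;
}

# Example call: entries whose containing directory is 1 or 2 levels below ".".
print "$_\n" for find_between_depths('.', 1, 2);
```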

    postprocess

    This one is a lot less involved; mainly because it neither takes nor returns anything. It is simply called before find() backs out of a directory, which means the entire subtree below it has been processed. In other words, it is safe to mess with the directory without unintentionally confusing find().

    The following utility script makes use of this to remove empty directories. It doesn't try to check whether they're empty, because that's relatively complicated (we have to pay attention to the special . and .. entries) and rmdir will not remove a non-empty directory anyway. So we just let it fail harmlessly.

        #!/usr/bin/perl -w
        use strict;
        use Getopt::Std;
        use File::Find;

        @ARGV > 0 and getopts('a:', \my %opt) or die << "USAGE";
        Deletes any old files from the directory tree(s) given and
        removes empty directories en passant.
        usage: $0 [-a maxage] directory [directory ...]
               -a  maximum age in days, default is 120
        USAGE

        my $max_age_days = $opt{a} || 120;

        find({
            wanted      => sub { unlink if -f $_ and -M _ > $max_age_days },
            postprocess => sub { rmdir $File::Find::dir },
        }, @ARGV);
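To see when postprocess fires relative to wanted, here is a small sketch that only records the calls and deletes nothing. The helper name call_log and the temporary tree are made up for the illustration.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical helper: log every wanted and postprocess call in order.
sub call_log {
    my ($top) = @_;
    my @log;
    find({
        wanted      => sub { push @log, "wanted      $File::Find::name" },
        postprocess => sub { push @log, "postprocess $File::Find::dir" },
    }, $top);
    return @log;
}

# Throwaway tree: <tmp>/sub/old.txt
my $top = tempdir(CLEANUP => 1);
mkdir "$top/sub" or die "mkdir failed: $!";
open my $fh, '>', "$top/sub/old.txt" or die "open failed: $!";
close $fh;

print "$_\n" for call_log($top);
```

The postprocess line for a directory appears only after every entry below it has been handed to wanted, and the starting directory's postprocess comes last of all.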

    Conclusion

    As if File::Find was not already good enough, these two extra predicates give you a great deal of power. preprocess lets you control which entries find() looks at and in what order, and postprocess makes it easy to do cleanup tasks without requiring a second directory traversal. Combining them makes it very easy to write powerful file handling scripts with very little effort.

    Update: fixed a couple typos in the text, rearranged a few sentences for clarity. No changes to actual content.

    Update: fixed code per reply below.

    Makeshifts last the longest.

      I'm not sure that anyone ever told you that the Nooks and crannies code does not work (2 curly braces out of place), but I am sure you, an obvious guru, don't want erroneous code out here associated with your name. I may be wrong and if so let me know... I may be doing something wrong <blush> Dale Clarke dalec@delta1.net
      I just need sorting...
          find({
              preprocess => sub { return sort @_ },
              wanted     => \&callback,
          }, @path);
      It took a bit of googling to find this thread but this adaptation works great. I cannot believe this is not in the pod for File::Find.
      Instead of finding files between min and max depth, how do I find all sub directories (except . and ..) between min and max depth?

        See File::Find::Rule, the synopsis has an example

            use File::Find::Rule;
            # find all the subdirectories of a given directory
            my @subdirs = File::Find::Rule->directory->in( $directory );

        For future reference, new questions go in Seekers Of Perl Wisdom

Re: Beginners guide to File::Find
by spurperl (Priest) on Dec 05, 2002 at 06:04 UTC
    Thanks for all the great comments, monks. I've added a couple of updates to reflect the ones I see most important.
Re: Beginners guide to File::Find
by tos (Deacon) on May 25, 2003 at 19:28 UTC
    Hi,

    File::Find wasn't my friend in the past. Therefore i appreciate your interesting article. A few times i looked for possibility to work with File::Find as can be done with gnu-find regarding the maxdepth-feature.

    The no_chdir-parameter seemed to me as a possible beginning for this.

    But when i tested the relevant code snippet in your article i couldn't recognize any difference between false or true no_chdir

        # cat pf1
        #! /usr/local/bin/perl
        use warnings;
        use strict;
        use File::Find;

        my $dir = shift @ARGV;
        find({wanted => sub {print $File::Find::name,"\n" if -d}, no_chdir => 0}, "$dir");
        print "\n";
        find({wanted => sub {print $File::Find::name,"\n" if -d}, no_chdir => 1}, "$dir");
    given directory da
        # find da
        da
        da/s1
        da/s1/f1
        da/s1/f2
        da/s2
        da/s2/sx
        da/s2/sx/sa
        da/s2/sx/f4
        da/s2/sx/f5
        da/s2/sy
        da/s2/f3
        da/s3
        da/s3/sz
        da/s3/sz/f7
        da/s3/sz/f8
        da/s3/f6
        da/s4
    pf1's output
        # ./pf1 ./da
        ./da
        ./da/s4
        ./da/s3
        ./da/s3/sz
        ./da/s2
        ./da/s2/sy
        ./da/s2/sx
        ./da/s2/sx/sa
        ./da/s1

        ./da
        ./da/s4
        ./da/s3
        ./da/s3/sz
        ./da/s2
        ./da/s2/sy
        ./da/s2/sx
        ./da/s2/sx/sa
        ./da/s1
    What's the problem ?

    greetings, tos

      tos,
      As far as I can tell - there is no problem. I just spent about 20 minutes looking at the find2perl code as well as looking at the docs on File::Find. It doesn't appear that there is support for the gnu find's maxdepth option.

      But don't get mad, get even by checking out File::Find::Rule which does have a maxdepth option and is argued by some to be easier to use than File::Find.

      Cheers - L~R

      You don't see any difference when changing the no_chdir parameter because you are just printing $File::Find::name which is the full path of the file, and no_chdir does not change that. If you print $_ instead, you can see the difference: with no_chdir set to 1, the filename will be relative to the directory you specified for find() to begin the search (since it will not change dirs as it traverses the filesystem tree). If no_chdir is set to 0, then find() will chdir into dirs as it traverses the tree, giving you the filename (relative to that file's directory) in $_.

      The no_chdir option isn't what you want to implement GNU find's maxdepth, check Aristotle's reply to this thread instead.

      hope this helps,

Re: Beginners guide to File::Find
by Anonymous Monk on Mar 10, 2005 at 14:55 UTC
    This looked promising but the examples don't work on win32. Or I'm probably doing something else wrong.
      Dear anonymous monk, I can assure you that these fine examples do work on Win32. Don't give up on File::Find, it makes life better.
      If I have a need to process all of the files in a directory tree, I use find2perl to generate a template. If you are familiar with the *ix find command, there are several options you can use to your advantage, but I usually stick to find2perl / -type f -print > template.pl . You can substitute any starting directory instead of / to suit, and the -type f will limit the code to files.

      You should end up with this subroutine:

          sub wanted {
              my ($dev,$ino,$mode,$nlink,$uid,$gid);

              (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($_)) &&
              -f _ &&
              print("$name\n");
          }

      I replace the print("$name\n") with my code that performs an action on every file.

        Does anyone know how to get the return value for find ? There is no documentation or any discussion on it
