PerlMonks  

An alternative to File::Find

by graff (Chancellor)
on Jan 30, 2004 at 06:14 UTC (#325146=snippet)

Description: To traverse a directory tree and do stuff with some or all of the data files therein, this method works very fast, takes up very little memory, and is a relatively easy framework for handling lots of jobs of this ilk. It involves using the standard unix "find" utility (which has been ported for ms-windows users, of course).

# assume you have a $toppath, which is where the traversal starts

chdir $toppath or die "can't cd to $toppath: $!";

open( FIND, "find . -type d -print0 |" ) or die "can't run find: $!";

# find will traverse downward from current directory
# (ie. $toppath), and because of the "-type d" option,
# will only list the paths of directories contained here;
# the "-print0" (thanks, etcshadow) sets a null byte as the
# string terminator for each file name (don't rely on "\n",
# which could be part of a file name).

{
  local $/ = "\x0";  # added thanks to etcshadow's reply
  while ( my $dir = <FIND> ) {
    chomp $dir;
    unless ( opendir( DIR, $dir )) {
        warn "$toppath/$dir: opendir failed: $!\n";
        next;
    }
    while ( my $file = readdir( DIR )) {
        next if ( -d "$dir/$file" ); # outer while loop will handle all dirs
        # do what needs to be done with data files
    }
    closedir DIR;
    # anything else we need to do regarding this directory
  }
}
close FIND;

Comments:

The nice thing about this approach is that the "find" utility is very good with the recursive descent into subdirectories, and that's all it needs to do. Meanwhile, perl is very good with reading directory contents and manipulating data files, and it's really easy to do this when you're just working with data files in one directory at a time. Here, Perl can just skip over any subdirectories that it sees, because the output from "find" will bring those up for treatment in due course.
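The division of labour described above can also be written with a lexical filehandle and a list-form pipe open, so the starting path never passes through a shell. This is only a sketch under assumptions: `find_dirs` is a made-up helper name, and `-print0` requires a find that supports it (GNU and BSD find do).

```perl
use strict;
use warnings;

# Ask the external "find" for every directory at or below $top.
# "find_dirs" is a hypothetical helper name, not part of the original snippet.
sub find_dirs {
    my ($top) = @_;
    # List-form open runs find directly, without a shell in between.
    open( my $find, '-|', 'find', $top, '-type', 'd', '-print0' )
        or die "can't run find: $!";
    local $/ = "\0";            # records end with a NUL byte, not "\n"
    my @dirs;
    while ( my $dir = <$find> ) {
        chomp $dir;             # strips the trailing "\0"
        push @dirs, $dir;
    }
    close $find or warn "find exited abnormally: $?\n";
    return @dirs;
}

print "$_\n" for find_dirs('.');
```

Collecting the paths into an array gives up the streaming behaviour of the original loop; for very large trees, doing the per-directory work inside the `while` loop keeps memory use flat.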

(update: made minor adjustments to comments in the code, added "closedir"; also wanted to point out that the loop over files could be moderated by using "grep ... readdir(DIR)", etc.)
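As a rough illustration of the "grep ... readdir(DIR)" idea mentioned in the update, the per-directory file list can be filtered in one pass. The helper name `plain_files` is invented for this sketch, and a lexical directory handle is used in place of the bareword `DIR`.

```perl
use strict;
use warnings;

# Return the names of non-directory entries directly inside $dir (no recursion).
# "plain_files" is a hypothetical name used only for this example.
sub plain_files {
    my ($dir) = @_;
    opendir( my $dh, $dir ) or die "opendir $dir failed: $!";
    # grep drops subdirectories, which also removes "." and ".."
    my @files = grep { !-d "$dir/$_" } readdir($dh);
    closedir $dh;
    @files = sort @files;
    return @files;
}

print "$_\n" for plain_files('.');
```

The grep block can carry any other test as well, for example a pattern match on `$_` to keep only certain extensions.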

Re: An alternative to File::Find
by etcshadow (Priest) on Jan 30, 2004 at 06:32 UTC
    Man... if you're gonna suggest something like this, at least make your find command:
    "find . -type d -print0 |"
    and then in your perl, set
    local $/ = "\0";
    This is exactly the sort of thing that gives perl a bad rep... it's basically no harder to use print0 and $/="\0", yet you don't do it. It's like people not checking the return value of malloc, for cryin out loud!
    ------------ :Wq Not an editor command: Wq
      Right -- that's nice. It would get rid of the "chomp $dir;", and ... um ... AH! It took me a while to get your point: Sometimes, one or more characters within a file name happens to be "\n"! (When this happens it's truly evil, but I know it does happen.) Thanks!

      update: For some reason, the "/usr/bin/find" that comes with solaris 8 does not support the "-print0" flag -- need the GNU version to do that. Oh well.

        Exactly... but you still shouldn't get rid of the chomp... the chomp will just take off the "\0" (chomp removes a trailing $/, even if $/ is set to something other than "\n").
        ------------ :Wq Not an editor command: Wq
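A small sketch of the chomp behaviour etcshadow describes: with `$/` set to a NUL byte, chomp removes one trailing NUL and leaves an embedded newline alone. The helper name `strip_nul` and the sample file name are made up for the demonstration.

```perl
use strict;
use warnings;

# Strip one trailing record separator from a name, with $/ set to NUL.
# "strip_nul" is a hypothetical helper for this demonstration only.
sub strip_nul {
    my ($name) = @_;
    local $/ = "\0";
    chomp $name;        # removes a trailing "\0", not "\n"
    return $name;
}

my $raw   = "./odd\nname\0";    # a file name that contains a newline
my $clean = strip_nul($raw);
printf "%d bytes before, %d after\n", length($raw), length($clean);
```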
•Re: An alternative to File::Find
by merlyn (Sage) on Jan 30, 2004 at 12:11 UTC
    To traverse a directory tree and do stuff with some or all of the data files therein, this method works very fast, takes up very little memory, and is a relatively easy framework for handling lots of jobs of this ilk. It involves using the standard unix "find" utility (which has been ported for ms-windows users, of course).
    I'm not sure why you think this is better than File::Find. You've fulfilled none of your objectives, and only made it more dependent on the outside environment, and slower, and take more net memory.

    If you really don't like the interface of File::Find, try my File::Finder, which is essentially find implemented in Perl.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      I'm not sure why you think this is better than File::Find.
      Let's just say that I've seen (and posted) evidence that running the "find" utility in a sub-shell was faster than using File::Find on a common task, other things being equal. And I've seen lots of SoPW posts where people have run into a wide variety of problems because they didn't quite figure out the right way to use File::Find; folks seem able to get into all kinds of deep trouble with this module (in fact, this snippet was originally part of a reply to one such SoPW node). In contrast, working through a flat list of directories, and operating on data files within each one, is something that most folks can get their heads around.
      You've fulfilled none of your objectives, and only made it more dependent on the outside environment, and slower, and take more net memory.
      On the contrary, my goal was to avoid complicated recursion and excess memory consumption within a perl script, and this proposal meets both goals. The C-compiled "find" utility runs with a constant memory footprint, regardless of the size of the directory tree being scanned, and that footprint is very small (less than one meg on both solaris and linux). I'll confess that I haven't looked at how much memory is added to a perl script by using File::Find, so I don't know how that compares; I also haven't checked the memory footprint for "find" in other OS environments.

      Compiled "find" handles the recursive part of traversal easily, and allows the perl script to focus on the non-recursive part of the problem. And "find" is faster than File::Find (I wonder whether you have seen any evidence that would contradict this). Dependency on the "outside environment" is certainly not an evil in itself, especially when it saves time during both coding and execution -- it's a good feature of perl that this sort of dependency is easy to exploit (as in "not reinventing the wheel").
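For readers who want to compare the two approaches side by side, a directories-only listing with the core File::Find module might look roughly like this. It is a sketch, not a benchmark; `list_dirs` is an invented name, and the symlink check is an assumption made to mimic what "find -type d" reports.

```perl
use strict;
use warnings;
use File::Find;

# Collect every real directory at or below $top using core File::Find.
# "list_dirs" is a hypothetical name used only for this comparison.
sub list_dirs {
    my ($top) = @_;
    my @dirs;
    find(
        {
            no_chdir => 1,      # stay in the current directory while walking
            wanted   => sub {
                my $path = $File::Find::name;
                # skip symlinks to directories, as "find -type d" does
                push @dirs, $path if -d $path && !-l $path;
            },
        },
        $top
    );
    return @dirs;
}

print "$_\n" for list_dirs('.');
```

Here the recursion happens inside the Perl process, which is the part the snippet hands off to the external find.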

      Update: (I think this may be the first time I ever downvoted one of your nodes, merlyn.) I installed File::Finder (along with the "Text::Glob" module that it depends on) just to try it out. I'm sure the OO-style approach is appealing, but I wonder whether you would recommend a different way to benchmark it... The timings shown below are on a linux box, using a target directory that contains nearly 2000 files, 17 of which are sub-directories, going down as far as four levels:

      #!/usr/bin/perl
      use strict;
      use Benchmark;
      use File::Finder;
      use File::Find;

      my $Usage = "$0 some/path\n";
      die $Usage unless @ARGV and -d $ARGV[0];

      #chdir $ARGV[0] or die "can't chdir to $ARGV[0]";
      # (no, don't chdir; just pass the target path to [Ff]ind...

      timethese( 50, {
          'File::Finder module' => \&try_Finder,
          'shell-find pipeline' => \&try_pipe,
      });

      sub try_Finder {
          my $files = File::Finder->type('f');
          find( $files->print, $ARGV[0] );
      }

      sub try_pipe {
          open( FIND, "find $ARGV[0] -type f |" );
          print while (<FIND>);
          close FIND;
      }
      __END__

      # Output:
      Benchmark: timing 50 iterations of File::Finder module, shell-find pipeline...
      File::Finder module: 9 wallclock secs ( 8.44 usr + 0.75 sys = 9.19 CPU) @ 5.44/s (n=50)
      shell-find pipeline: 2 wallclock secs ( 0.47 usr 0.06 sys + 0.38 cusr 0.50 csys = 1.41 CPU) @ 94.34/s (n=50)

      (another update: Just to clarify, I ran the above with a command line like this:
      perl test-find.pl some_path | grep -v some_path
      so that only the benchmark output went to the terminal, and the time to actually send 100 * 2000 file-names to the screen was not part of the comparison.)

      last update: (I promise!) Just to be sure, I tried using different "names" (hash keys) for the two test functions, so that the benchmark would run the shell version first -- just in case there was a "first time through vs. cached" issue when scanning the directory -- and the results came out the same: "find" is many times faster than File::Find.

Re: An alternative to File::Find
by crabbdean (Pilgrim) on Mar 04, 2004 at 13:09 UTC
    In addition to using this code for directory traversing also see the node Re: Re: Re: Importing external dependencies into source that I just wrote on compiling your Perl source and the GNU find.exe into a single executable to remove the external dependencies. Together the two offer a complete single source solution without using File::Find

    Enjoy!
    Dean

    Programming these days takes more than a lone avenger with a compiler. - sam
