PerlMonks
Re: •Re: An alternative to File::Find

by graff (Chancellor)
on Feb 02, 2004 at 02:35 UTC [id://325776]


in reply to •Re: An alternative to File::Find
in thread An alternative to File::Find

> I'm not sure why you think this is better than File::Find.
Let's just say that I've seen (and posted) evidence that running the "find" utility in a sub-shell was faster than using File::Find on a common task, other things being equal. And I've seen lots of SoPW posts where people have run into a wide variety of problems because they didn't quite figure out the right way to use it -- seems like folks are able to get into all kinds of deep trouble with this module (in fact, this snippet was originally part of a reply to one such SoPW node). In contrast, working through a flat list of directories, and operating on data files within each one, is something that most folks can get their heads around.
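A minimal sketch of that "flat list of directories" pattern (not the original snippet; the helper name walk_flat is made up here, and it assumes a Unix-like system with "find" on the PATH):

```perl
use strict;
use warnings;

# walk_flat: let the compiled "find" utility do the recursion
# (directories only), then do the non-recursive, per-directory
# work in plain Perl -- no recursion in the script itself.
sub walk_flat {
    my ( $top, $callback ) = @_;
    # List-form pipe open avoids shell quoting issues.
    open( my $dirs, '-|', 'find', $top, '-type', 'd' )
        or die "can't run find: $!";
    while ( my $dir = <$dirs> ) {
        chomp $dir;
        opendir( my $dh, $dir ) or next;
        for my $name ( readdir $dh ) {
            my $path = "$dir/$name";
            $callback->($path) if -f $path;    # plain files only
        }
        closedir $dh;
    }
    close $dirs;
}

# Example: print every plain file under the first argument.
walk_flat( $ARGV[0] || '.', sub { print "$_[0]\n" } ) if @ARGV;
```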
> You've fulfilled none of your objectives, and only made it more dependent on the outside environment, and slower, and take more net memory.
On the contrary, my goal was to avoid complicated recursion and excess memory consumption within a Perl script, and this proposal meets both goals. The C-compiled "find" utility runs with a constant memory footprint, regardless of the size of the directory tree being scanned, and that footprint is very small (less than one meg on both Solaris and Linux). I'll confess that I haven't looked at how much memory is added to a Perl script by using File::Find, so I don't know how that compares; I also haven't checked the memory footprint for "find" in other OS environments.

Compiled "find" handles the recursive part of traversal easily, and allows the Perl script to focus on the non-recursive part of the problem. And "find" is faster than File::Find (I wonder whether you have seen any evidence that would contradict this). Dependency on the "outside environment" is certainly not an evil in itself, especially when it saves time during both coding and execution -- it's a good feature of Perl that this sort of dependency is easy to exploit (as in "not reinventing the wheel").
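For reference, the pure-Perl side of that comparison -- producing the same listing as "find $top -type f" with File::Find -- is only a few lines (the wrapper name list_files is made up for this sketch):

```perl
use strict;
use warnings;
use File::Find;

# Print every plain file under $top, recursing in pure Perl.
# Inside the wanted sub, $_ is the basename (File::Find chdirs
# into each directory), and $File::Find::name is the full path.
sub list_files {
    my ($top) = @_;
    find( sub { print "$File::Find::name\n" if -f }, $top );
}

list_files( $ARGV[0] ) if @ARGV;
```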

Update: (I think this may be the first time I ever downvoted one of your nodes, merlyn.) I installed File::Finder (along with the "Text::Glob" module that it depends on) just to try it out. I'm sure the OO-style approach is appealing, but I wonder whether you would recommend a different way to benchmark it... The timings shown below are on a linux box, using a target directory that contains nearly 2000 files, 17 of which are sub-directories, going down as far as four levels:

#!/usr/bin/perl
use strict;
use Benchmark;
use File::Finder;
use File::Find;

my $Usage = "$0 some/path\n";
die $Usage unless @ARGV and -d $ARGV[0];

#chdir $ARGV[0] or die "can't chdir to $ARGV[0]";
# (no, don't chdir; just pass the target path to [Ff]ind...

timethese( 50, {
    'File::Finder module' => \&try_Finder,
    'shell-find pipeline' => \&try_pipe,
});

sub try_Finder {
    my $files = File::Finder->type('f');
    find( $files->print, $ARGV[0] );
}

sub try_pipe {
    open( FIND, "find $ARGV[0] -type f |" ) or die "can't run find: $!";
    print while (<FIND>);
    close FIND;
}

__END__
# Output:
Benchmark: timing 50 iterations of File::Finder module, shell-find pipeline...
File::Finder module: 9 wallclock secs ( 8.44 usr +  0.75 sys =  9.19 CPU) @  5.44/s (n=50)
shell-find pipeline: 2 wallclock secs ( 0.47 usr  0.06 sys +  0.38 cusr  0.50 csys =  1.41 CPU) @ 94.34/s (n=50)
(another update: Just to clarify, I ran the above with a command line like this:
perl test-find.pl some_path | grep -v some_path
so that only the benchmark output went to the terminal, and the time to actually send 100 * 2000 file-names to the screen was not part of the comparison.)

last update: (I promise!) Just to be sure, I tried using different "names" (hash keys) for the two test functions, so that the benchmark would run the shell version first -- just in case there was a "first time through vs. cached" issue when scanning the directory -- and the results came out the same: "find" is many times faster than File::Find.
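On that ordering point: Benchmark's timethese() calls the cases in string-sorted key order, which is why renaming the hash keys changes who runs first. If you want the order to be explicit rather than a side effect of the names, separate timethis() calls do that (placeholder sub bodies stand in for the two real test functions here):

```perl
use strict;
use warnings;
use Benchmark qw(timethis);

# Stand-ins for the two subs benchmarked in the script above.
sub try_pipe   { 1 }
sub try_Finder { 1 }

# timethese() runs cases in sorted-key order; calling timethis()
# once per case makes the run order explicit instead.
timethis( 50, \&try_pipe,   'shell-find pipeline' );
timethis( 50, \&try_Finder, 'File::Finder module' );
```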
