
File::Find redux: how to limit hits?

by 914 (Pilgrim)
on Jun 04, 2002 at 00:32 UTC ( #171367=perlquestion )
914 has asked for the wisdom of the Perl Monks concerning the following question:

In this earlier SoPW node, i asked for and received some great advice about File::Find and how to limit its results, similarly to a find . -name foo.txt -maxdepth 3 -mindepth 3

This time around, i need some help figuring out how to make File::Find return (ie, stop searching/producing results) after an arbitrary number of hits.

Something like:

find({ wanted => sub {
        my ($dev,$ino,$mode,$nlink,$uid,$gid);
        my $depth = tr!/!!;   # count slashes to get depth
        return if (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($_))
                   && (int(-M _) > $maxage)
                or (($depth < 3) or ($depth > 3));
        if ($_ =~ /threads\.html\z/s) {
            push @files, $_;
        }
    }, no_chdir => 1 }, '.', $MAXHITS );
Where $MAXHITS is set to the maximum number of results i need. (Other than that extra arg, this is the exact find() call i'm using.)

The story is this... i'm using this function to find all the files necessary to recreate an index file at, and to actually create that file.

My current script (on my scratchpad) works for this quite well (i'll gladly provide a tarball of test filestructure to work with, just ask), without fail.

What i'm concerned about is that my test files/filestructure wasn't too big, but some of the 'live' directories this may be working on could have literally thousands of files (tens of thousands, maybe 100 thousand), in a very complicated directory tree. (i know, i know... but when we wrote the CGI that runs the BBS (in C, even... heretics i know!) the filesystem-as-ersatz-database seemed OK.)

This introduces a huge time lag as the find function walks the tree. Since the files are invariably 'found' in the order of last-created (i'm not sure why, but it's so..) it's quite safe for me to say "stop finding once you've found n hits" since it's sure that those n files are the most recent n files, and are the ones i'm interested in.

Is there a way to do this? Does it involve hacking the module? i'm willing to give that a shot... but if there's a better non-wheel-reinventing solution...

And, if this *does* involve modifying the module, can/should/how do i post the changes so that others can use it?

as always, thanks!

Replies are listed 'Best First'.
Re: File::Find redux: how to limit hits?
by Kanji (Parson) on Jun 04, 2002 at 01:09 UTC

    There's probably a more graceful way of doing this, but an eval/die combo should work...

    my $MAXHITS = 100;

    eval { find( \&wanted => $dir ) };
    die $@ if $@ && $@ ne "Limit reached\n";

    {
        my $hit_no = 0;   # start at 0 so the printed count and the limit line up
        sub wanted {
            die "Limit reached\n" if ++$hit_no > $MAXHITS;
            printf "%03d %s\n", $hit_no, $File::Find::name;
        }
    }


      The only way I can see to do it, if you want to avoid the eval/die combo, is to use $File::Find::prune. This appears to do the trick:

      #! /usr/bin/perl -w

      use strict;
      use File::Find;

      my @hits = ();
      my $hit_lim = shift || 20;

      find(
          sub {
              if( scalar @hits >= $hit_lim ) {
                  $File::Find::prune = 1;
                  return;
              }
              elsif( -d $_ ) {
                  return;
              }
              push @hits, $File::Find::name;
          },
          shift || '.'
      );

      $, = "\n";
      print @hits, "\n";

      The only problem is that you don't have much control over the order in which File::Find descends through the various directories. (Hint: it is not alphabetical).
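      If the visit order matters, one way to pin it down is to sort each directory's entries in the preprocess hook (the hook needs a reasonably recent File::Find). The following is a sketch, not code from this thread; find_newest_first is a made-up helper name, and it sorts newest-mtime-first, the order 914 is relying on:

```perl
use strict;
use warnings;
use File::Find;

# Sketch (hypothetical helper, not from the thread): return every
# non-directory file under $root, visiting each directory's entries
# newest-mtime-first by sorting them in the preprocess hook.
sub find_newest_first {
    my ($root) = @_;
    my @hits;
    find(
        {
            preprocess => sub {
                # cwd is the directory being read here, so bare names
                # can be lstat()ed directly; field 9 is mtime
                sort { (lstat($b))[9] <=> (lstat($a))[9] } @_;
            },
            wanted => sub {
                push @hits, $File::Find::name unless -d $_;
            },
        },
        $root
    );
    return @hits;
}
```

      Note this only fixes the order within each directory; find() still walks depth-first, so it is not a global most-recent-first ordering.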

      print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u'
Re: File::Find redux: how to limit hits?
by graff (Chancellor) on Jun 04, 2002 at 07:31 UTC
    Well, I don't know if anyone else has noticed this (maybe my mileage is way different) but when I tried timing a simple File::Find approach (as provided by "find2perl") against the equivalent back-tick `find ...` command on my linux box, I got a 5-to-1 wallclock ratio:
    use strict;
    use Benchmark;
    use File::Find ();

    # this part is unnecessary, but find2perl made it up, so I copied it:
    use vars qw/*name *dir *prune/;
    *name  = *File::Find::name;
    *dir   = *File::Find::dir;
    *prune = *File::Find::prune;

    timethese( 10, {
        'File::Find' => \&find2perl,
        'Shell:Find' => \&shellfind,
    });

    my @found;   # I'm not using this for anything at present

    sub find2perl {
        @found = ();
        File::Find::find({wanted => \&wanted}, '.');   # made by find2perl
    }

    sub wanted {   # made by find2perl
        my ($dev,$ino,$mode,$nlink,$uid,$gid);
        (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($_)) &&
            -d _ &&
            push @found, $_;
    }

    sub shellfind {
        @found = `find . -type d`;
    }

    __END__
    # OUTPUT:
    Benchmark: timing 10 iterations of File::Find, Shell:Find...
    File::Find: 27 wallclock secs (20.87 usr  4.01 sys +  0.01 cusr  0.00 csys = 24.89 CPU) @  0.40/s (n=10)
    Shell:Find:  5 wallclock secs ( 0.21 usr  0.01 sys +  1.28 cusr  3.36 csys =  4.86 CPU) @ 45.45/s (n=10)

    # I printed scalar(@found) in one test, and these results
    # were obtained where there were over 6K directories under "."

    So, maybe the first thing to try for speeding things up is:

    # don't use File::Find;
    Apart from that, have you considered an alternative like this:
    • run a separate process to create and maintain a set of file name listings for each path; the first time you run this, it'll take a long time, but thereafter, it only needs to find the directories that were created/modified since the last run, then for just those paths, diff the current file inventory against the previous file name list, and write the set of new files to a separate log file. (one approach is given below)
    • adapt your re-indexing job so that it works from the log of new files, and it doesn't use find at all.

    Of course, you can put the two steps together into a single script.
    #!/usr/bin/perl

    # Program: find-new-files.perl
    # Purpose: initialize and maintain a record of files in a
    #          directory tree
    # Written by: dave graff

    # If a file called "paths.logged" does not exist in the cwd, we create
    # one, and treat all contents under cwd as "new".  If "paths.logged"
    # already exists, we find directories with modification dates more
    # recent than this file, and treat only these as "new".

    # For each "new" directory, assume a file.manifest is there (create an
    # empty one if there isn't one), and diff that file against the current
    # inventory of data files, storing all new files to an array.

    # Of course, this will fail in all paths where the current user does
    # not have write permission, but such paths can be avoided by adding
    # a suitable condition to the first "find" command.

    use strict;

    my $path_log = "paths.logged";
    my ($list_name,$new_list) = ("file.manifest","new.manifest");
    my $new_flag = ( -e $path_log ) ? "-newer $path_log" : "";

    my @new_dirs = `find . -type d $new_flag`;
    # add "-user uname" and/or "-group gname" to avoid directories where
    # the current user might not have write permission

    my $diff_cmd = "cd 'THISPATH' && touch $list_name && ".
        "find . -type f -maxdepth 1 | tee $new_list | diff - $list_name | grep '<'";

    # the shell functions in $diff_cmd will:
    #  - chdir to a given path,
    #  - create file.manifest there if it does not yet exist,
    #  - find data files in that path (not subdirs, not files in subdirs),
    #  - create a "new.manifest" file containing this current file list,
    #  - diff the new list of files against the existing file.manifest,
    #  - return only current files not found in the existing manifest.
    # since it's a sub-shell, the chdir is forgotten when the sub-shell is done.

    open( OUT, ">new-file-path.list" );
    foreach my $path ( @new_dirs ) {
        chomp $path;
        my $cmd = $diff_cmd;
        $cmd =~ s{THISPATH}{$path}g;

        # the output of the shell command needs to be conditioned to have the
        # path string prepended to each file name (we can leave the new-line
        # in place at the end of the name):
        print OUT join "", map { s{^< \.(.*)}{$path$1}; $_ } ( `$cmd` );

        # replace the old manifest:
        rename "$path/$new_list", "$path/$list_name"
            or warn "failed to update $path/$list_name\n";
    }
    close OUT;
    `touch $path_log`;

    update: fixed some commentary in the second program; realized that <br> is required after </ul>.
      Wow! graff, that's some help, if you just 'whipped up' that script in response to my problem...

      Happily, it'll be here at PM for everyone else to find too... i'm not sure it's applicable in this exact problem (though it'll be useful to me in another area... thanks! i hadn't even asked that question yet!), as i need the code to NOT depend on the local find command. I'd really prefer this script to be portable across Solaris and Linux systems, with differing find commands.

      i'll be "flattering" your timing script too, optimizing this thing with whichever technique seems best.

Re: File::Find redux: how to limit hits?
by crazyinsomniac (Prior) on Jun 04, 2002 at 01:12 UTC
    use the 'preprocess' option. Your other option is to mess with signals (see perlsig, %SIG)
    #!/usr/bin/perl -w

    use strict;
    use File::Find;

    # scrap codeee
    find(
        {
            wanted     => \&wanted,
    #        preprocess => \&preprocess,
        },
        '/'
    );

    BEGIN {
        use vars qw( $maxHit $Hit );
        $maxHit = 10;
    }

    BEGIN {
        $SIG{__DIE__} = sub {
            my @d = @_;
            if( $d[-1] eq "dying max $maxHit" ) {
                warn "hit the max $maxHit";
                return();
            } else {
                warn(@_);
                exit(1);
            }
        };
    }

    sub wanted {
        my $file = $_;
        print "$file\n";
        $Hit++;
        die "dying max $maxHit" if( $Hit >= $maxHit );
    }

    ## something like this
    sub preprocess {
        my @list = @_;
        my $diff = $maxHit - $Hit;
        warn "pre processing( $Hit, $maxHit, " . scalar(@list) . " )";
        if( $Hit > $maxHit ) {
            return ();
        } elsif( @list > $diff ) {
            warn "splicing";
            use Data::Dumper;
            @list = splice(@list, 0, $diff);
            warn Dumper \@list;
        }
        return @list;
    }
    update: you better stick with the eval (see Kanji's node above). I've been too signal happy lately ;)

    update: please note that the preprocess option is not available everywhere (best upgrade File::Find)

    I also feel that there ought to exist File::Find::breakout, which would basically stop find from executing anymore (return? ;)

    UPDATE: after hastily posting this, which i still think is good idea, a couple of smart asses point out goto. Yeah well i'm not 0 for 2 so far ;)(dammit! sometimes solutions you know but don't love escape you)

    Of all the things I've lost, I miss my mind the most.
    perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"


      The preprocess mention is great - because now I can propose an addition to fix what had bugged me about 914's current implementation of the min/maxdepth behaviour: the fact that the files have to be looked at, if for no other reason than to discard them. preprocess lets one avoid that:

      sub prep {
          my $depth = $File::Find::dir =~ tr[/][];
          return if $depth > $max_depth;
          return grep -d "$File::Find::dir/$_", @_ if $depth < $min_depth;
          @_;
      }

      This way, in directories below the mindepth, nothing other than directories is even looked at. Also, further recursion down the current branch of the directory tree is aborted immediately upon entering a directory deeper than maxdepth.

      The mindepth test in wanted() is still necessary because the directories below mindepth will have to be processed; the maxdepth test there is superfluous.

      I'm also quite confident that this will cut the runtime down far enough that the maxhits kludges are unnecessary.
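      To make the division of labour concrete, here is a sketch of prep() paired with such a wanted(). It assumes no_chdir => 1, as in 914's original call; the depth bounds, the filename pattern, and the collect() helper are illustrative, not from the thread:

```perl
use strict;
use warnings;
use File::Find;

# Sketch: prep() as above, paired with a wanted() that keeps only the
# mindepth test.  Assumes no_chdir => 1 and a relative starting path;
# depth bounds, pattern, and collect() are illustrative names.
my ($min_depth, $max_depth) = (3, 3);
my @files;

sub prep {
    my $depth = $File::Find::dir =~ tr[/][];
    return if $depth > $max_depth;                 # too deep: prune here
    return grep -d "$File::Find::dir/$_", @_      # too shallow: dirs only
        if $depth < $min_depth;
    @_;                                           # in range: everything
}

sub wanted {
    my $depth = tr!/!!;              # slashes in $_ (full path under no_chdir)
    return if $depth < $min_depth;   # still needed; a maxdepth test is not
    push @files, $_ if /threads\.html\z/;
}

sub collect {
    my ($start) = @_;
    @files = ();
    find({ preprocess => \&prep, wanted => \&wanted, no_chdir => 1 }, $start);
    return @files;
}
```

      Below mindepth, wanted() only ever sees directory names, so its early return is the whole of the work done there; above maxdepth, prep() returns an empty list and the subtree is never entered.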

      Makeshifts last the longest.

      That's exactly what i meant.... it (a 'breakout' option) would seem to be a very useful thing to have..

      Though it seems (from that list traffic you linked) that it's not typical for some duffer (read: me) to just modify a module as storied and widespread as File::Find.

      Anyhow, it looks to me as though the $File::Find::prune way mentioned by grinder might be the way to go...

      i'm pretty sure that the files i want will always be found first, since they're the most recent ones. Every time i've run this, the output array fills up in most-recentness order, on linux and solaris both. Clearly it's not alphabetical (as someone mentioned), and in fact i'm depending on this characteristic in another part of the script (there's a way to make it more robust, i'll do that later).

      i'm not sure i totally understand your code, but i'll study it when i get home and can play.... thanks!

Re: File::Find redux: how to limit hits?
by belg4mit (Prior) on Jun 04, 2002 at 02:39 UTC
    goto :-)

    perl -pew "s/\b;([mnst])/'$1/g"

Re: File::Find redux: how to limit hits?
by jplindstrom (Monsignor) on Jun 04, 2002 at 17:42 UTC
    The eval/die works fine, but you should note that it will mess up your current working directory unless you specify no_chdir => 1 in your call to find(), and then you have to use $File::Find::name to access the file. This works for me (it really should be improved to die with a particular text, like this post: Re: File::Find redux: how to limit hits?):

    =head2 raDataFileFind($dir, [$noMax = 0])

    Return array ref with relative file names from the $dir directory.

    $noMax -- The maximum number of files returned (0 means no limit).

    Return [] on errors.

    =cut
    sub raDataFileFind {
        my ($dir, $noMax) = @_;
        $noMax ||= 0;

        my $no = 0;
        my @aFile;
        eval {
            find(
                {
                    wanted => sub {
                        if( /\.txt(\.gz)?$/ ) {
                            die() if( $noMax && ($no++ >= $noMax) );
                            push(@aFile, $File::Find::name);
                        }
                    },
                    no_chdir => 1,
                },
                "$dir/"
            );
        };
        return(\@aFile);
    }

    Does anyone know why the author of File::Find elected to make it the default action to chdir into the subdirectories? Is there any benefit of doing that, or is it an arbitrary choice?
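    Whatever the original reason, the practical difference is easy to demonstrate. In the following sketch (compare_modes is a made-up helper, not part of File::Find), the default mode hands wanted() just the basename in $_, so file tests like -d never have to re-walk a long path; no_chdir => 1 leaves the cwd alone and puts the full path in $_, the same value as $File::Find::name:

```perl
use strict;
use warnings;
use File::Find;

# Sketch (compare_modes is a hypothetical name): collect non-directory
# entries under $root in both modes, to show what $_ contains in each.
sub compare_modes {
    my ($root) = @_;
    my (@default_mode, @no_chdir_mode);

    # default: find() chdirs into each directory, $_ is the basename
    find( sub { push @default_mode, $_ unless -d }, $root );

    # no_chdir: cwd untouched, $_ is the full path ($File::Find::name)
    find(
        {
            wanted   => sub { push @no_chdir_mode, $_ unless -d },
            no_chdir => 1,
        },
        $root
    );
    return (\@default_mode, \@no_chdir_mode);
}
```

    So the chdir default looks like an optimization for the common case of many cheap file tests per directory, at the cost of surprising any code (or any die!) that assumes the cwd is stable.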


Re: File::Find redux: how to limit hits?
by Aristotle (Chancellor) on Jun 04, 2002 at 01:12 UTC
    You could die unless $max_hits--; in your wanted(). You only have to wrap the find() in an eval to catch the exception so that it doesn't actually abort the script. I'm not sure you're not making a dangerous assumption in that the files you want will be there by the time you die, though.

    Makeshifts last the longest.

Node Type: perlquestion [id://171367]
Approved by rob_au