PerlMonks  

Re^5: Finding files recursively

by holli (Abbot)
on Aug 05, 2019 at 08:16 UTC [id://11103911]


in reply to Re^4: Finding files recursively
in thread Finding files recursively

You should probably test on a smaller data set then? Anyway, I'm getting different results, my original code being roughly 55% faster on my single user machine (as expected).

I added a native Perl implementation that walks the tree itself, with no module overhead, and that gains you another significant speed boost.
D:\ENV>perl pm10.pl
Holli (New). Found: 1 ( D:\env\Videos/2012 ) Time: -19
Holli (original). Found: 1 ( d:\env/Videos/2012 ) Time: -32
ovedpo15. Found: 1 ( d:/env/Videos/2012 ) Time: -51
Using this code.
use strict;
use warnings;
use File::Find;
use File::Spec::Functions;
use Cwd qw(abs_path);

sub holli2 {
    my @found;
    my $path   = 'd:\env';
    my $target = "2012.avi";
    myfind( \@found, $path, $target );
    print "Holli (New). Found: ", scalar @found, " ( @found )", "\n";
}

sub myfind {
    my ( $found, $path, $target ) = @_;
    if ( opendir( my $in, $path ) ) {
        while ( my $dir = readdir($in) ) {
            next if $dir =~ /^\.\.?$/;
            my $entry = "$path/$dir";
            if ( -d $entry ) {
                myfind( $found, $entry, $target );
            }
            else {
                push @$found, $path
                    if ( $dir eq $target ) && ( !-e "$path/.ignore" );
            }
        }
        closedir($in);
    }
    else {
        print qq(Skipping "$path": $!\n);
        return;
    }
}

sub holli {
    my @found;
    my $path   = 'd:\env';
    my $target = "2012.avi";
    find(
        sub {
            # We're only interested in directories
            return unless -d $_;
            # Bail if there is an .ignore in the current directory
            return if -e "$_/.ignore";
            # Add to the results if the target is found here
            push @found, $File::Find::name if -e "$_/$target";
        },
        $path
    );
    print "Holli (original). Found: ", scalar @found, " ( @found )", "\n";
}

my $triesOvedpo15 = 0;

sub ovedpo15 {
    my @found;
    find( sub { get_dirs( \@found, $_ ) }, 'd:\env' );
    print "ovedpo15. Found: ", scalar @found, " ( @found )", "\n";
}

sub get_dirs {
    my ( $dirs_aref, $current_path ) = @_;
    $triesOvedpo15++;
    eval {
        my $abs_path    = abs_path($current_path);
        my $file        = $abs_path . "/2012.avi";
        my $ignore_file = $abs_path . "/" . ".ignore";
        push( @{$dirs_aref}, $abs_path )
            if ( -e $file ) && !( -e $ignore_file );
    };
}

my $t = time();
holli2();
print "Time: ", ( $t - time ), "\n";
$t = time();
holli();
print "Time: ", ( $t - time ), "\n";
$t = time();
ovedpo15();
print "Time: ", ( $t - time ), "\n";


holli

You can lead your users to water, but alas, you cannot drown them.

Replies are listed 'Best First'.
Re^6: Finding files recursively
by ovedpo15 (Pilgrim) on Aug 05, 2019 at 17:53 UTC
    Thank you for the good answer. It does reduce the time, but not by much (only ~10 min out of 4 hours), so I'm still hunting for more ideas.
    In the following link: https://stackoverflow.com/questions/2681360/whats-the-fastest-way-to-get-directory-and-subdirs-size-on-unix-using-perl
    Someone suggested:

    I once faced a similar problem, and used a parallelization approach to speed it up. Since you have ~20 top-tier directories, this might be a pretty straightforward approach for you to try. Split your top-tier directories into several groups (how many groups is best is an empirical question), call fork() a few times and analyze directory sizes in the child processes. At the end of the child processes, write out your results to some temporary files. When all the children are done, read the results out of the files and process them.

    Is it possible to show what he means? I thought maybe to implement a smart subroutine that can find big directories that contain subdirectories, use that idea to catch all the valid dirs, and then merge them into one array. Thank you again.
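      A minimal sketch of the fork()-per-group approach described in that quote, assuming the same .ignore semantics as holli's code (the `parallel_find` name and the worker count of 4 are my own illustration, not from the thread): each child scans its share of the top-level directories with File::Find, writes matches to a temp file, and the parent merges the files after the children exit.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Fork one child per group of top-level directories; each child writes
# its matches to a temp file, and the parent merges them at the end.
sub parallel_find {
    my ( $target, $workers, @top_dirs ) = @_;
    my $tmpdir = tempdir( CLEANUP => 1 );

    # Round-robin split of the top-level directories into groups.
    my @groups;
    push @{ $groups[ $_ % $workers ] }, $top_dirs[$_] for 0 .. $#top_dirs;

    my @pids;
    for my $i ( 0 .. $#groups ) {
        next unless $groups[$i];
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ( $pid == 0 ) {    # child: scan its group, write results out
            open my $out, '>', "$tmpdir/result.$i" or die $!;
            find(
                sub {
                    return unless -d $_;            # only directories
                    return if -e "$_/.ignore";      # honor .ignore
                    print {$out} "$File::Find::name\n" if -e "$_/$target";
                },
                @{ $groups[$i] }
            );
            close $out;
            exit 0;
        }
        push @pids, $pid;
    }
    waitpid( $_, 0 ) for @pids;    # parent waits for all children

    # Merge the per-child result files.
    my @found;
    for my $i ( 0 .. $#groups ) {
        open my $in, '<', "$tmpdir/result.$i" or next;
        chomp( my @lines = <$in> );
        push @found, @lines;
    }
    return @found;
}

# Example: split d:/env's ~20 top-level dirs across 4 workers.
my @found = parallel_find( '2012.avi', 4, glob('d:/env/*') );
print "Found: ", scalar @found, " ( @found )\n";
```

      Note that on Windows (where the OP's d:\env lives), perl emulates fork() with interpreter threads, so the speedup there may differ from a Unix box; whether this helps at all depends on whether the disk, not the CPU, is the bottleneck.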
      I would expect more than a 4% speedup. You mentioned other users. Are you running this on some kind of shared network drive? If so, then THAT is your bottleneck. It's hard to say whether parallelization will speed things up without knowing more about the directory structure.


      holli

      You can lead your users to water, but alas, you cannot drown them.
        Tried a few tests; it always returns a 10-15 min difference. We use VNC, so other users also use the machine, but that shouldn't add much of a penalty to the search. Isn't fork() a good idea when we have big directories?
      I am not sure about this idea, but it is an idea to try.
      File::Find calls the "wanted" sub for each "file" that it finds.
      A directory is actually a special kind of a file.

      When File::Find enters a directory, a preprocess sub can be called, for example to sort the order in which the files in that directory are fed to the wanted() sub.

      Perhaps using this preprocess sub may make things faster? I don't know; I've never had to worry about performance at this level.

      All of this File::Find stuff operates on the volume's directory structure. All of that info will quickly become memory resident; the size of the disk and how much data is on it doesn't matter.

      For your application, the number of directories matters. If you know all of the directories, the file system can determine quickly if the .ignore or the target file '2012.avi' exists in that directory or not. That sort of query could potentially be multi-threaded.

      There are ways in which your program can be informed by the O/S when a new directory is created. I suppose that if you know what the result was one hour ago, that might help with the calculation of the current result? The details of your app are a bit unclear to me.

      Anyway, below is an idea to benchmark. I don't know what the result will be.
      Code hasn't been run; it's just an idea.

use strict;
use warnings;
use File::Find;

my @found;
my $target  = '2012.avi';
my %options = (
    preprocess => \&preprocess_dir,
    wanted     => \&wanted,
);
find( \%options, "C:/test" );

sub preprocess_dir {
    # Prune this directory (and everything below it) if .ignore is here
    foreach my $this_name (@_) {
        return () if $this_name eq '.ignore';
    }
    # .ignore wasn't found; record this directory if the target is here
    push @found, $File::Find::dir
        if grep { $_ eq $target } @_;
    # Return the subdirectories so File::Find keeps descending
    return grep { -d "$File::Find::dir/$_" } @_;
}

sub wanted { return; }    # nothing to do here; preprocess does the work
