PerlMonks  

Re^5: Finding files recursively

by holli (Abbot)
on Aug 05, 2019 at 08:16 UTC [id://11103911]


in reply to Re^4: Finding files recursively
in thread Finding files recursively

You should probably test on a smaller data set then? Anyway, I'm getting different results, my original code being roughly 55% faster on my single user machine (as expected).

I added a native Perl implementation that walks the tree itself, with no module overhead, and that gains you another significant speed boost.
D:\ENV>perl pm10.pl
Holli (New). Found: 1 ( D:\env\Videos/2012 ) Time: -19
Holli (original). Found: 1 ( d:\env/Videos/2012 ) Time: -32
ovedpo15. Found: 1 ( d:/env/Videos/2012 ) Time: -51
Using this code.
use strict;
use warnings;
use File::Find;
use File::Spec::Functions;
use Cwd qw(abs_path);

sub holli2 {
    my @found;
    my $path   = 'd:\env';
    my $target = "2012.avi";
    myfind( \@found, $path, $target );
    print "Holli (New). Found: ", scalar @found, " ( @found )", "\n";
}

sub myfind {
    my ( $found, $path, $target ) = @_;
    if ( opendir( my $in, $path ) ) {
        while ( my $dir = readdir($in) ) {
            next if $dir =~ /^\.\.?$/;
            my $entry = "$path/$dir";
            if ( -d $entry ) {
                myfind( $found, $entry, $target );
            }
            else {
                push @$found, $path
                    if ( $dir eq $target ) && ( !-e "$path/.ignore" );
            }
        }
        closedir($in);
    }
    else {
        print qq(Skipping "$path": $!\n);
        return;
    }
}

sub holli {
    my @found;
    my $path   = 'd:\env';
    my $target = "2012.avi";
    find(
        sub {
            # We're only interested in directories
            return unless -d $_;
            # Bail if there is an .ignore in the current directory
            return if -e "$_/.ignore";
            # Add to the results if the target is found here
            push @found, $File::Find::name if -e "$_/$target";
        },
        $path
    );
    print "Holli (original). Found: ", scalar @found, " ( @found )", "\n";
}

my $triesOvedpo15 = 0;

sub ovedpo15 {
    my @found;
    find( sub { get_dirs( \@found, $_ ) }, 'd:\env' );
    print "ovedpo15. Found: ", scalar @found, " ( @found )", "\n";
}

sub get_dirs {
    my ( $dirs_aref, $current_path ) = @_;
    $triesOvedpo15++;
    eval {
        my $abs_path    = abs_path($current_path);
        my $file        = $abs_path . "/2012.avi";
        my $ignore_file = $abs_path . "/" . ".ignore";
        push( @{$dirs_aref}, $abs_path )
            if ( -e $file ) && !( -e $ignore_file );
    };
}

my $t = time();
holli2();
print "Time: ", ( $t - time ), "\n";
$t = time();
holli();
print "Time: ", ( $t - time ), "\n";
$t = time();
ovedpo15();
print "Time: ", ( $t - time ), "\n";


holli

You can lead your users to water, but alas, you cannot drown them.

Replies are listed 'Best First'.
Re^6: Finding files recursively
by ovedpo15 (Pilgrim) on Aug 05, 2019 at 17:53 UTC
    Thank you for the good answer. It does reduce the time, but not by much (only ~10 min out of 4 hours), so I'm still hunting for more ideas.
    In the following link: https://stackoverflow.com/questions/2681360/whats-the-fastest-way-to-get-directory-and-subdirs-size-on-unix-using-perl
    Someone suggested:

    I once faced a similar problem, and used a parallelization approach to speed it up. Since you have ~20 top-tier directories, this might be a pretty straightforward approach for you to try. Split your top-tier directories into several groups (how many groups is best is an empirical question), call fork() a few times and analyze directory sizes in the child processes. At the end of the child processes, write out your results to some temporary files. When all the children are done, read the results out of the files and process them.

    Is it possible to show what he means? I thought maybe to implement a smart subroutine that can find big directories that contain subdirectories, use that idea to catch all the valid dirs, and then merge them into one array. Thank you again.
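      A minimal sketch of the fork()-per-group approach described in that quote, assuming the same .ignore semantics as holli's code (the `parallel_find` name and the worker count of 4 are my own illustration, not from the thread): each child scans its share of the top-level directories with File::Find, writes matches to a temp file, and the parent merges the files after the children exit.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Fork one child per group of top-level directories; each child writes
# its matches to a temp file, and the parent merges them at the end.
sub parallel_find {
    my ( $target, $workers, @top_dirs ) = @_;
    my $tmpdir = tempdir( CLEANUP => 1 );

    # Round-robin split of the top-level directories into groups.
    my @groups;
    push @{ $groups[ $_ % $workers ] }, $top_dirs[$_] for 0 .. $#top_dirs;

    my @pids;
    for my $i ( 0 .. $#groups ) {
        next unless $groups[$i];
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ( $pid == 0 ) {    # child: scan its group, write results out
            open my $out, '>', "$tmpdir/result.$i" or die $!;
            find(
                sub {
                    return unless -d $_;            # only directories
                    return if -e "$_/.ignore";      # honor .ignore
                    print {$out} "$File::Find::name\n" if -e "$_/$target";
                },
                @{ $groups[$i] }
            );
            close $out;
            exit 0;
        }
        push @pids, $pid;
    }
    waitpid( $_, 0 ) for @pids;    # parent waits for all children

    # Merge the per-child result files.
    my @found;
    for my $i ( 0 .. $#groups ) {
        open my $in, '<', "$tmpdir/result.$i" or next;
        chomp( my @lines = <$in> );
        push @found, @lines;
    }
    return @found;
}

# Example: split d:/env's ~20 top-level dirs across 4 workers.
my @found = parallel_find( '2012.avi', 4, glob('d:/env/*') );
print "Found: ", scalar @found, " ( @found )\n";
```

      Note that on Windows (where the OP's d:\env lives), perl emulates fork() with interpreter threads, so the speedup there may differ from a Unix box; whether this helps at all depends on whether the disk, not the CPU, is the bottleneck.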
      I would expect more than a 4% speedup. You mentioned other users. Are you running this on some kind of shared network drive? If so, then THAT is your bottleneck. It's hard to say whether parallelization will speed things up without knowing more about the directory structure.


      holli

      You can lead your users to water, but alas, you cannot drown them.
        Tried a few tests; it always returns a 10-15 min difference. We use VNC, so other users also use the machine, but that shouldn't add much of a penalty to the search. Isn't fork() a good idea when we have big directories?
      I am not sure about this idea, but it is an idea to try.
      File::Find calls the "wanted" sub for each "file" that it finds.
      A directory is actually a special kind of a file.

      When File::Find enters a directory, a preprocess sub can be called, for example to sort the order in which the files in that directory are fed to the wanted() sub.

      Perhaps using this preprocess sub may make things faster? I don't know; I've never had to worry about performance at this level.

      All of this File::Find stuff operates on the volume's directory structure. All of that info will quickly become memory resident; the size of the disk and how much data is on it doesn't matter.

      For your application, the number of directories matters. If you know all of the directories, the file system can determine quickly if the .ignore or the target file '2012.avi' exists in that directory or not. That sort of query could potentially be multi-threaded.

      There are ways in which your program can be informed by the O/S when a new directory is created. I suppose that if you know what the result was one hour ago, that might help with the calculation of the current result? The details of your app are a bit unclear to me.

      Anyway, below is an idea to benchmark. I don't know what the result will be.
      Code hasn't been run; it's just an idea.

use strict;
use warnings;
use File::Find;

my @found;
my $target  = '2012.avi';
my %options = (
    preprocess => \&preprocess_dir,
    wanted     => \&wanted,
);
find( \%options, "C:/test" );

sub preprocess_dir {
    # Prune this directory (and everything below it) if .ignore is here
    foreach my $this_name (@_) {
        return () if $this_name eq '.ignore';
    }
    # .ignore wasn't found; record this directory if the target is here
    push @found, $File::Find::dir
        if grep { $_ eq $target } @_;
    # Return the subdirectories so File::Find keeps descending
    return grep { -d "$File::Find::dir/$_" } @_;
}

sub wanted { return; }    # nothing to do here; preprocess does the work
