PerlMonks  

Searching for files efficiently help!

by Anonymous Monk
on Nov 16, 2011 at 15:31 UTC ( #938396=perlquestion )
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!
Each value in the array @files is a file name with its path attached, like /var/www/loc/file/subdir/1234567_bc_20101000.txt or /var/www/loc/file/tree/text/1234567_bc_20101005.txt.
I need to match these against the file names in the array @filelist, report the matches, and eventually delete the matching files from $startdir.
What I have works, but my question is about efficiency and speed, since the array @filelist can be really large and the directory structure in $startdir is really large and complex.
Is there a better way to accomplish this than what I have here?
I am wondering about the nested foreach loops in the code. Any suggestions would be very helpful.

#!/usr/bin/perl -w
use strict;
use File::Find::Rule;

my $startdir = "/allfiles";
my @filelist = qw(1234567_bc_20101000.txt 99877_xy_20111111.txt); #for testing

my $includeFiles = File::Find::Rule->file
                                   ->name('*.txt'); # search by file extensions
my @files = File::Find::Rule->or( $includeFiles )
                            ->in($startdir);

#locate only txt files in the starting directory
my $includeFiles = File::Find::Rule->file
                                   ->name('*.txt'); # search by file extensions
my @files = File::Find::Rule->or( $includeFiles )
                            ->in($startdir);

my $f_name;
my $c=0;
foreach my $files(@files) {
    $c++;
    if($files=~/(.*?)\/([^\/]+)$/) {
        $f_name = $2;
    }
    foreach my $chk_file(@filelist) {
        if($chk_file=~/$f_name/) {
            print "$c - Found and to be deleted: $files\n";
        }
    }
}

Thank you very much for looking!!!

Re: Searching for files efficiently help!
by jethro (Monsignor) on Nov 16, 2011 at 16:06 UTC

    Put the file names from @filelist into a hash. Then you can throw away that last foreach loop and instead simply have this:

    if (exists $filelist{$f_name}) {
        print "$c - Found and to be deleted: $files\n";
    }

    The hash key in %filelist would be the file name. The hash value is unimportant, so you may use 1, or even store further information about the file there.
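To make the hash suggestion above concrete, here is a minimal sketch that works purely on in-memory lists, with no directory access. The helper name find_wanted is hypothetical (it is not from the thread), and the sample paths are the ones quoted in the question. The point is that the hash is built once, so each path costs one lookup instead of a pass over all of @filelist.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the paths whose file
# name, i.e. the part after the last slash, appears in the list of names.
sub find_wanted {
    my ($paths, $names) = @_;
    my %wanted = map { $_ => 1 } @$names;     # name => 1, built once
    my @found;
    for my $path (@$paths) {
        my ($name) = $path =~ m{([^/]+)\z};   # text after the last slash
        next unless defined $name;
        push @found, $path if exists $wanted{$name};   # single hash lookup
    }
    return @found;
}

# Sample data taken from the question.
my @paths = qw(
    /var/www/loc/file/subdir/1234567_bc_20101000.txt
    /var/www/loc/file/tree/text/1234567_bc_20101005.txt
);
my @filelist = qw(1234567_bc_20101000.txt 99877_xy_20111111.txt);

print "Found and to be deleted: $_\n" for find_wanted(\@paths, \@filelist);
```

Unlike the regex test in the original loop, the hash lookup is an exact comparison of the whole file name, so a dot in a name is not treated as a regex wildcard.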

      Do you mean something like this?
      #!/usr/bin/perl -w
      use strict;
      use File::Find::Rule;

      my $startdir = "/allfiles";
      my @filelist = qw(1234567_bc_20101000.txt 99877_xy_20111111.txt); #for testing
      my %filelist = map {$_, 1} @filelist;

      my $includeFiles = File::Find::Rule->file
                                         ->name('*.txt'); # search by file extensions
      my @files = File::Find::Rule->or( $includeFiles )
                                  ->in($startdir);

      #locate only txt files in the starting directory
      my $includeFiles = File::Find::Rule->file
                                         ->name('*.txt'); # search by file extensions
      my @files = File::Find::Rule->or( $includeFiles )
                                  ->in($startdir);

      my $f_name;
      my $c=0;
      foreach my $files(@files) {
          $c++;
          if($files=~/(.*?)\/([^\/]+)$/) {
              $f_name = $2;
          }
          if (exists $filelist{$f_name}) {
              print "$c - Found and to be deleted: $files\n";
          }
      }

      But when I print the results I am getting all files, not only the ones that were found!
      Thanks!

        Can't confirm your observation. I tested this script and it worked as expected, after throwing away the duplicate lines "my $includeFiles = ..." and "my @files = ..".

      The directory where the search will start has about 10GB of files in it, do you think that this code will be efficient enough to handle such a directory size?

        There would seem to be two ways you can do this (but see correction below):

        1. Go through your array, deleting each file name if it exists as a file, or
        2. Go through your directory structure, checking each filename/path against your array (after turning it into a hash), and deleting it if it exists in the hash.

        Generally, I would prefer the first method. It's almost certain to be faster to go through a list of files and check for their existence than to traverse an entire directory structure and check every file against a list. If you simply go through your array, checking for the existence of each pathname and deleting if it's found, then it doesn't matter how large or complex your directory structure is.

        for my $file (@files){
            if( -f $file ){
                report($file);  # however you want to report a match
                if( unlink $file ){
                    print "Deleted $file\n";
                } else {
                    warn "Unable to delete $file\n";
                }
            }
        }

        Correction: As Jethro pointed out, I misunderstood the original requirements, getting the two arrays he mentioned mixed up. The array he wants to check the files against does not have full path names, so my solution won't work. He will have to recurse through the directory structure and check them one by one.

        Aaron B.
        My Woefully Neglected Blog, where I occasionally mention Perl.
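The directory walk described in the correction above can be sketched with the core File::Find module, checking each file name against the hash as it is visited. This is only an illustration under stated assumptions: the helper name matching_files is hypothetical, and the demo runs against a throwaway temporary tree standing in for the real $startdir, so nothing outside that tree is touched.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);
use File::Path qw(make_path);

# Hypothetical helper (not from the thread): walk $startdir once and return
# the full paths of plain files whose bare name is a key in %$wanted.
sub matching_files {
    my ($startdir, $wanted) = @_;
    my @found;
    find(sub {
        # Inside the callback, $_ is the bare file name and
        # $File::Find::name is the full path.
        push @found, $File::Find::name if -f $_ && exists $wanted->{$_};
    }, $startdir);
    my @sorted = sort @found;
    return @sorted;
}

# Demo tree: a temporary directory used in place of the real $startdir.
my $startdir = tempdir(CLEANUP => 1);
make_path("$startdir/subdir", "$startdir/tree/text");
for my $rel (qw(subdir/1234567_bc_20101000.txt tree/text/1234567_bc_20101005.txt)) {
    open my $fh, '>', "$startdir/$rel" or die "Cannot create $rel: $!";
    close $fh;
}

my %filelist = map { $_ => 1 } qw(1234567_bc_20101000.txt 99877_xy_20111111.txt);

for my $path (matching_files($startdir, \%filelist)) {
    print "Found and to be deleted: $path\n";
    unlink $path or warn "Unable to delete $path: $!\n";
}
```

Collecting the matches first and deleting afterwards keeps the deletion out of the traversal callback, which makes it easy to run in report-only mode by commenting out the unlink line.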

Re: Searching for files efficiently help!
by hbm (Hermit) on Nov 16, 2011 at 16:48 UTC

    Definitely what jethro says.

    A minor note - your $files regex stores two values but you only use one. And if you change the delimiter, you don't have to escape the slash. So, instead of:

    if($files=~/(.*?)\/([^\/]+)$/) {
        $f_name = $2;
    }

    Either:

    # match all non-slashes up to end of line
    if($files=~m|([^/]+)$|) {
        $f_name = $1;
    }

    Or:

    # match everything beyond last slash
    if($files=~m|.*/(.+)|) {
        $f_name = $1;
    }
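As an alternative to the hand-written patterns above (swapped in here, not something suggested in the thread), the core File::Basename module can split off the file name. A short sketch using the sample paths from the question:

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Sample paths taken from the question.
my @files = qw(
    /var/www/loc/file/subdir/1234567_bc_20101000.txt
    /var/www/loc/file/tree/text/1234567_bc_20101005.txt
);

for my $path (@files) {
    my $f_name = basename($path);   # drops the directory part of the path
    print "$f_name\n";
}
```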
      I have a question for you. Suppose the values in the array @filelist were like this: my @filelist = qw(88732 99877 76211); and I had to check whether these values are part of the file names in the array @files. What would the regular expression be in that case? In other words, I would be matching values from @filelist against values in the array @files, which would look like this: my @files = qw(99877_bc_20101000.txt 99877_xy_20111111.txt 76211_bc_20101000.txt);
      I hope I was clear on that. Thanks!
        In this case I would put back the second foreach loop:
        foreach my $chk_file(@filelist) {
            #if($chk_file=~/$f_name/) {
            if($f_name=~/$chk_file/) {
                print "$c - Found and to be deleted: $files\n";
            }
        }
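If the inner loop above becomes slow for a large @filelist, one possible variation (an assumption of this sketch, not something proposed in the thread) is to join all the values into a single alternation, compile it once with qr//, and test each file name against it. quotemeta escapes any regex metacharacters in the values. The fourth file name below is invented as a non-matching example.

```perl
use strict;
use warnings;

my @filelist = qw(88732 99877 76211);
my @files    = qw(
    99877_bc_20101000.txt
    99877_xy_20111111.txt
    76211_bc_20101000.txt
    12345_bc_20101000.txt
);

# Build one pattern from all values, compiled a single time.
my $alternation = join '|', map { quotemeta($_) } @filelist;
my $any_id      = qr/(?:$alternation)/;

# Keep every file name that contains any of the values.
my @matches = grep { $_ =~ $any_id } @files;
print "Found and to be deleted: $_\n" for @matches;
```

Like the loop it replaces, this matches a value anywhere in the file name, so 99877 would also match a name such as 1_bc_20199877.txt. Anchoring the pattern with ^ and a trailing underscore would restrict it to the leading field.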

        If I understand, you can simply narrow your regex to match up to the first underscore, and see if that 'exists':

        foreach my $files(@files) {
            $c++;
            if($files=~m|.*/([^_]+)|) {
                $f_name = $1;
            }
            if (exists $filelist{$f_name}) {
                print "$c - Found and to be deleted: $files\n";
            }
        }

        I might go as far as:

        for (@files) {
            $c++;
            if(m|.*/([^_]+)| && exists $filelist{$1} ) {
                print "$c - Found and to be deleted: $_\n";
            }
        }
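For completeness, here is a self-contained version of the prefix lookup above that can be run without a directory tree. The paths under /allfiles are made up for illustration, and the pattern is a slightly stricter variant (an assumption of this sketch): it requires the captured field to sit directly after the last slash and to be followed by an underscore.

```perl
use strict;
use warnings;

my %filelist = map { $_ => 1 } qw(88732 99877 76211);

# Hypothetical full paths, as File::Find::Rule would return them.
my @files = qw(
    /allfiles/a/99877_bc_20101000.txt
    /allfiles/b/99877_xy_20111111.txt
    /allfiles/c/76211_bc_20101000.txt
    /allfiles/c/11111_bc_20101000.txt
);

my @matches;
for my $path (@files) {
    # Capture the field between the last slash and the first underscore
    # of the file name, then look it up in the hash.
    if ($path =~ m{/([^/_]+)_[^/]*\z} && exists $filelist{$1}) {
        push @matches, $path;
        print "Found and to be deleted: $path\n";
    }
}
```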
