http://www.perlmonks.org?node_id=951304

gdanenb has asked for the wisdom of the Perl Monks concerning the following question:

I need to optimize the performance of a script that is supposed to scan a filesystem and collect info on *.msg files.

I made the script recursive, but I'm not sure that this is the best way to implement it.

Additionally, I look for the top (largest) mailboxes by sorting on hash keys.

The directory structure is /test_vol/0/00/34567@test.com/1_sdr/34567.msg, where the first level ranges over 0-9, the second over 00-99, and the number of mail accounts is huge.

The code is:
sub ScanDirectory {
    my ($workdir) = @_;
    my ($startdir) = &cwd;    # keep track of where we began (cwd() comes from the Cwd module)

    chdir($workdir) or die "Unable to enter dir $workdir: $!\n";
    opendir(DIR, ".") or die "Unable to open $workdir: $!\n";
    my @names = readdir(DIR);
    closedir(DIR);

    foreach my $name (@names) {
        next if ($name eq ".");
        next if ($name eq "..");
        next if ($name =~ /\.dat$|\.mdb$|\.snapshot/);

        if ( -d $name ) {
            # a directory named like 12345@domain starts a new mailbox
            if ( $name =~ /^\d+\@\w/ ) {
                $all_mailbox_count++;
                $box_size = 0;
            }
            &ScanDirectory($name);
            next;
        }

        if ( $name =~ /\.msg$/ ) {
            my $msg_size = (stat($name))[7];    # size in bytes
            if ( $msg_size < 4096 ) {
                $box_size += 4096;              # count at least one 4 KB block per message
            }
            else {
                $box_size += $msg_size;
            }
        }
    }

    if ( $workdir =~ /(\d+)\@/ ) {
        $msisdn = $1;
        $all_mailbox_size += $box_size;
        if ( $box_size == 0 ) {
            $empty_mailbox++;
        }
        else {
            &top_size_mailbox($msisdn, $box_size);
        }
    }

    chdir($startdir) or die "Unable to change to dir $startdir: $!\n";
}

sub top_size_mailbox {
    my ($msisdn, $box_size) = @_;
    if ( keys(%top_size_mailbox) < $num_top_size_box ) {
        $top_size_mailbox{$box_size} = $msisdn;
    }
    else {
        my $min = (sort { $a <=> $b } keys %top_size_mailbox)[0];
        if ( $box_size > $min ) {
            delete $top_size_mailbox{$min};
            $top_size_mailbox{$box_size} = $msisdn;
        }
    }
}

Replies are listed 'Best First'.
Re: Optimizing performance for script to traverse on filesystem
by kejohm (Hermit) on Feb 01, 2012 at 21:37 UTC

    Rather than rolling your own code, you should probably use a module like File::Find. Here is an example that should do what you're trying to do (untested):

    #!perl
    use 5.012;
    use File::Find;
    use List::Util qw(min);

    my $workdir = ...;    # fill in the top of the mail tree

    my %top_size_mailbox;
    my $box_size;
    my $all_mailbox_size  = 0;
    my $all_mailbox_count = 0;
    my $empty_mailbox     = 0;
    my $num_top_size_box  = 10;    # e.g. keep the 10 largest mailboxes (adjust as needed)

    find(
        {
            wanted => sub {
                return if m/^\.+$/;
                return if m/\.(dat|mdb|snapshot)$/;
                if ( -d and m/^\d+\@\w/ ) {
                    $all_mailbox_count++;
                    $box_size = 0;
                }
                elsif ( m/\.msg$/ ) {
                    my $msg_size = -s _;    # reuse the stat info from the -d test above
                    if ( $msg_size < 4096 ) {
                        $box_size += 4096;
                    }
                    else {
                        $box_size += $msg_size;
                    }
                }
            },
            postprocess => sub {
                if ( $File::Find::dir =~ m/(\d+)\@/ ) {
                    my $msisdn = $1;
                    $all_mailbox_size += $box_size;
                    if ( $box_size == 0 ) {
                        $empty_mailbox++;
                    }
                    else {
                        top_size_mailbox( $msisdn, $box_size );
                    }
                }
            },
        },
        $workdir,
    );

    sub top_size_mailbox {
        my ( $msisdn, $box_size ) = @_;
        if ( keys(%top_size_mailbox) < $num_top_size_box ) {
            $top_size_mailbox{$box_size} = $msisdn;
        }
        else {
            my $min = min( keys %top_size_mailbox );
            if ( $box_size > $min ) {
                delete $top_size_mailbox{$min};
                $top_size_mailbox{$box_size} = $msisdn;
            }
        }
    }

    __END__

    There are other similar modules like File::Find::Rule that you could also try.
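
    For what it's worth, here's a rough File::Find::Rule sketch of the file-collection part (untested; $workdir is the same starting directory as above, and the per-mailbox accounting would still be done as in the other examples):

    use File::Find::Rule;

    my @msg_files = File::Find::Rule->file
                                    ->name('*.msg')
                                    ->in($workdir);

    for my $path (@msg_files) {
        my $size = -s $path;
        $size = 4096 if $size < 4096;    # same 4 KB minimum as the OP's code
        # extract the mailbox number from $path and add $size to its running total
    }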

Re: Optimizing performance for script to traverse on filesystem
by graff (Chancellor) on Feb 02, 2012 at 03:41 UTC
    I'll be the devil's advocate and suggest that doing your own recursive solution for traversing a directory tree can save some run time. If you have directories with outrageous quantities of files (e.g. more than 100K files/directory), then a minimal opendir/readdir recursion can even save time over unix/linux "find".

    The OP's code might be a little more streamlined at run time by using

    while ( my $name = readdir( DIR )) { ... }
    instead of loading all directory entries into an array.

    In case it helps, here's a similar "hand-rolled" recursive traversal script: Get useful info about a directory tree -- it produces different results from what you want, but the basic recursion part is pretty much the same as yours. I even benchmarked it against a File::Find approach, which took noticeably longer to run, possibly due to the number of subroutine calls per directory entry that File::Find does.

      I guess that I'm the "devil's advocate" to the "devil's advocate"?

      re: File::Find - I think that we could cooperate and possibly increase its internal performance (I'm game for that), but the interface is "spot on" - it works!

      My suggested modifications to the OP's code represent a massive simplification of the program logic.

      There is only one file system operation that happens per $File::Find::name. Maybe File::Find does some more "under the covers"? I'm not sure what you are proposing... But basically, I see no problem with code that makes a single decision based upon a single input.

      I'm game to increase the performance of File::Find - are you willing to help me do it?
      I think that will be a pretty hard undertaking.
      I'm not sure that it is even possible.
      But if it is, let's go for it!

        Thank you for the invitation. Actually, it might be a worthwhile first step just to make sure my assertion isn't based on faulty evidence. If you get a chance to check out the benchmark in the thread I cited above (specifically at this node: Re^2: Get useful info about a directory tree), it's entirely possible that the timing results there are reflecting something other than a difference between File::Find and straight recursion with opendir/readdir.

        (I've seen enough benchmark discussions here at the monastery to know that a proper benchmark can be an elusive creature.)

        If that benchmark happens to be a valid comparison of the two approaches, it would also be a good exercise for a debugger or profiler session, to see what's causing the difference.

        In any case, I definitely don't want to dissuade people from using File::Find or its various derivatives and convenience wrappers -- they do make for much easier solutions to the basic problem, and in the vast majority of cases, a little extra run time is a complete non-issue. (It's just that I've had to face a few edge cases where improving run time when traversing insanely large directories made a big difference.)
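
        If anyone wants to rerun such a comparison from scratch, a minimal sketch using the core Benchmark module might look like the following (untested; both subs just count plain files, one via File::Find and one via a bare opendir/readdir recursion, so they stand in for the two approaches rather than reproducing the code from the cited thread):

        use strict;
        use warnings;
        use Benchmark qw(cmpthese);
        use File::Find;

        my $top = shift // '.';    # tree to walk (defaults to the current directory)

        sub walk_with_find {
            my $n = 0;
            find( sub { $n++ if -f }, $top );
            return $n;
        }

        sub walk_by_hand {
            my ($dir) = @_;
            my $n = 0;
            opendir( my $dh, $dir ) or return 0;
            for my $name ( grep { $_ ne '.' && $_ ne '..' } readdir $dh ) {
                my $path = "$dir/$name";
                if    ( -d $path ) { $n += walk_by_hand($path) }
                elsif ( -f _ )     { $n++ }    # reuse the stat buffer from -d
            }
            closedir $dh;
            return $n;
        }

        # run each traversal for at least 10 CPU seconds and compare rates
        cmpthese( -10, {
            'File::Find'  => sub { walk_with_find() },
            'hand-rolled' => sub { walk_by_hand($top) },
        } );

        (Filesystem caching will skew the first pass, so it's worth a warm-up run before trusting the numbers.)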

      If I use

      while ( my $name = readdir( DIR )) { ... }
      I have to leave DIR open while walking deeper in the recursion.
      Only when all directories at that level have been scanned can I closedir(DIR).
      Isn't that problematic?

        Is the structure likely to be more than a few tens of directories deep? If not, no problem. If it is then you'll have to work really hard to fix the problem regardless of what tools you use because most simple solutions will keep directory handles open.
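
        For illustration, here is a rough sketch (untested) of the streaming loop with a lexical directory handle; each recursion level holds exactly one open handle, so the number of handles open at any moment equals the current depth, not the number of directories visited:

        sub scan {
            my ($dir) = @_;
            opendir( my $dh, $dir ) or die "Unable to open $dir: $!\n";
            while ( defined( my $name = readdir($dh) ) ) {
                next if $name eq '.' or $name eq '..';
                my $path = "$dir/$name";
                if ( -d $path ) {
                    scan($path);    # $dh stays open while the deeper levels run
                }
                elsif ( $name =~ /\.msg$/ ) {
                    # ... process the message file, as in the OP's code
                }
            }
            closedir($dh);    # closed as soon as this level is finished
        }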

        True laziness is hard work
Re: Optimizing performance for script to traverse on filesystem
by Marshall (Canon) on Feb 01, 2012 at 23:45 UTC
    I would definitely recommend using File::Find or one of its variants (and there are some fancy ones) to do the directory scanning. This eliminates the need for you to write any recursive code yourself.

    I didn't test the code below and there is bound to be some kind of mistake in it. But this is to give you an idea of another approach.

    The simplest variant of File::Find calls a subroutine for every file or directory underneath the starting place. A localized variable $File::Find::name contains the full path of where we currently are (dir name or file name). I suggest that you first run the wanted sub with only the print line at the end (shown in the comments) to see the default order of the descent.

    I think this collects the data that you wanted? But not 100% sure that I got everything.

    Since you are interested in performance, one not-so-obvious point is the default _ (underscore) filehandle. Not $_, just plain "_". When you do a file test operation, all of the "stat" info gets collected via a file system request. If you want another file test on the same file (like -s, -f, -d), using the "_" variable means reusing the previous stat info without making another expensive call to the file system.
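
    For example (hypothetical $name, just to show the idiom):

    if ( -f $name ) {           # stat($name) hits the file system once
        my $size = -s _;        # reuses the cached stat buffer
        my $age  = -M _;        # still no extra file system call
        print "$name: $size bytes, $age days old\n";
    }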

    Hope this at least provides some fuel for thought and further improvement.

    #!/usr/bin/perl -w
    use strict;
    use File::Find;

    # find() calls the wanted subroutine for each file and directory
    # underneath the starting point(s) (can specify more than one
    # directory tree to traverse down)

    # within subroutine "wanted":
    # use these variables to figure out where you are:
    #   $File::Find::dir  is the current directory name,
    #   $_                is the current filename within that directory
    #   $File::Find::name is the complete pathname to the file.
    # wanted() cannot return anything directly

    # declare data structs at a higher scope that this sub writes to
    my %mailboxesWithMessages;
    my %allMailboxes;

    find( \&wanted, "C:/temp" );

    sub wanted {
        # full path must contain a mailbox number
        my $mbox;
        return() unless ( ($mbox) = $File::Find::name =~ /(\d+)\@/ );

        if ( -d $File::Find::name ) {    # some mbox may have no .msg files
            $allMailboxes{$mbox} = 1;
            return;
        }

        # must be looking at a message file
        return() unless ( -f _ and $File::Find::name =~ /\.msg$/ );

        my $size = -s _;
        $size = 4096 if $size < 4096;
        $mailboxesWithMessages{$mbox} += $size;

        # comment out the above and run this sub with just this line to
        # see what it does...
        # print "$File::Find::name\n";

        return;
    }
    Sort %mailboxesWithMessages to get the biggest one(s). Cycle through %allMailboxes - any key there that doesn't exist in the other hash means a mailbox with no messages (empty).
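
    A rough sketch of that reporting step (untested; the "top 10" cutoff is just an example value):

    # biggest mailboxes first
    my @by_size = sort { $mailboxesWithMessages{$b} <=> $mailboxesWithMessages{$a} }
                  keys %mailboxesWithMessages;
    my $top = @by_size < 10 ? scalar @by_size : 10;
    printf "%s => %d bytes\n", $_, $mailboxesWithMessages{$_} for @by_size[ 0 .. $top - 1 ];

    # mailboxes seen as directories but holding no .msg files
    my @empty = grep { !exists $mailboxesWithMessages{$_} } keys %allMailboxes;
    print scalar(@empty), " empty mailboxes\n";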

      I tested the same logic the way you suggested and found it a little bit slower for big directories.
      But it is much easier to write, of course.