http://www.perlmonks.org?node_id=324335

crabbdean has asked for the wisdom of the Perl Monks concerning the following question:

I've written a program to traverse our file server and remove temp files and unwanted old directories in users' directories. In development and testing it worked fine, but on its first run against the file server it died by the time it got to users beginning with the letter D.

I rewrote a few things, put in a few evals to make sure certain tests completed, and re-ran it. Again it died at ABOUT the same point.

On monitoring it I noticed the script was munching memory. I watched in horror as my system slowly ground itself to a halt. The File::Find module is the heart of the program, and the program does call it recursively. I've since tested and it's not the recursion that's the problem. As a test I wrote a simple script that just loops over my own directory using File::Find (see below). The script below doesn't munch memory as fast, but you can still watch it disappear (my real program is a little more intensive and chews through it faster). Given the size of our file server, this solution obviously isn't viable as it stands.

My only solution at this stage seems to be to either find the leak in the File::Find module or work around it. Has anyone else come across this problem?

Anyone aware of a garbage collection module?

Thanks
Dean

#!perl
use File::Find;

while (1) {
    print "\n\nStarting again ....\n\n";
    sleep 2;
    find( \&processfiles, "\\\\nwcluster_vol1_server\\vol1\\Users\\CrabbD" );
}

sub processfiles {
    print "$File::Find::name\n";
}

Re: File::Find memory leak
by samtregar (Abbot) on Jan 27, 2004 at 04:08 UTC
    There's no such thing as a "garbage collection" module. Perl does its own garbage collection using reference counting and if something's getting lost there's not much you can do about it (aside from fixing the leaky code).
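
    (As a small aside illustrating the reference-counting point -- an editorial example, not part of the reply -- the classic thing reference counting cannot reclaim by itself is a cycle, which is what Scalar::Util::weaken is for:)

        use Scalar::Util qw(weaken);

        {
            my $node = { name => 'example' };
            $node->{self} = $node;       # cycle: the refcount can never reach zero
            # weaken( $node->{self} );   # breaking the cycle lets Perl free it at scope exit
        }
        # without weaken(), the hash above is never reclaimed, even though it is no longer reachable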

    If you can't find and fix the leak you'll probably have to fork() a sub-process to do whatever leaks, pass the results up to the parent via a pipe or temp file and then exit() the child. When the child exits any memory it used will be reclaimed by the operating system. I've used this technique before with leaky Perl modules. Give it a try and post again if you have trouble.

    -sam

    PS: The above suggestion assumes you're working on a Unix system. I imagine things are different in Windows-land, where fork() is emulated with threads and exit() probably doesn't free memory.
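
    A rough sketch of the fork-per-chunk idea described above (assuming a Unix-ish perl; the top directory comes from @ARGV here, the per-user split is illustrative, and any results would normally come back through a pipe or temp file as suggested):

        use strict;
        use warnings;
        use File::Find;

        my $top = shift @ARGV;                      # e.g. the Users directory to clean

        opendir my $dh, $top or die "Can't read $top: $!";
        my @user_dirs = grep { !/^\.\.?$/ && -d "$top/$_" } readdir $dh;
        closedir $dh;

        for my $user (@user_dirs) {
            my $pid = fork();
            die "fork failed: $!" unless defined $pid;
            if ( $pid == 0 ) {                      # child does the leaky traversal
                find( sub { print "$File::Find::name\n" }, "$top/$user" );
                exit 0;                             # child's memory goes back to the OS
            }
            waitpid( $pid, 0 );                     # parent: one child at a time
        }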

      Thanks Sam, that was exactly my thinking. Great minds! If the fork doesn't work, a simpler possible alternative is to write a main script that does all the logging and calls a second script, containing the File::Find code, each time it traverses a user's directory. That will keep freeing the memory it uses. I'll let you know the results.
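
      A bare-bones shape for that two-script split, just to show the idea (driver.pl and worker.pl are made-up names; worker.pl would hold the File::Find call for a single user's directory and simply exit when done):

          # driver.pl -- hand each user directory to a fresh perl process
          use strict;
          use warnings;

          my $top = shift @ARGV;
          opendir my $dh, $top or die "Can't read $top: $!";
          for my $user ( grep { !/^\.\.?$/ && -d "$top/$_" } readdir $dh ) {
              system( $^X, 'worker.pl', "$top/$user" ) == 0
                  or warn "worker failed on $top/$user ($?)\n";
          }
          closedir $dh;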

      The "perltodo" manual page says some garbage collection work is still to be done in future for perl.

      Thanks
      Dean
Re: File::Find memory leak
by BrowserUk (Patriarch) on Jan 27, 2004 at 06:11 UTC

    Using 5.8.2 (AS808) on XP, and processing a little over 200_000 files, I see a growth pattern of around 22k per iteration, or maybe 10 bytes per file.

    If I fork each iteration of the search, the growth appears to increase slightly, to around 31k per iteration over 205,428 files.

    Doing a crude comparison of heap dumps taken before and after an iteration, it appears that the leakage isn't due to something not being freed, but rather to fragmentation of the heap: larger entities are freed and their space is partially re-used for smaller things, so the heap has to grow the next time a large entity needs to be allocated.

    Note: The comparison was very crude... with something like 12,000 individual blocks on the heap, it had to be :)

    Having the script exec itself after each iteration does stop the growth, but whether that is practical will depend upon the nature and design of your program.
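
    In its simplest form that is just (with @ARGV standing in for whatever state the next pass needs):

        # at the end of one pass, replace the running interpreter with a fresh copy of this script
        exec( $^X, $0, @ARGV ) or die "exec failed: $!";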


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    Timing (and a little luck) are everything!

Re: File::Find memory leak
by graff (Chancellor) on Jan 27, 2004 at 14:17 UTC
    I don't mean to spoil the fun of using perl, but in a case like this, I would consider looking at a Windows port of the GNU find utility. It will undoubtedly be faster and have a smaller memory footprint. (Frankly, the File::Find module seems to be a fountain of difficulty... I tend to avoid it.)
      Thanks graff, this is a clever solution. I'm all up for looking at alternatives and am looking into this now. Likewise, I'm now tending to avoid the File::Find module. I'm having to rewrite my program, as File::Find was at the heart of it, and this has rendered it obsolete as a practical solution given the sheer size of our file server.

      Thanks mate
      Dean
        Just a thought about something you might try... This works for me under unix, and I expect it would work in windows as well. It's very good in terms of using minimal memory, and having fairly low system overhead overall:
        chdir $toppath or die "can't cd to $toppath: $!";
        open( FIND, "find . -type d |" ) or die "can't run find: $!";
        while ( my $d = <FIND> ) {
            chomp $d;
            unless ( opendir( D, $d )) {
                warn "$toppath/$d: open failed: $!\n";
                next;
            }
            while ( my $f = readdir( D )) {
                next if ( -d "$d/$f" );   # outer while loop will handle all dirs
                # do what needs to be done with data files
            }
            # anything else we need to do while in this directory
        }
        close FIND;
        This has the nice property that all the tricky recursion stuff is handled by "find", while all the logic-intensive, file-based stuff is handled pretty easily by perl, working with just the data files in a single directory at any one time.
Re: File::Find memory leak
by Anonymous Monk on Jan 27, 2004 at 04:29 UTC
    Do new files keep being created in that directory? Are they symlinks? What version of File::Find do you have? What perl version?
      It's the latest version of perl, just downloaded last week. New files are created all the time; it's our main file server, and very large. There are many symlinks, but I don't follow them. Running this on Win2000.

      Dean
Re: File::Find memory leak
by tachyon (Chancellor) on Mar 10, 2004 at 03:55 UTC

    Saw your recent post with this link. You should find that a variation on this will work, and it does not leak. It 'recurses' breadth-first using a very useful perl hack: you can push to an array you are iterating over (just don't shift or splice it). All it does is push the dirs onto its dir list as it finds them. Short, simple and fast.

    This builds an array of all the files it finds (full path), but you could stick your &wanted code in there instead and have it return nothing. With UNC paths you will want to swap the / to \\.

    sub recurse_tree {
        my ($root) = @_;
        $root =~ s!/$!!;
        my @dirs = ( $root );
        my @files;
        for my $dir ( @dirs ) {
            opendir DIR, $dir or do { warn "Can't read $dir\n"; next };
            for my $file ( readdir DIR ) {
                # skip . dirs
                next if $file eq '.' or $file eq '..';
                # skip symlinks
                next if -l "$dir/$file";
                if ( -d "$dir/$file" ) {
                    push @dirs, "$dir/$file";
                }
                else {
                    push @files, "$dir/$file";
                }
            }
            closedir DIR;
        }
        return \@dirs, \@files;
    }
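
    Called like, say (the path is just an example):

        my ($dirs, $files) = recurse_tree( '/export/Users/CrabbD' );
        print scalar(@$files), " files in ", scalar(@$dirs), " dirs\n";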

    cheers

    tachyon

      Thanks, have tested this and it works nicely. The only problem I can see is that on a large directory tree, like our terabyte file server, the returned arrays would get too big. It would have to be broken into bite-size pieces and returned piecemeal, OR you'd have to process the files and directories as you find them instead of pushing them (which would be my most obvious choice).

      Thanks

      Dean
      The Funkster of Mirth
      Programming these days takes more than a lone avenger with a compiler. - sam
      RFC1149: A Standard for the Transmission of IP Datagrams on Avian Carriers

        Actually, you *may* need real recursion to do that. But you don't have to return the list of files at all; you can certainly process them on the fly. That will of course reduce the in-memory array size by orders of magnitude, depending on the file:dir ratio.
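
        One possible shape for that on-the-fly variant -- a rework of the sub in the parent post rather than the original code, with the per-file work passed in as a callback:

            sub walk_tree {
                my ($root, $callback) = @_;
                $root =~ s!/$!!;
                my @dirs = ( $root );                   # only directories accumulate in memory
                for my $dir ( @dirs ) {
                    opendir my $dh, $dir or do { warn "Can't read $dir\n"; next };
                    for my $file ( readdir $dh ) {
                        next if $file eq '.' or $file eq '..';
                        next if -l "$dir/$file";        # skip symlinks
                        if ( -d "$dir/$file" ) {
                            push @dirs, "$dir/$file";
                        }
                        else {
                            $callback->("$dir/$file");  # act on the file now, keep nothing
                        }
                    }
                    closedir $dh;
                }
            }

            walk_tree( '/some/top/dir', sub { print "$_[0]\n" } );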

        However, using this approach (which, as you note, works fine) you are basically stuck with an array listing *all* the dirs. There is a reason for this. Although it is safe to push while you iterate over an array, it is not safe to shift, AFAIK, though I have not tested that extensively. The perl docs do basically say don't do *anything* to an array while iterating over it, but it copes fine with push. This makes a certain degree of sense: all push does is add to the end of the underlying list of pointers and bump the last index by one. Inside the loop perl is evidently not caching the end-of-array pointer, but re-checking it on every pass.
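
        You can see the push-during-iteration behaviour in isolation with something like:

            my @queue = ( 1 );
            for my $n ( @queue ) {
                push @queue, $n + 1 if $n < 5;   # for() picks up the newly pushed elements
            }
            print "@queue\n";                    # prints: 1 2 3 4 5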

        If you shift, then there is an issue: if you are looping from offset N and are at index I, and you then move N.....

        Anyway a gig of RAM will cope with ~5-10M+ dirs so it should not be a major issue unless you have very few files per dir.

        As the search is breadth-first, you could easily batch it up into a series of sub-searches based on the top 1-2 levels of the tree, if you have serious terabytes.
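
        Roughly like this (another sketch, reusing recurse_tree from the parent post, with $root standing in for the server's top directory):

            # one bounded scan per top-level user directory instead of one huge scan
            opendir my $top, $root or die "Can't read $root: $!";
            for my $entry ( readdir $top ) {
                next if $entry eq '.' or $entry eq '..';
                next if -l "$root/$entry" or !-d "$root/$entry";
                my ($dirs, $files) = recurse_tree( "$root/$entry" );
                # ... process @$files ...
            }   # $dirs and $files go out of scope before the next batch starts
            closedir $top;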

        cheers

        tachyon