http://www.perlmonks.org?node_id=490656

jcoxen has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a script that parses certain information out of a series of Excel files. The logic to read the files and grab the data I need works fine. My problem is that I keep running out of memory: what appears to be a memory leak shows up as I open more files. I'm processing the files sequentially, so I would expect any memory allocated when I open a file to be freed when I open the next, but that's not the case. I'm losing 1-2 KB of memory with each new file. This wasn't a problem with other ParseExcel scripts I've written, since I was only working on 1 or 2 files. Now I'm dealing with over 800 files, and it definitely is a problem.

Various searches have turned up alternative cell-handling techniques to limit memory usage when parsing large files. I tried them just to make sure, but they had no effect. The files I'm working with aren't that large; there are just a lot of them.

The pertinent sections of the code are listed below:

#
# Subroutines
#
sub get_cell_val {
    my ($src_sheet, $src_row, $src_col) = @_;
    if (defined $src_sheet->{Cells}[$src_row][$src_col]) {
        return $src_sheet->{Cells}[$src_row][$src_col]->Value;
    } else {
        return '';
    }
}

#
# Main Program
#
my @files = <*.xls>;

for my $file (@files) {
    print "Opening DLR $file\n" if $DEBUG;
    my $dlr_num = $file;
    $dlr_num =~ s/\.xls$//;
    print "DLR Number is $dlr_num\n" if $DEBUG;

    my $src_book = Spreadsheet::ParseExcel::Workbook->Parse($file);
    my $count = $src_book->{SheetCount};
    print "Sheet Count = $count\n" if $DEBUG;
    $count--;
    my $src_sheet = $src_book->{Worksheet}[$count];

    SHEET:
    for (my $row = 0; $row <= 100; $row++) {
        for (my $col = 0; $col <= 5; $col++) {
            my $cell_val = get_cell_val($src_sheet, $row, $col);

            # Check for disconnected circuits
            if ($cell_val =~ /disco/i) {
                print "$dlr_num has been disconnected\n" if $DEBUG;
                last SHEET;
            }

            # Start looking for related Order Numbers
            if ($cell_val =~ /Circuit Layout Data/i) {
                print "Entering Circuit Layout Data\n" if $DEBUG;
                while ($row <= 100) {
                    my $cell_val2 = get_cell_val($src_sheet, $row, 2);
                    if ($cell_val2 =~ /DWDM/i) {
                        $cell_val2 =~ s/^.*(DWDM....).*$/$1/i;
                        print "We have a WINNER!!! - DLR # $dlr_num references $cell_val2\n\n" if $DEBUG;
                    }
                    $row++;
                }
            }
        }
    }
}

Does anyone have a suggested work-around or fix for this? Or is there a problem with my code that I'm not seeing?

Thanks,

Jack

Replies are listed 'Best First'.
Re: A different kind of Spreadsheet::ParseExcel memory problem.
by philcrow (Priest) on Sep 09, 2005 at 17:51 UTC
    A quick look with Data::Dumper shows me that Spreadsheet::ParseExcel has circular links in its structures. Namely, the S::P::Cell objects have a Format key which refers to the parent object. Thus, if those links are not weakened, the structure will not be released properly when the objects go out of scope.

    To fix this would probably require a patch involving the use of weaken in functions that make S::P::Cell objects, or DESTROY methods which do that when the object is about to be garbage collected.
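    To make the cycle concrete, here is a toy sketch of weakening such a back-reference with Scalar::Util::weaken. The Node class below is a hypothetical stand-in for the S::P::Cell/parent pair, not the module's real code:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

our $destroyed = 0;    # counts garbage-collected objects

package Node;          # hypothetical stand-in for a cell/sheet pair
sub new     { my ($class, %args) = @_; return bless {%args}, $class }
sub DESTROY { $main::destroyed++ }

package main;

my $sheet = Node->new( name => 'sheet' );
my $cell  = Node->new( name => 'cell' );

$sheet->{Cells}[0] = $cell;     # parent holds the cell ...
$cell->{Format}    = $sheet;    # ... and the cell points back: a cycle

# Break the cycle: the back-reference no longer counts toward refcounts
weaken($cell->{Format});

undef $sheet;
undef $cell;

print "destroyed: $destroyed\n";    # prints "destroyed: 2"
```

    Without the weaken() call, $destroyed is still 0 at that point: the cycle keeps both objects alive until global destruction, which, repeated over hundreds of workbooks, is the pattern described above.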

    If a patch is not an option for you, you could carefully fork a new process for each file or break up the task in some other way.
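    A minimal sketch of the fork-per-file route (process_file() is a hypothetical placeholder for the parsing code above):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical placeholder for the per-file parsing logic
sub process_file {
    my ($file) = @_;
    print "processing $file\n";
}

for my $file (glob('*.xls')) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {        # child: parse one workbook, then exit
        process_file($file);
        exit 0;             # all of the child's memory goes back to the OS
    }

    waitpid($pid, 0);       # parent: one child at a time, in order
    warn "child failed on $file\n" if $? != 0;
}
```

    Any memory the parser leaks is confined to the child, so the parent's footprint stays flat no matter how many files are processed.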

    Phil

      I'm sorry I took so long to respond to your suggestion but it takes a while for my script to run and I wanted to let it finish before I replied. I was very intrigued by your idea. I was planning on implementing forking but not until after I had everything else working. Regardless, I went ahead and put it in. And it worked! 846 files and no 'Out of Memory' errors.

      Thank you very much for the help

      Jack

Re: A different kind of Spreadsheet::ParseExcel memory problem.
by InfiniteSilence (Curate) on Sep 09, 2005 at 18:51 UTC
    Expanding on philcrow's last paragraph I just encapsulated your code in routines:

    #!/usr/bin/perl -w
    use strict;
    use Win32::OLE; # or ParseExcel or whatever

    my $DEBUG = 1;

    #
    # Subroutines
    #
    sub get_cell_val {
        my ($src_sheet, $src_row, $src_col) = @_;
        if (defined $src_sheet->{Cells}[$src_row][$src_col]) {
            return $src_sheet->{Cells}[$src_row][$src_col]->Value;
        } else {
            return '';
        }
    }

    #
    # Main Program
    #
    sub ret_files {
        my ($path) = @_;
        chdir($path) or die "Cannot change path: $!";
        return glob('*.xls');
    }

    sub main {
        #... the rest of your code
    } # end of main

    1;

    Got your files into a list:

    C:\Temp>perl -e "require 'crow.pl'; print map{qq|$_\n|} ret_files('.');" > do.dat
    Ran your script for each file independently:
    perl -ne "BEGIN{require 'crow.pl'}; main($_)" do.dat

    If you really want to make things simple you can copy the above command line instructions into a batch file (.bat) and re-run it whenever you want. Update: Oh yeah, you will want to flush your output so that you can see those debug statements early on if you are going to use this for hundreds of files.
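    The flushing mentioned in the update is one line in Perl: set $| before the first print, or use IO::Handle's per-handle form:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Handle;

$| = 1;                   # autoflush the currently selected handle (STDOUT)
STDOUT->autoflush(1);     # equivalent, explicit per-handle form
STDERR->autoflush(1);

print "debug output appears immediately instead of sitting in the buffer\n";
```

    With buffering left on, a child process that dies early can take its unflushed debug output with it, which makes a many-hundred-file run hard to diagnose.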

    Celebrate Intellectual Diversity