Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Reducing the memory usage of Spreadsheet::ParseExcel

by jmcnamara (Monsignor)
on Aug 03, 2004 at 16:56 UTC ( [id://379743]=CUFP: print w/replies, xml ) Need Help??

A simple technique to help reduce the memory used by Spreadsheet::ParseExcel.

Spreadsheet::ParseExcel processes Excel files in two stages. In the first stage it extracts the Excel binary stream from the OLE container file using OLE::Storage_Lite. In the second stage it parses the binary stream to read workbook, worksheet and cell data which it then stores in memory. Of the second stage storage by far the largest amount is taken up by the cell data.

As each cell is encountered a cell handling function creates a relatively large nested cell object that contains the cell value and all of the data that relates to the cell formatting. For large files (a 10MB Excel file on a 256MB system) this overhead can cause the system to grind to a halt.

However, in a lot of cases the only information that is required is the cell value. In these cases it is possible to avoid most of the memory overhead by specifying your own cell handling function and by telling Spreadsheet::ParseExcel not to store the parsed cell data. Here is an example.

#!/usr/bin/perl -w use strict; use Spreadsheet::ParseExcel; my $parse_excel = Spreadsheet::ParseExcel->new( CellHandler => \&cell_ +handler, NotSetCell => 1 ); my $workbook = $parse_excel->Parse('file.xls'); sub cell_handler { my $workbook = $_[0]; my $sheet_index = $_[1]; my $row = $_[2]; my $col = $_[3]; my $cell = $_[4]; # Do something useful with the formatted cell value print $cell->{_Value}, "\n"; }
The user specified cell handler is passed as a code reference to new() along with the parameter NotSetCell which tells Spreadsheet::ParseExcel not to store the parsed cell.

The cell handler is passed 5 arguments. The first, $workbook, is a reference to the Spreadsheet::ParseExcel::Workbook object that represent the parsed workbook. This can be used to access any of the parsed workbook global data. The second $sheet_index is the zero-based index of the worksheet being parsed. The third and fourth are the zero-based row and column number of the cell. The fifth, $cell, is a reference to the Spreadsheet::ParseExcel::Cell object. This can be used extract the data from the cell.

Each cell contains an unformatted value, $cell->{Val} and a formatted (by Spreadsheet::ParseExcel) value, $cell->{_Value}. In a majority of cases the latter value is the one that is required.

If you don't want all of the data in the spreadsheet you can add some checks to the cell handler. The following example only prints the first 10 rows of the first two worksheets in the parsed workbook.

#!/usr/bin/perl -w use strict; use Spreadsheet::ParseExcel; my $parse_excel = Spreadsheet::ParseExcel->new( CellHandler => \&cell_ +handler, NotSetCell => 1 ); my $workbook = $parse_excel->Parse('file.xls'); sub cell_handler { my $workbook = $_[0]; my $sheet_index = $_[1]; my $row = $_[2]; my $col = $_[3]; my $cell = $_[4]; # Skip some worksheets and rows (inefficiently). return if $sheet_index >= 3; return if $row >= 10; # Do something with the formatted cell value print $cell->{_Value}, "\n"; }
However, this still processes the entire workbook. If you wish to save some time you can abort the parsing after you have read the data that you want:
#!/usr/bin/perl -w use strict; use Spreadsheet::ParseExcel; my $parse_excel = Spreadsheet::ParseExcel->new( CellHandler => \&cell_ +handler, NotSetCell => 1 ); my $workbook = $parse_excel->Parse('file.xls'); sub cell_handler { my $workbook = $_[0]; my $sheet_index = $_[1]; my $row = $_[2]; my $col = $_[3]; my $cell = $_[4]; # Skip some worksheets and rows (more efficiently). if ($sheet_index >= 1 and $row >= 10) { $workbook->ParseAbort(1); return; } # Do something with the formatted cell value print $cell->{_Value}, "\n"; }

In general this technique is useful if you are writing an Excel to database filter since you can put your DB calls in the cell handler.

Replies are listed 'Best First'.
Re: Reducing the memory usage of Spreadsheet::ParseExcel
by dragonchild (Archbishop) on Aug 03, 2004 at 17:29 UTC
    Would it be possible to convert the data structures used to ones that are less memory-consuming? For example, I know you use hashrefs. What about using arrayrefs and constants for the indices? That should reduce memory usage by about 15%. Also, you could move to an inside-out objects type of structure, which would help even more here because you have lots of little structures.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested


      Valid points but just to clarify, Takanori Kawai wrote and maintains Spreadsheet::ParseExcel. I am the positron to his electron. :-)

      --
      John.

        *blushes* I saw one module and responded about another. :-)

        ------
        We are the carpenters and bricklayers of the Information Age.

        Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

        I shouldn't have to say this, but any code, unless otherwise stated, is untested

Re: Reducing the memory usage of Spreadsheet::ParseExcel
by knoebi (Friar) on Aug 03, 2004 at 17:12 UTC
    First thing i do tomorow: ++ !

    Great Post! I have just wrote a excel-to-db interface which has to read excel files up to 30 mb! we have a 4 GB machine, so this still works but my coworkers get pissed off.

    ciao, knoebi

Re: Reducing the memory usage of Spreadsheet::ParseExcel
by Anonymous Monk on Sep 01, 2004 at 13:21 UTC
    I tried your example and I am still not able to parse the 7 MB excel spreadsheet I need to parse. Anyone know of any other workarounds? Thanks!
      Hey John,

      I read your post about the reducing memory usage by avoiding the formating data. In fact, thatīs a great idea, and I really donīt want the formating data in these worksheets. But Iīm having trouble in fitting the code into mine. Can you or any of the fellow monks take a look?

      Do I have to replace the names of the variables of the sub cell_handler?

      #!/usr/bin/perl -w use CGI qw( :standard); use Spreadsheet::ParseExcel; # The new objects my $q = new CGI; my $oExcel = Spreadsheet::ParseExcel->new( CellHandler => \&cell_handl +er, NotSetCell => 1 ); # McNamara's alternative cell handler sub cell_handler { my $workbook = $_[0]; my $sheet_index = $_[1]; my $row = $_[2]; my $col = $_[3]; my $cell = $_[4]; print $cell->{Val} , "<br>"; } # The header print $q->header( -type => "text/html"); # Starting the cronometer $start = (times) [0]; # The reading # my $file = "z4kjasdf.xls"; my $file = "acervo15k.xls"; # my $file = "teste1.xls"; my $oBook = $oExcel->Parse($file); my($iR, $iC, $oWkS, $oWkC); my @thisrow; # Just the first worksheet... $oWkS = $oBook->{Worksheet}[0]; for ( $iR = $oWkS->{MinRow} ; defined $oWkS->{MaxRow} && $iR <= $oWkS- +>{MaxRow}; $iR++) { @thisrow = (); for ( $iC = $oWkS->{MinCol}; defined $oWkS->{MaxCol} && $iC <= $oWk +S->{MaxCol}; $iC++) { $oWkC = $oWkS->{Cells}[$iR][$iC]; if ($oWkC) { push (@thisrow, $oWkC->{Val}); } } my ($col1, $col2, $col3, $col4, @othercols) = @thisrow; print "$col1 <br>"; } # Ending the cronometer $end = (times) [0]; print "<b>$iR rows read in " , $end-$start , " seconds";
      Thanks!

      André

Re: Reducing the memory usage of Spreadsheet::ParseExcel
by runrig (Abbot) on Jun 07, 2012 at 23:21 UTC
    And a Patch is now available for Spreadsheet::XLSX to do the same thing! (oops, ParseAbort not implemented yet as of this post. Update: Now it is!). Update2: Don't celebrate too much. Due to the method of reading the data, it actually still reads an entire worksheet into memory.

    Update: Bah, forget this...just hang on and wait for jmcnamara's Excel::Reader::XLSX.

Re: Reducing the memory usage of Spreadsheet::ParseExcel
by joeperl (Acolyte) on Feb 24, 2010 at 18:46 UTC

    hi john
    what about the sheet name? how to get it in your approach?

    regards,
    joe


      Once you have access to the $workbook object you can access the worksheet objects and then the sheet name. Something like this:
      sub cell_handler { my $workbook = $_[0]; my $sheet_index = $_[1]; my $row = $_[2]; my $col = $_[3]; my $cell = $_[4]; my $worksheet = $workbook->worksheet($sheet_index); my $sheetname = $worksheet->get_name(); print $sheetname, ": ", $cell->{_Value}, "\n"; }
      See the main Spreadshhet::PraseExcel documentation for more details on these methods.

      In a real world situation you would probably check to see if $sheet_index is the same as a stored previous value and print the sheet name if it isn't.

      One thing to look out for is that the callback will only get called for sheets that contain data. So the sheet name of an empty worksheet wouldn't be printed with the above scheme. If this is a use case that you are interested in then you could also handle it by examining the sheet indices.

      --
      John.

        Hi John,
        I did the same kind of thing before seeing your reply, except that i passed the $workbook,$sheet_index as a argument to another function where i found out the sheet name using the same logic..
        but your approach is the simplest and apt one i was looking for...
        guess i've a knack to complicate simple things.. :)

        Thanks a lot for educating me..

        Regards,
        joe

Re: Reducing the memory usage of Spreadsheet::ParseExcel
by joeperl (Acolyte) on Mar 07, 2010 at 11:33 UTC

    hi john,
    Here im again with another question.. :)
    Im parsing an excel sheet where there are hex address values, which are calculated through formulae based on the values of other cells..
    One strange thing i noticed is that while parsing hex values which have alphabets (A,B,C,D,E) i get the values as as they are, but while parsing values with out alphabets (like 00408000) i get the values as 0040 8000..
    Why is that a space is added in the value ? howw to workaround this issue? also the value 00000000 is read as just 0.. is there a way where i can read it as 32 bit address 00000000 ?


      The best thing to do is to send an example spreadsheet to the email address in the Spreadsheet::WriteExcel docs and I'll have a look at it.

      You can reduce the file down to a few example cells if you wish.

      --
      John.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: CUFP [id://379743]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (2)
As of 2024-03-19 06:59 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found