Beefy Boxes and Bandwidth Generously Provided by pair Networks Bob
Syntactic Confectionary Delight
 
PerlMonks

Reducing the memory usage of Spreadsheet::ParseExcel

by jmcnamara (Monsignor)
 | Log in | Create a new user | The Monastery Gates | Super Search | 
 | Seekers of Perl Wisdom | Meditations | PerlMonks Discussion | 
 | Obfuscation | Reviews | Cool Uses For Perl | Perl News | Q&A | Tutorials | 
 | Poetry | Recent Threads | Newest Nodes | Donate | What's New | 

on Aug 03, 2004 at 16:56 UTC ( #379743=snippet: print w/ replies, xml ) Need Help??

Description: A simple technique to help reduce the memory used by Spreadsheet::ParseExcel.

Spreadsheet::ParseExcel processes Excel files in two stages. In the first stage it extracts the Excel binary stream from the OLE container file using OLE::Storage_Lite. In the second stage it parses the binary stream to read workbook, worksheet and cell data which it then stores in memory. Of the second stage storage by far the largest amount is taken up by the cell data.

As each cell is encountered a cell handling function creates a relatively large nested cell object that contains the cell value and all of the data that relates to the cell formatting. For large files (a 10MB Excel file on a 256MB system) this overhead can cause the system to grind to a halt.

However, in a lot of cases the only information that is required is the cell value. In these cases it is possible to avoid most of the memory overhead by specifying your own cell handling function and by telling Spreadsheet::ParseExcel not to store the parsed cell data. Here is an example.


    #!/usr/bin/perl -w

    use strict;
    use Spreadsheet::ParseExcel;

    my $parse_excel = Spreadsheet::ParseExcel->new(
                                                CellHandler => \&cell_
+handler,
                                                NotSetCell  => 1
                                                  );

    my $workbook = $parse_excel->Parse('file.xls');


    sub cell_handler {

        my $workbook    = $_[0];
        my $sheet_index = $_[1];
        my $row         = $_[2];
        my $col         = $_[3];
        my $cell        = $_[4];

        # Do something useful with the formatted cell value
        print $cell->{_Value}, "\n";

    }
The user specified cell handler is passed as a code reference to new() along with the parameter NotSetCell which tells Spreadsheet::ParseExcel not to store the parsed cell.

The cell handler is passed 5 arguments. The first, $workbook, is a reference to the Spreadsheet::ParseExcel::Workbook object that represent the parsed workbook. This can be used to access any of the parsed workbook global data. The second $sheet_index is the zero-based index of the worksheet being parsed. The third and fourth are the zero-based row and column number of the cell. The fifth, $cell, is a reference to the Spreadsheet::ParseExcel::Cell object. This can be used extract the data from the cell.

Each cell contains an unformatted value, $cell->{Val} and a formatted (by Spreadsheet::ParseExcel) value, $cell->{_Value}. In a majority of cases the latter value is the one that is required.

If you don't want all of the data in the spreadsheet you can add some checks to the cell handler. The following example only prints the first 10 rows of the first two worksheets in the parsed workbook.


    #!/usr/bin/perl -w

    use strict;
    use Spreadsheet::ParseExcel;

    my $parse_excel = Spreadsheet::ParseExcel->new(
                                                CellHandler => \&cell_
+handler,
                                                NotSetCell  => 1
                                                  );

    my $workbook = $parse_excel->Parse('file.xls');


    sub cell_handler {

        my $workbook    = $_[0];
        my $sheet_index = $_[1];
        my $row         = $_[2];
        my $col         = $_[3];
        my $cell        = $_[4];

        # Skip some worksheets and rows (inefficiently).
        return if $sheet_index >= 3;
        return if $row         >= 10;

        # Do something with the formatted cell value
        print $cell->{_Value}, "\n";

    }

However, this still processes the entire workbook. If you wish to save some time you can abort the parsing after you have read the data that you want:

    #!/usr/bin/perl -w

    use strict;
    use Spreadsheet::ParseExcel;

    my $parse_excel = Spreadsheet::ParseExcel->new(
                                                CellHandler => \&cell_
+handler,
                                                NotSetCell  => 1
                                                  );

    my $workbook = $parse_excel->Parse('file.xls');


    sub cell_handler {

        my $workbook    = $_[0];
        my $sheet_index = $_[1];
        my $row         = $_[2];
        my $col         = $_[3];
        my $cell        = $_[4];

        # Skip some worksheets and rows (more efficiently).
        if ($sheet_index >= 1 and $row >= 10) {
            $workbook->ParseAbort(1);
            return;
        }

        # Do something with the formatted cell value
        print $cell->{_Value}, "\n";

    }

In general this technique is useful if you are writing an Excel to database filter since you can put your DB calls in the cell handler.

Comment on Reducing the memory usage of Spreadsheet::ParseExcel
Select or Download Code
Re: Reducing the memory usage of Spreadsheet::ParseExcel
by knoebi (Friar) on Aug 03, 2004 at 17:12 UTC
    First thing i do tomorow: ++ !

    Great Post! I have just wrote a excel-to-db interface which has to read excel files up to 30 mb! we have a 4 GB machine, so this still works but my coworkers get pissed off.

    ciao, knoebi

[reply]
Re: Reducing the memory usage of Spreadsheet::ParseExcel
by dragonchild (Archbishop) on Aug 03, 2004 at 17:29 UTC
    Would it be possible to convert the data structures used to ones that are less memory-consuming? For example, I know you use hashrefs. What about using arrayrefs and constants for the indices? That should reduce memory usage by about 15%. Also, you could move to an inside-out objects type of structure, which would help even more here because you have lots of little structures.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

[reply]

      Valid points but just to clarify, Takanori Kawai wrote and maintains Spreadsheet::ParseExcel. I am the positron to his electron. :-)

      --
      John.

[reply]
        *blushes* I saw one module and responded about another. :-)

        ------
        We are the carpenters and bricklayers of the Information Age.

        Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

        I shouldn't have to say this, but any code, unless otherwise stated, is untested

[reply]
Re: Reducing the memory usage of Spreadsheet::ParseExcel
by Anonymous Monk on Sep 01, 2004 at 13:21 UTC
    I tried your example and I am still not able to parse the 7 MB excel spreadsheet I need to parse. Anyone know of any other workarounds? Thanks!
[reply]
      Hey John,

      I read your post about the reducing memory usage by avoiding the formating data. In fact, thatīs a great idea, and I really donīt want the formating data in these worksheets. But Iīm having trouble in fitting the code into mine. Can you or any of the fellow monks take a look?

      Do I have to replace the names of the variables of the sub cell_handler?

      #!/usr/bin/perl -w use CGI qw( :standard); use Spreadsheet::ParseExcel; # The new objects my $q = new CGI; my $oExcel = Spreadsheet::ParseExcel->new( CellHandler => \&cell_handl +er, NotSetCell => 1 ); # McNamara's alternative cell handler sub cell_handler { my $workbook = $_[0]; my $sheet_index = $_[1]; my $row = $_[2]; my $col = $_[3]; my $cell = $_[4]; print $cell->{Val} , "<br>"; } # The header print $q->header( -type => "text/html"); # Starting the cronometer $start = (times) [0]; # The reading # my $file = "z4kjasdf.xls"; my $file = "acervo15k.xls"; # my $file = "teste1.xls"; my $oBook = $oExcel->Parse($file); my($iR, $iC, $oWkS, $oWkC); my @thisrow; # Just the first worksheet... $oWkS = $oBook->{Worksheet}[0]; for ( $iR = $oWkS->{MinRow} ; defined $oWkS->{MaxRow} && $iR <= $oWkS- +>{MaxRow}; $iR++) { @thisrow = (); for ( $iC = $oWkS->{MinCol}; defined $oWkS->{MaxCol} && $iC <= $oWk +S->{MaxCol}; $iC++) { $oWkC = $oWkS->{Cells}[$iR][$iC]; if ($oWkC) { push (@thisrow, $oWkC->{Val}); } } my ($col1, $col2, $col3, $col4, @othercols) = @thisrow; print "$col1 <br>"; } # Ending the cronometer $end = (times) [0]; print "<b>$iR rows read in " , $end-$start , " seconds";
      Thanks!

      André

[reply]
[d/l]

Back to Snippets Section

Login:
Password
remember me
What's my password?
Create A New User

Node Status
node history
Node Type: snippet [id://379743]
help
Community Ads
Chatterbox
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users
Others surveying the Monastery: (10)
GrandFather
wfsp
atcroft
herveus
Eyck
clinton
djp
vishi83
gnosti
darkman0101
As of 2009-11-21 09:48 GMT
Sections
The Monastery Gates
Seekers of Perl Wisdom
Meditations
PerlMonks Discussion
Categorized Q&A
Tutorials
Obfuscated Code
Perl Poetry
Cool Uses for Perl
Perl News
Information
PerlMonks FAQ
Guide to the Monastery
What's New at PerlMonks
Voting/Experience System
Tutorials
Reviews
Library
Perl FAQs
Other Info Sources
Find Nodes
Nodes You Wrote
Super Search
List Nodes By Users
Newest Nodes
Recently Active Threads
Selected Best Nodes
Best Nodes
Worst Nodes
Saints in our Book
Leftovers
The St. Larry Wall Shrine
Offering Plate
Awards
Craft
Snippets Section
Code Catacombs
Quests
Editor Requests
Buy PerlMonks Gear
PerlMonks Merchandise
Planet Perl
Perlsphere
Use Perl
Perl.com
Perl 5 Wiki
Perl Jobs
Perl Mongers
Perl Directory
Perl documentation
CPAN
Random Node
Voting Booth

Future historians will find that the material characteristic of the current era is...

Aluminium
Plastic
Oil
Water
Carbon dioxide
Copper
Iron
Silicon
Salt
Uranium
Hydrogen
Other

Results (729 votes), past polls