PerlMonks  

Speeding up Spreadsheet::XLSX file load in UNIX

by ketanh (Novice)
on Jul 04, 2011 at 23:57 UTC (#912728)
ketanh has asked for the wisdom of the Perl Monks concerning the following question:

I'm using Spreadsheet::XLSX to parse some reasonably big Excel workbooks. They're in the ~5MB range.

I'm not really having any trouble with the reading/writing itself, but just opening the .xlsm in Perl takes ~30 seconds. For comparison, the same workbook opens with Win32::OLE in ~4 seconds on my Windows machine. I'm using the commands as given in the CPAN Spreadsheet::XLSX module description.

use Text::Iconv;
my $converter = Text::Iconv->new("utf-8", "windows-1251");

use Spreadsheet::XLSX;

my $InFileName = shift @ARGV or die "no input file specified";
chomp(my $excelLocS = $InFileName);
my $excel = Spreadsheet::XLSX->new($InFileName) or die "file does not exist";

I did verify that the load time depends on how much data there is in the workbook. I've wrapped timing calls around the single line "my $excel = ..." and confirmed that it is the bottleneck.
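For anyone wanting to reproduce this kind of measurement, here is a minimal sketch using the core Time::HiRes module. The helper name time_it and the stand-in workload are assumptions made for illustration; they are not part of the original script.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper: run a code ref and return (elapsed seconds, result).
sub time_it {
    my ($code) = @_;
    my $start  = [gettimeofday];
    my $result = $code->();
    return (tv_interval($start), $result);
}

# Stand-in workload so the sketch runs without a spreadsheet at hand.
my ($elapsed, $sum) = time_it(sub {
    my $total = 0;
    $total += $_ for 1 .. 100_000;
    return $total;
});
printf "stand-in workload took %.3f s\n", $elapsed;

# With the module installed, the same wrapper would go around the constructor:
#   my ($secs, $excel) = time_it(sub { Spreadsheet::XLSX->new($InFileName) });
```

Timing only the constructor call, rather than the whole script, is what separates the load cost from any later cell access.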

Would appreciate advice on what (if anything) I could do differently here to speed up the initial load of these bigger files.

Re: Speeding up Spreadsheet::XLSX file load in UNIX
by Anonymous Monk on Jul 05, 2011 at 00:22 UTC

    Would appreciate advice on what (if anything) I could do differently here to speed up the initial load of these bigger files.

    There is nothing you can do

Re: Speeding up Spreadsheet::XLSX file load in UNIX
by Tux (Monsignor) on Jul 05, 2011 at 06:04 UTC

    I think one of the reasons Spreadsheet::XLSX is so slow is that it doesn't use a proper XML parser, but parses the workbook(s) using regular expressions. On top of that, it loads:

        use Archive::Zip;
        use Spreadsheet::XLSX::Fmt2007;
        use Data::Dumper;
        use Spreadsheet::ParseExcel;

    to be Spreadsheet::ParseExcel compatible (which it really is not).

    In most Spreadsheet modules, the whole spreadsheet (file) is read into memory, as there are several formats to be parsed before one can get to the actual data (ZIP, binary, ...). If the spreadsheet were readable directly from the file (like CSV, if you want to call that a spreadsheet), parsing could be a lot faster.

    If someone would (re)write this module using a proper (fast) XML parser, preferably with the option to select whatever (working) XML parser is installed, that would really help this module. I really mean option here, as making the module require XML::LibXML would mean its death: XML::LibXML depends on libxml2, which might prove very hard to port to some non-standardish systems. So the module should choose between XML::LibXML, XML::Parser, XML::Parser::Lite, XML::Simple, or XML::Twig (and even those might depend on each other).
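The "use whatever parser is installed" idea can be sketched in a few lines of core Perl. This is only an illustration of runtime module selection, not code from Spreadsheet::XLSX or any other module; the helper name first_available and the preference order are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first module from the list that loads,
# or undef if none of them is installed.
sub first_available {
    my @candidates = @_;
    for my $module (@candidates) {
        # Only accept plain package names before handing them to string eval.
        next unless $module =~ /\A\w+(?:::\w+)*\z/;
        return $module if eval "require $module; 1";
    }
    return;
}

# Assumed preference order, fastest-looking candidates first.
my @preferred = qw(XML::LibXML XML::Parser XML::Parser::Lite XML::Simple XML::Twig);
my $parser    = first_available(@preferred);
print defined $parser
    ? "would parse with $parser\n"
    : "no supported XML parser installed\n";
```

A real module would still need a thin adapter per parser, since these modules do not share an API; the sketch covers only the detection step.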


    Enjoy, Have FUN! H.Merijn
      If someone would (re)write this module using a proper (fast) XML parser, preferably with the option to select whatever (working) XML parser is installed, that would really help this module

      I plan to write an Excel::Reader::XLSX module once Excel::Writer::XLSX reaches full compatibility with Spreadsheet::WriteExcel (in 2-3 months).

      The main aim will be a parser that is fast and has low memory usage. It will probably be based around XML::Twig.



      --
      John.

        YEAH! jmcnamara++. Once you have a prototype working, I'd like to check it in Spreadsheet::Read (and support it there).


        Enjoy, Have FUN! H.Merijn
