PerlMonks
Spreadsheet::WriteExcel large files

by mrguy123 (Hermit)
on Jan 02, 2012 at 08:29 UTC ( #945859=perlquestion )
mrguy123 has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,
I am using Spreadsheet::WriteExcel to create a file with 20,000 lines. The size of the file is quite large - over 100 megabytes.
However, when I open the file and save it as a regular Excel file, the file size goes down to 6 megabytes!
I am using fairly standard features (I've added some of my code just in case):

    for my $link (@links) {
        my @fields = split( '\t', $link );
        ## Print to xls
        my $col = 0;
        foreach my $token (@fields) {
            $worksheet->write( $row, $col, $token );
            $col++;
        }
        $row++;
    }
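As an aside, the inner loop can be replaced with a single write_row() call. This won't change the output file size, but it is tidier. A minimal self-contained sketch, assuming the usual workbook setup; 'demo.xls' and the sample data are hypothetical stand-ins:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Spreadsheet::WriteExcel;

    my $workbook  = Spreadsheet::WriteExcel->new('demo.xls');
    my $worksheet = $workbook->add_worksheet();

    my @links = ( "a\tb\tc", "d\te\tf" );    # stand-in for the real input

    my $row = 0;
    for my $link (@links) {
        my @fields = split /\t/, $link;
        # write_row() writes the whole array ref starting at ($row, 0)
        $worksheet->write_row( $row++, 0, \@fields );
    }

    $workbook->close();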
Any idea why this happens and how I can make the original file smaller?
Thanks, MrGuy

Bug: An aspect of a computer program which exists because the programmer was thinking about Jumbo Jacks or stock options when he wrote the program.

Replies are listed 'Best First'.
Re: Spreadsheet::WriteExcel large files (text versus binary format)
by BrowserUk (Pope) on Jan 02, 2012 at 10:55 UTC
    I am using Spreadsheet::WriteExcel to create a file with 20,000 lines. The size of the file is quite large - over 100 megabytes.

    However, when I open the file and save it as a regular Excel file, the file size goes down to 6 megabytes!

    ...

    Any idea why this happens ...

    The difference is simply that between the text formatting of your input file and the binary data format of .xls files.

    Your input is ASCII-encoded tab-separated values. One million records of 100 small integers each take nearly 400MB:

    C:\test>perl -le"print join chr(9), map int( rand 1000 ), 1 .. 100 for 1 .. 1e6" > junk.dat

    C:\test>dir junk.dat
    02/01/2012  09:05       390,001,724 junk.dat

    Conversely, that same data stored as binary takes just 2MB:

    C:\test>perl -E"binmode STDOUT; print pack 'S', map int( rand 1000 ), 1 .. 100 for 1 .. 1e6" >junk.dat

    C:\test>dir junk.dat
    02/01/2012  09:06         2,000,000 junk.dat

    Mind you, compared with the latest file format used for such data, that ridiculous 200:1 size ratio is positively sane. The same data stored as XML:

    #! perl -slw
    use strict;

    print '<file rows="1000000" columns="100" >';
    for my $row ( 1 .. 1e6 ) {
        print qq[<row id="$row">];
        print join '', map qq[<col id="$_">${ \int( rand 1000 ) }</col>], 1 .. 100;
        print qq[</row>];
    }
    print '</file>';
    __END__

    C:\test>junkxml >junk.xml

    C:\test>head junk.xml
    <file rows="1000000" columns="100" >
    <row id="1">
    <col id="1">897</col><col id="2">421</col>...
    "24">753</col><col id="25">707</col><col i...
    id="47">195</col><col id="48">889</col><c...
    ><col id="70">450</col><col id="71">779</c...
    ol><col id="93">769</col><col id="94">90</...
    </row>
    ...

    C:\test>dir junk.xml
    02/01/2012  09:45     2,206,994,460 junk.xml

    That takes up an absolutely ludicrous 2.2 gigabytes, over 1000 times as much space, to hold the exact same data!

    Of course, the anti-binary file lobby and XML advocates will tell you that "binary files can't be manually inspected and edited" the way text files can. But really? Who is going to manually verify or edit 2 million numbers? Let alone do so whilst wading their way through the 2.2GB of pointless, repetitive, eye-crossing verbosity that is XML.

    And then there is that other stand-by defence: "Disk space is cheap". Which is true. But IO transfer time and encoding/decoding costs definitely aren't.

    They'll also say that text formats compress with much higher compression ratios than binary data, which they do. But that completely misses the point that even compressed with the very best algorithms, these text formats still require 100s of times more space than the uncompressed binary data.

    The text version:

    C:\test>dir junk.dat
    02/01/2012  09:14       389,997,094 junk.dat

    C:\test>bzip2 -9 junk.dat

    C:\test>dir junk.dat.bz2
    02/01/2012  09:14       131,705,924 junk.dat.bz2

    And the XML version:

    C:\test>dir junk.xml
    02/01/2012  09:45     2,206,994,460 junk.xml

    C:\test>bzip2 -9 junk.xml

    C:\test>dir junk.xml.bz2
    02/01/2012  09:45       216,299,791 junk.xml.bz2

    Sure, a 10:1 compression ratio is amazing, until you realise that the result still takes up 100 times as much space as the raw binary data. And it will therefore take 100 times as long to read from disk. Not to mention the 7 minutes of CPU it took to compress it; and the 2 minutes of CPU it'll take to decompress it and another 2 minutes of CPU it'll take to parse the XML; in order to get back the 2MB of data it contains that can be loaded from disk in less than 1 second.

    So the answer to your question "how I can make the original file smaller?" is: store it in binary format and save time, hassle, disk and CPU.
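    For completeness, a minimal sketch of the pack/unpack round trip for one such row of 100 small integers. Note that the list-repeat template 'S*' is needed to pack every value in the list; a bare 'S' template packs only the first one:

        use strict;
        use warnings;

        # One row of 100 small integers, as in the examples above.
        my @row = map int( rand 1000 ), 1 .. 100;

        # 'S*' packs the whole list as unsigned 16-bit shorts: 2 bytes per value.
        my $packed = pack 'S*', @row;
        die "unexpected length" unless length($packed) == 200;

        # unpack reverses it exactly.
        my @back = unpack 'S*', $packed;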


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      Thank you for your answer and the enlightenment about the difference between ASCII and binary encoding.
      I will try going in that direction.
      MrGuy
      What is confusing to me is that both the 6MB file and the 100MB file are Microsoft Office documents, which are not ASCII-encoded.
      guyn@il-kblwe02>ls -1sh
      total 153M
      6.1M linking_optimization_results_small.xls
       47M linking_optimization.txt
      100M linking_optimization.xls
      guyn@il-kblwe02>file linking_optimization_results_small.xls
      linking_optimization_results_small.xls: Microsoft Office Document
      guyn@il-kblwe02>file linking_optimization.xls
      linking_optimization.xls: Microsoft Office Document
      Also, they have roughly the same number of lines:
      guyn@il-kblwe02>wc -l linking_optimization.xls
      304562 linking_optimization.xls
      guyn@il-kblwe02>wc -l linking_optimization_results_small.xls
      299554 linking_optimization_results_small.xls
      So why is one so much larger than the other?
        What is confusing to me is that both the 6MB file and the 100MB file are Microsoft Office documents, which are not ASCII-encoded.... So why is one so much larger than the other?

        My guess is that the smaller contains just the results, whereas the larger contains the formulae used to derive those results. But that is only a guess.

        Also, they have the same amount of lines

        Using wc -l on binary files is not useful. It only tells you how many bytes with the value 10 decimal (the LF newline byte) the file contains. But those bytes are probably not newlines at all, just bytes within packed binary values that happen to look like newlines.
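        To see why, here is a small sketch showing that a 0x0A byte inside packed numeric data looks like a "line" to wc -l, even though no text is involved. The values here are chosen purely for illustration:

            use strict;
            use warnings;

            # 266 is 0x010A and 522 is 0x020A: packed as little-endian shorts,
            # each contributes a 0x0A (LF) byte to the binary blob.
            my $binary = pack 'S<*', 266, 5, 522;

            # wc -l would report 2 "lines" for this 6-byte blob.
            my $fake_lines = () = $binary =~ /\x0A/g;
            print "$fake_lines\n";    # prints 2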

        I would have thought your simplest option would be to open each of the files using Excel (or other program that can read .xls files) and inspect what they each contain.



Re: Spreadsheet::WriteExcel large files
by jmcnamara (Monsignor) on Jan 02, 2012 at 17:54 UTC

    The most likely explanation for the size difference is the file format the file is saved in. Is the 6MB file in xls or xlsx format?

    Excel's xls binary BIFF format doesn't use compression. However, it does use some space optimizations which Spreadsheet::WriteExcel doesn't include, so in general a file re-saved in xls format by Excel may be 10-20% smaller, depending on the data it contains and the arrangement of that data.

    Excel's xlsx format is compressed XML data so it can be significantly smaller than the equivalent xls file even if both are created by Excel.

    If you would like to generate files in the xlsx format try Excel::Writer::XLSX. It is API compatible with Spreadsheet::WriteExcel so you should be able to just change the module names and re-run your program. If you do, let us know what the xlsx output file size is.
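    For example, the loop from the original question needs only the module name and the file extension changed. A sketch; 'results.xlsx' and the sample data are hypothetical stand-ins:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Excel::Writer::XLSX;    # drop-in replacement for Spreadsheet::WriteExcel

        my $workbook  = Excel::Writer::XLSX->new('results.xlsx');
        my $worksheet = $workbook->add_worksheet();

        my @links = ( "a\tb\tc", "1\t2\t3" );    # stand-in data

        my $row = 0;
        for my $link (@links) {
            my $col = 0;
            for my $token ( split /\t/, $link ) {
                $worksheet->write( $row, $col++, $token );
            }
            $row++;
        }

        $workbook->close();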

    Here is some sample data based on the output of the bigfile.pl example program:

    bigfile1.xls   10,379,264  # Original from Spreadsheet::WriteExcel.  100%
    bigfile2.xls   10,525,184  # Resaved by Excel in xls format.         102%
    bigfile3.xlsx   2,733,546  # Resaved by Excel in xlsx format.         27%
    bigfile4.xlsx   2,325,559  # Original from Excel::Writer::XLSX.       23%
    bigfile5.zip    1,888,573  # bigfile1.xls zipped for comparison.      18%

    Results for other files may vary significantly depending on the ratio of string/number data and some other factors.

    --
    John.

      Thanks for your answer.
      I re-saved the files as XLS (not XLSX), so this isn't the problem.
      The funny thing is, I just ran the bigfile.pl example you mentioned, and when I re-saved that file there was no size reduction!
      This means I must have done something different, but I'm not sure what.
      Will investigate and update if I find something.
      Guy
Re: Spreadsheet::WriteExcel large files
by CountZero (Bishop) on Jan 02, 2012 at 16:42 UTC
    I think Spreadsheet::WriteExcel does not use any compression on the data it writes to the Excel file, but opening and saving the file with the Excel application compresses the data.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Spreadsheet::WriteExcel large files
by Anonymous Monk on Jan 02, 2012 at 08:32 UTC

    Compression is compression :)

Node Type: perlquestion [id://945859]
Approved by Old_Gray_Bear