Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
For some spreadsheets this can take almost a minute if they have hunreds of columns and thousands of rows and dozens of sheets. When I have thousands of documents, ...

Corion already has you well covered on your specific API question, so let me instead go off on a slightly tangential algorithmics discussion to maybe get you thinking out of the box (cell?) on this one:

While you may well be able to effect some (possibly significant) linear improvements to such a search, that's all you'll ever be able to do under your current algorithm. Consider that your algorithm is proportional to cols*rows*sheets*files*avg(length(cell)), which, given your numbers (above) conservatively puts the number of comparisons into the hundreds of billions. "Proportional to" implies the rather large constant (or maybe even worse?) overhead hidden in $Sheet->Cells($row, $col)->Value. I am surprised it runs as quickly as it does.

In other words, it would really help to know what sort of larger task this Excel reading problem fits into. Looking at all of the factors in the complexity, are there any you can optimize away? For example, maybe you're scanning all of the columns but only using a few. Two orders of magnitude right there, if you can identify those columns before processing the whole file (headers? Pattern match the first few rows?). Or (and this is the part I'm personally struggling to find a rationale for), how are thousands of hundreds-of-columns, dozens-of-sheets Excel files ever the best storage solution? Is the information relational (SQL)? Unfortunately at this point, the cost of translating the Excel files to a more efficient format is probably at least as bad as what you're trying to do, so it may not be a win unless you (ever!) need to run your algorithm more than once. However, if you can automate a .csv export from whatever program created these files in the first place (assuming you didn't pay for three calendar centuries of data entry), that might be a significant improvement.1

As you're already looking to optimize performance, I reckon there's a decent chance you'll need to do this operation more than once. If so, there may be efficiencies you can build in, such as indexing, only re-processing changed files, etc., to eliminate a lot of that penalty. This sort of task might lend itself quite well to parallel processing, even on a single machine. Benchmarking strongly recommended, obviously.

You may well have considered many/most/all of these things, but then again, I don't think there's a monk here who has never had the blinders on at some point when trying to finish a project.

And I cannot use any CPAN.

Why is that? You're already using Win32::OLE::Const, which is not core. This good old link might come in handy: Yes, even you can use CPAN

Hope this helps!

1. .csv is quite possibly cheaper full-stop, but also cheaper in the sense that if your matches are relatively sparse (< ~50%), you can win by processing files line-wise, rather than cell-wise. If the line doesn't match, move on. Otherwise, only then do you need to parse and match individual cells.

In reply to Re: Can't Find my way in Excel by rjt
in thread Can't Find my way in Excel by ExReg

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others exploiting the Monastery: (3)
    As of 2019-12-12 05:19 GMT
    Find Nodes?
      Voting Booth?

      No recent polls found