Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked

Can't Find my way in Excel

by ExReg (Priest)
on Sep 17, 2019 at 13:02 UTC ( #11106304=perlquestion: print w/replies, xml ) Need Help??

ExReg has asked for the wisdom of the Perl Monks concerning the following question:

I am looking for ways to speed up the searching for words in Excel. Currently, I simply loop through the used cells range and examine each cell's contents. For some spreadsheets this can take almost a minute if they have hunreds of columns and thousands of rows and dozens of sheets. When I have thousands of documents, this is too long.

I had a similar problem in Word where I had to search thousands of paragraphs in thousands of documents. What I did when I had only words to search for (not regexen) is use Word's Find function. This sped things up by an order of magnitude. I was wondering if anyone knew the syntax for Excel Finds. It is not the same as Word's. I have tried to translate the VBA that does work into Perl, but have failed thus far.

I currently search with something like this:

use strict; use Win32::OLE; use Win32::OLE::Const 'Microsoft Excel'; my $Excel = Win32::OLE->new('Excel.Application','Quit') or die "Cannot + open Excel\n"; my $Book = $Excel->workbooks->Open(FileName => "foo.xlsx", ReadOnly => + 1) or die "Cannot open Spreadsheet\n"; my $num_sheets = $Book->Sheets->Count; for my $sheet_num ( 1 .. $num_sheets ) { my $Sheet = $Book->worksheets( $sheet_num ); my $max_col = $Sheet->UsedRange->SpecialCells( xlCellTypeLastCell +)->Column; my $max_row = $Sheet->UsedRange->SpecialCells( xlCellTypeLastCell +)->Row; for my $row ( 1 .. $max_row ) { for my $col ( 1 .. $max_col ) { print "I found it in Sheet $sheet_num Cell $row,$col\n" if + ( $Sheet->Cells( $row, $col )->Value =~ /cabbage/ ); } } }

The code that uses Excel's Find in VBA is

Set Loc = ThisWorkBook.Sheets(n).UsedRange.Cells.Find(What:="cabbage")

Any Ideas? I have tried things like my $loc = $Sheet->UsedRange->Cells->Find({What}=>"cabbage") and variations to no avail.

As always, I hope I have typed this in correctly. Please excuse any typos. And I cannot use any CPAN.

Replies are listed 'Best First'.
Re: Can't Find my way in Excel
by Corion (Pope) on Sep 17, 2019 at 13:07 UTC

    Have you tried

    $loc = $Sheet->UsedRange->Cells->Find( { What =>"cabbage" })

    The Win32::OLE documentation suggests passing a hashref when you want to use named parameters.

    Also, did you run your code with warnings or the -w command line switch? Perl will tell you about problems in your code if you let it.

    Update: Also try:

    $loc = $Sheet->UsedRange->Cells->Find( What =>"cabbage" )

      I have tried that. It errors with "Undefined subroutine &main...". $Sheets, $Sheets->UsedRange, and $Sheets->UsedRange->Cells all return Win32 hashes, but I can't get that last step with Find.

        Please tell us the exact and complete error message. Otherwise, we can't know what exactly goes wrong with your script.

Re: Can't Find my way in Excel
by FreeBeerReekingMonk (Deacon) on Sep 17, 2019 at 15:35 UTC
    Here is a fresh point of view:

    0. Before a query, find all your xls files

    1. Keep a stored hash of MD5 checksum's and filenames. This will tell you if a xls has changed and it's data needs to be re-extracted.

    2. For each new Excel file not yet with MD5 (or with different checksum), convert the excel to xlsx and extract the ./xl/sharedStrings.xml (which contains all the texts from all tabs)

    3. Grep inside all sharedStrings.xml for faster response times (adjust your grep parameters, like grep -l)

    For example, Libreoffice has:

    libreoffice --headless --convert-to
    If you do not have access to install a current version of LibreOffice then there are macros in VBS (XlsToCsv.vbs on S.O.) that output to multiple csv files.

    On Linux there might also be unoconv in your repository.

    oh, and if you give up rolling your own implementation, why not use a ready-made solution: DocFetcher

Re: Can't Find my way in Excel
by rjt (Curate) on Sep 17, 2019 at 14:52 UTC
    For some spreadsheets this can take almost a minute if they have hunreds of columns and thousands of rows and dozens of sheets. When I have thousands of documents, ...

    Corion already has you well covered on your specific API question, so let me instead go off on a slightly tangential algorithmics discussion to maybe get you thinking out of the box (cell?) on this one:

    While you may well be able to effect some (possibly significant) linear improvements to such a search, that's all you'll ever be able to do under your current algorithm. Consider that your algorithm is proportional to cols*rows*sheets*files*avg(length(cell)), which, given your numbers (above) conservatively puts the number of comparisons into the hundreds of billions. "Proportional to" implies the rather large constant (or maybe even worse?) overhead hidden in $Sheet->Cells($row, $col)->Value. I am surprised it runs as quickly as it does.

    In other words, it would really help to know what sort of larger task this Excel reading problem fits into. Looking at all of the factors in the complexity, are there any you can optimize away? For example, maybe you're scanning all of the columns but only using a few. Two orders of magnitude right there, if you can identify those columns before processing the whole file (headers? Pattern match the first few rows?). Or (and this is the part I'm personally struggling to find a rationale for), how are thousands of hundreds-of-columns, dozens-of-sheets Excel files ever the best storage solution? Is the information relational (SQL)? Unfortunately at this point, the cost of translating the Excel files to a more efficient format is probably at least as bad as what you're trying to do, so it may not be a win unless you (ever!) need to run your algorithm more than once. However, if you can automate a .csv export from whatever program created these files in the first place (assuming you didn't pay for three calendar centuries of data entry), that might be a significant improvement.1

    As you're already looking to optimize performance, I reckon there's a decent chance you'll need to do this operation more than once. If so, there may be efficiencies you can build in, such as indexing, only re-processing changed files, etc., to eliminate a lot of that penalty. This sort of task might lend itself quite well to parallel processing, even on a single machine. Benchmarking strongly recommended, obviously.

    You may well have considered many/most/all of these things, but then again, I don't think there's a monk here who has never had the blinders on at some point when trying to finish a project.

    And I cannot use any CPAN.

    Why is that? You're already using Win32::OLE::Const, which is not core. This good old link might come in handy: Yes, even you can use CPAN

    Hope this helps!

    1. .csv is quite possibly cheaper full-stop, but also cheaper in the sense that if your matches are relatively sparse (< ~50%), you can win by processing files line-wise, rather than cell-wise. If the line doesn't match, move on. Otherwise, only then do you need to parse and match individual cells.

      Thanks for the thoughts. It is not a case of how data is stored. It is a matter of looking for things in existing Excel spreadsheets. I agree that handling the data in another format would vastly simplify the search process. But I am stuck looking at Excel spreadsheets in .xls or .xlsx format. And no, I absolutely cannot do CPAN.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://11106304]
Approved by toolic
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (2)
As of 2020-09-26 02:27 GMT
Find Nodes?
    Voting Booth?
    If at first I donít succeed, I Ö

    Results (141 votes). Check out past polls.