good chemistry is complicated,
and a little bit messy -LW
Efficient regex search on array tableby Polyglot (Chaplain)
|on Dec 14, 2022 at 17:05 UTC
Polyglot has asked for the wisdom of the Perl Monks concerning the following question:
I feel like I may be asking for the impossible, but seeing as there are many programmers here of superior ability, I am hoping there are significant improvements that I would never have thought of. (And speed is a rare need for me, as most of my scripts are one-time-use or quick enough that efficiency is of minor concern.)
I'm searching on what amounts to a table of 30,000+ lines, with from one to five "columns." In Perl, each of these "columns" is an array (list) where each array has an identical number of rows/items. The search applies a regular expression to each column in each row, and the expression may differ.
For example, suppose we have a table like this:
And queries like this:
This is just an English example to help understand the situation. The actual "columns" may represent different languages, i.e. translations, of the same thing, and each language/column will be searched with its own regular expression.
The data coming in to the subroutine includes the regex to use for each column (five--but if some are empty, that column does not need handling), the array for each column (which is skipped/empty if no regex for that column was provided), and whether or not the regex should match OR NOT match (boolean values for each column).
The expected results from the query would be the values of the first column for rows 1 and 3. The central row would not match. Only the first column's values get returned, the other columns are merely for comparison purposes--looking for similarities or contrasts to the "original" (main) column.
Here are the important bits of my code (abridged for better focus and readability):
As you may notice, I have attempted to exit early from loops that are found to be no longer necessary. If any single column fails to match its regex, the entire row will fail--so there is no need to extensively test the remaining columns. I have also pre-established my regular expressions, though I am not sure if there is a way to improve this. The expression used for each column will remain the same for all rows in the table.
At present, the average return, if using a regular expression of moderate complexity, is between 45 and 65 seconds, and in checking the processing times for various segments, this time is mostly (98%) concentrated in the regex-matching section. If it is strictly a "match"/"no match" for a simple word, without any regex alternations or other complexities involved, I have seen as little as 11 or so seconds. Even that seems a little bit longer than I wish, seeing as the results will be returned to the client's browser.
Can this be streamlined any more?