Efficient regex search on array table

by Polyglot (Chaplain)
on Dec 14, 2022 at 17:05 UTC

Polyglot has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I feel like I may be asking for the impossible, but seeing as there are many programmers here of superior ability, I am hoping there are significant improvements that I would never have thought of. (And speed is a rare need for me, as most of my scripts are one-time-use or quick enough that efficiency is of minor concern.)

I'm searching what amounts to a table of 30,000+ rows, with from one to five "columns." In Perl, each of these "columns" is an array (list), and each array has an identical number of rows/items. The search applies a regular expression to each column of each row, and each column may have a different expression.

For example, suppose we have a table like this:

Row 1:
    Main:     The red fox jumped over the hollow log.
    Comp. 1:  The vixen jumped over the brown log.
    Comp. 2:  The red fox leaped over the hollow log.
    Comp. 3:  The gray fox jumped over the big log.
    Comp. 4:  The lazy fox did not jump over the log.

Row 2:
    Main:     The tall tree towered over the animals.
    Comp. 1:  The tree swayed smartly in the field.
    Comp. 2:  The animals looked up to the tall tree.
    Comp. 3:  The tall tree grew in the field.
    Comp. 4:  The towering tree shaded the animals.

Row 3:
    Main:     The tawny deer jumped over the fence.
    Comp. 1:  The doe jumped over the barrier.
    Comp. 2:  The brown doe leaped over the picket fence.
    Comp. 3:  The red deer cleared the tall fence.
    Comp. 4:  The doe-eyed doe sprang over the wall.


And queries like this:

    Main:     Match:     (hollow log)|(fence)
    Comp. 1:  Match:     (jumped)
    Comp. 2:  NOT match: (vixen)|(animals)
    Comp. 3:  NOT match: (red fox)
    Comp. 4:  Match:     (jump)|(leap)|(sprang)


This is just an English example to help understand the situation. The actual "columns" may represent different languages, i.e. translations, of the same thing, and each language/column will be searched with its own regular expression.

The data coming into the subroutine includes the regex to use for each column (up to five--if one is empty, that column needs no handling), the array for each column (skipped/empty if no regex was provided for that column), and a boolean for each column indicating whether the regex should match or should NOT match.

The expected results from the query would be the values of the first column for rows 1 and 3; the middle row would not match. Only the first column's values get returned; the other columns are merely for comparison purposes--looking for similarities or contrasts to the "original" (main) column.

Here are the important bits of my code (abridged for better focus and readability):

sub processComparison {

    # INCOMING COMPARISON-COLUMN DETAILS
    my (
        $table,            #MAIN COLUMN NAME
        $ACCP_searchver1,  #COMPARISON COLUMN NAMES/SELECTIONS
        $ACCP_searchver2,
        $ACCP_searchver3,
        $ACCP_searchver4,
        $ACCP_comp1,       #ORIGINAL USER-ENTERED QUERY
        $ACCP_comp2,
        $ACCP_comp3,
        $ACCP_comp4,
        $ACCP1_regex,      #USER SELECTION FOR REGEX vs. SIMPLE MATCH
        $ACCP2_regex,
        $ACCP3_regex,
        $ACCP4_regex,
        $accpyn1,          #USER SELECTION FOR MATCH/NO MATCH
        $accpyn2,
        $accpyn3,
        $accpyn4
    ) = @_;

    # INCOMING ARRAY
    my $regex1 = &composeRegex($ACCP1_regex,$ACCP_comp1);
    my $regex2 = &composeRegex($ACCP2_regex,$ACCP_comp2);
    my $regex3 = &composeRegex($ACCP3_regex,$ACCP_comp3);
    my $regex4 = &composeRegex($ACCP4_regex,$ACCP_comp4);

    my @main        = &getTableFC($table);
    my @crosscheck1 = &getTableFC($ACCP_searchver1) if ($regex1);
    my @crosscheck2 = &getTableFC($ACCP_searchver2) if ($regex2);
    my @crosscheck3 = &getTableFC($ACCP_searchver3) if ($regex3);
    my @crosscheck4 = &getTableFC($ACCP_searchver4) if ($regex4);

    my $linecount = 0;
    my ($line, $line1, $line2, $line3, $line4) = ('','','','','');

    foreach $line ( @main ) {

        $line1 = $crosscheck1[$linecount];
        $line2 = $crosscheck2[$linecount];
        $line3 = $crosscheck3[$linecount];
        $line4 = $crosscheck4[$linecount];

        $line  =~ s/^\s+|\s+$//; chomp $line;
        $line1 =~ s/^\s+|\s+$//; chomp $line1;
        $line2 =~ s/^\s+|\s+$//; chomp $line2;
        $line3 =~ s/^\s+|\s+$//; chomp $line3;
        $line4 =~ s/^\s+|\s+$//; chomp $line4;

        my ($r1,$r2,$r3,$r4) = (0,0,0,0);    #USING THESE TO TALLY MATCHES

        # CHECK REGEX MATCHES FOR COMPARISON COLUMNS
        if ($regex1) {
            $r1++;
            if ($accpyn1) { if ($line1 =~ m/$regex1/) { $r1++ } }
            else          { if ($line1 !~ m/$regex1/) { $r1++ } }
        };
        if (($regex2) && ($r1!=1)) {
            $r2++;
            if ($accpyn2) { if ($line2 =~ m/$regex2/) { $r2++ } }
            else          { if ($line2 !~ m/$regex2/) { $r2++ } }
        };
        if (($regex3) && ($r1!=1) && ($r2!=1)) {
            $r3++;
            if ($accpyn3) { if ($line3 =~ m/$regex3/) { $r3++ } }
            else          { if ($line3 !~ m/$regex3/) { $r3++ } }
        };
        if (($regex4) && ($r1!=1) && ($r2!=1) && ($r3!=1)) {
            $r4++;
            if ($accpyn4) { if ($line4 =~ m/$regex4/) { $r4++ } }
            else          { if ($line4 !~ m/$regex4/) { $r4++ } }
        };

        # LINE UP MATCH RESULTS TALLIES
        if ( ($r1!=1) && ($r2!=1) && ($r3!=1) && ($r4!=1) ) {
            #PASSED COMPARISON FILTER
            #DO CODE TO FORMAT & RETURN $line FOR MAIN COLUMN
        }

        $linecount++;    # advance to the same row in the crosscheck arrays

    } # end foreach $line

} #END SUB processComparison


As you may notice, I have attempted to exit early from checks that are found to be no longer necessary. If any single column fails to match its regex, the entire row will fail--so there is no need to keep testing the remaining columns. I have also pre-established my regular expressions, though I am not sure if there is a way to improve this. The expression used for each column remains the same for all rows in the table.

At present, the average time to return results, when using a regular expression of moderate complexity, is between 45 and 65 seconds, and in checking the processing times for various segments, this time is mostly (98%) concentrated in the regex-matching section. If it is strictly a "match"/"no match" test for a simple word, without any regex alternations or other complexities involved, I have seen as little as 11 or so seconds. Even that seems a little longer than I would wish, seeing as the results will be returned to the client's browser.

Can this be streamlined any more?

Blessings,

~Polyglot~

Replies are listed 'Best First'.
Re: Efficient regex search on array table
by jwkrahn (Abbot) on Dec 15, 2022 at 03:58 UTC
    my @crosscheck1 = &getTableFC($ACCP_searchver1) if ($regex1);
    my @crosscheck2 = &getTableFC($ACCP_searchver2) if ($regex2);
    my @crosscheck3 = &getTableFC($ACCP_searchver3) if ($regex3);
    my @crosscheck4 = &getTableFC($ACCP_searchver4) if ($regex4);

    From perlsyn:

    NOTE: The behaviour of a "my", "state", or "our" modified with a statement modifier conditional or loop construct (for example, "my $x if ...") is undefined. The value of the "my" variable may be "undef", any previously assigned value, or possibly anything else. Don't rely on it. Future versions of perl might do something different from the version of perl you try it out on. Here be dragons.

Re: Efficient regex search on array table
by sectokia (Pilgrim) on Dec 15, 2022 at 02:35 UTC

    Some suggestions:

    Use Devel::NYTProf to actually benchmark properly and really see where your time is going.

    Use the 'next' keyword to go to the next iteration of the loop when you know the row cannot match anymore; you don't need those rX counters. Example:

    if ($regex1) {
        next if ( ($accpyn1) && ($line1 !~ m/$regex1/));
        next if ((!$accpyn1) && ($line1 =~ m/$regex1/));
    }
      I'm still puzzling over this suggestion. I understand the "next if" part well enough, but the dropping of my counters? I'm using the counters as part of the logic to tell me if the line has matched after all the columns have been checked (I don't know until then). It seems to my mind, and perhaps I am just slow of understanding, that the "next if" cannot be applied to all columns at once. Hence the counters.

      But if there were some way of adding the count at the same time as checking for a match, in one-liner fashion, that might be workable.
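      Perhaps something like tallying the failures in one statement per column would do it (an untested sketch, reusing the variables from my code above):

      my $failed = 0;
      $failed++ if $regex1 && ( $accpyn1 ? $line1 !~ m/$regex1/ : $line1 =~ m/$regex1/ );
      $failed++ if $regex2 && ( $accpyn2 ? $line2 !~ m/$regex2/ : $line2 =~ m/$regex2/ );
      $failed++ if $regex3 && ( $accpyn3 ? $line3 !~ m/$regex3/ : $line3 =~ m/$regex3/ );
      $failed++ if $regex4 && ( $accpyn4 ? $line4 !~ m/$regex4/ : $line4 =~ m/$regex4/ );
      next if $failed;    # any failing column disqualifies the whole row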

      Blessings,

      ~Polyglot~

      I'm responding again to this idea because your suggestion gained traction with me as I puzzled out a solution. I managed to eliminate the counters, as you suggested, and the code executes more efficiently now. Here's what that portion is now reduced to:
      next if ( ($regex1) && ( (($accpyn1) && ($line1 !~ m/$regex1/o)) || (!($accpyn1) && ($line1 =~ m/$regex1/o)) ) );
      next if ( ($regex2) && ( (($accpyn2) && ($line2 !~ m/$regex2/o)) || (!($accpyn2) && ($line2 =~ m/$regex2/o)) ) );
      next if ( ($regex3) && ( (($accpyn3) && ($line3 !~ m/$regex3/o)) || (!($accpyn3) && ($line3 =~ m/$regex3/o)) ) );
      next if ( ($regex4) && ( (($accpyn4) && ($line4 !~ m/$regex4/o)) || (!($accpyn4) && ($line4 =~ m/$regex4/o)) ) );
      And with that logic, the counters became unnecessary. So thank you for the suggestion--it has been put to practical use.

      Blessings,

      ~Polyglot~

Re: Efficient regex search on array table
by kcott (Archbishop) on Dec 16, 2022 at 09:10 UTC
    "Can this be streamlined any more?"

    Short answer: certainly. Longer answer: read on ...

    Firstly, two things you should stop doing:

    • The syntax for an if statement has no terminal semicolon; an if statement modifier may have a terminal semicolon. So, write "if (CONDITION) BLOCK" or "... if CONDITION;". The statement may be extended with else and elsif blocks (which don't have terminal semicolons either). See perlsyn for full details.
    • Do not call subroutines with a leading '&' unless you know exactly what it does and why you are doing it. A leading '&' was the normal way to call subroutines in Perl4 (and earlier); it is not normal for Perl5. See perlsub for more on this.
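    To illustrate the two points above at once (illustrative only, with a made-up condition and sub):

    do_something() if $condition;    # statement modifier: terminal semicolon, no leading '&'

    if ($condition) {                # if statement: a block, no terminal semicolon after the brace
        do_something();
    }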

    When you find yourself writing $var1, $var2, ..., $varN, you're probably doing something wrong, or at least doing something suboptimal which can be greatly improved: often with arrays or hashes. What happens when you want to add a new language, or discontinue supporting an existing one: as it stands, you'll have to engage in a major rewriting exercise.

    Instead of returning arrays from subroutines, consider returning arrayrefs. With 30,000+ rows, each "my @array = &getTableFC(...)" statement is potentially passing a megabyte or more of data; an arrayref is a single, scalar value.
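    A sketch of the arrayref approach (the stub below merely stands in for whatever getTableFC() really does):

    sub getTableFC {
        my ($table) = @_;
        my @rows = ("row 1 of $table", "row 2 of $table");    # stand-in data
        return \@rows;           # one scalar handed back, not 30,000+ list items
    }

    my $main_rows = getTableFC('Main');
    print scalar(@$main_rows), " rows\n";    # dereference when you need the list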

    I assume that you're trying to trim leading and trailing whitespace with "s/^\s+|\s+$//". That won't work. There are a variety of ways you could do this; one would be to add a /g modifier:

    $ perl -e 'my $x = " X "; $x =~ s/^\s+|\s+$//;  print "|$x|";'
    |X |
    $ perl -e 'my $x = " X "; $x =~ s/^\s+|\s+$//g; print "|$x|";'
    |X|

    Perl's string handling functions and operators are almost always measurably faster than a regex achieving the same functionality. I see no reason to use any regexes in your code. I'd recommend using index().
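    For a plain substring test, that means something like this (a trivial illustration):

    my $line = 'The red fox jumped over the hollow log.';
    print "found\n" if index($line, 'hollow log') > -1;    # no regex engine involved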

    I've rewritten your code, taking all of the above into account. First some notes:

    • I've no idea where your base data comes from. Presumably a disk file, a database, or whatever. I've just written a GET_DATA routine to simulate this.
    • I don't know how your users supply queries. I used the one from your OP, and added a second, in @user_tests. I loop through these to simulate multiple queries.
    • I've only used some of your data: "Main", "Comp. 1", and "Comp. 3". I've given them more meaningful names: "ref_lang", "Lang A", and "Lang Q", respectively. This was mainly to show that the code is not bound to any specific number of languages; not wanting to copy-paste all five of your languages was also a factor.
    • I've pared down processComparison() to a bare minimum; even with a couple of helper routines, it's still substantially smaller than the original.
    • The "# For demo" code equates to your "#DO CODE TO FORMAT ...". I didn't know what you wanted here, so I've just output the data.

    Here's the code:

    #!/usr/bin/env perl

    use strict;
    use warnings;

    sub GET_DATA ();

    my @user_tests = (
        {
            ref_lang => { match => 1, strings => ['hollow log', 'fence'], },
            'Lang A' => { match => 1, strings => ['jumped'], },
            'Lang Q' => { match => 0, strings => ['red fox'], },
        },
        {
            ref_lang => { match => 1, strings => ['tree', 'fence'], },
            'Lang A' => { match => 1, strings => ['tree', 'doe'], },
            'Lang Q' => { match => 0, strings => ['gray fox'], },
        },
    );

    for my $i (0 .. $#user_tests) {
        processComparison(GET_DATA, $user_tests[$i], $i);
    }

    sub processComparison {
        my ($data, $test, $index) = @_;

        my %results;

        for my $lang (keys %$test) {
            $results{$lang} = check_match($data->{$lang}, $test->{$lang});
        }

        # For demo
        use Data::Dump;
        print "User Test Index $index:\n";
        dd $test;
        print "Results:\n";
        dd \%results;
        print '-' x 60, "\n";

        return;
    }

    sub check_match {
        my ($data, $test) = @_;

        my @results;

        DATUM: for my $i (0 .. $#$data) {
            STRING: for my $string (@{$test->{strings}}) {
                next STRING unless match_ok(
                    $test->{match}, index $data->[$i], $string
                );
                push @results, $i + 1;
                next DATUM;
            }
        }

        return \@results;
    }

    sub match_ok {
        my ($match, $index) = @_;

        return $match ? $index > -1 : $index == -1;
    }

    sub GET_DATA () {
        return {
            ref_lang => [
                'The red fox jumped over the hollow log.',
                'The tall tree towered over the animals.',
                'The tawny deer jumped over the fence.',
            ],
            'Lang A' => [
                'The vixen jumped over the brown log.',
                'The tree swayed smartly in the field.',
                'The doe jumped over the barrier.',
            ],
            'Lang Q' => [
                'The gray fox jumped over the big log.',
                'The tall tree grew in the field.',
                'The red deer cleared the tall fence.',
            ],
        };
    }

    That outputs:

    User Test Index 0:
    {
      "Lang A"   => { match => 1, strings => ["jumped"] },
      "Lang Q"   => { match => 0, strings => ["red fox"] },
      "ref_lang" => { match => 1, strings => ["hollow log", "fence"] },
    }
    Results:
    { "Lang A" => [1, 3], "Lang Q" => [1, 2, 3], "ref_lang" => [1, 3] }
    ------------------------------------------------------------
    User Test Index 1:
    {
      "Lang A"   => { match => 1, strings => ["tree", "doe"] },
      "Lang Q"   => { match => 0, strings => ["gray fox"] },
      "ref_lang" => { match => 1, strings => ["tree", "fence"] },
    }
    Results:
    { "Lang A" => [2, 3], "Lang Q" => [2, 3], "ref_lang" => [2, 3] }
    ------------------------------------------------------------

    I didn't know how "the other columns are merely for comparison purposes" was supposed to work; i.e. in terms of what activities you had in mind. So, you have the data; adapt to your needs.

    — Ken

      Ken,

      They need to add a trophy or star function for superior posts like yours. Upvoting seems poor recompense for the amount of effort you put into that. Thank you--and you correctly discerned some of my failings.

      Yes, I was intending to remove leading and trailing whitespace. The reason for this is that a space on either side will throw off the matching for searches in which the user specified that the match must occur at the beginning or at the end. So the space removal is for the benefit of the regex later.

      And, yes, each array processed line-by-line will actually run between 4 MB (at the lowest end) and around 250+ MB for one particular annotated version (with full HTML mouseovers, etc.), with the average closer to 10 MB each. So you were correct that each one is over a MB. These all come from a database, with each file represented by a separate table in the DB. The routine which feeds the array pulls every row of the table at once; this is done to speed up the database portion by avoiding 30,000+ calls to the DB, one per row, and it was also my understanding that it is less expensive, time-wise, to use some RAM than to make repeated I/O calls. I may be mistaken--you seem to have a good grasp of these things, so feel free to clarify.
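      The fetch itself amounts to something like this (a DBI sketch with placeholder table, column, and credential names--not my exact code):

      use DBI;

      my ($user, $pass) = ('dbuser', 'dbpass');    # placeholders
      my $dbh = DBI->connect('dbi:mysql:database=texts', $user, $pass,
                             { RaiseError => 1 });
      my $rows = $dbh->selectcol_arrayref(
          'SELECT line_text FROM main_version ORDER BY line_id'
      );    # one round trip returns every row as an arrayref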

      The clients have two options for forming their query, and these options are individually available on a per-column basis: 1) they can use a simple, standard search, entering a keyword or phrase of their choice and then ticking checkboxes for case-sensitivity, whole-word (\bwhole-word\b) searching, must-match-at-beginning-or-end, etc.; or 2) they can tick the "Use PERL regex" option, which disables all the other options and leaves them on their own to specify what they want to match by formulating their own regular expression. The subroutine I call for returning the regex handles both alternatives, returning it in qr// form.
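      In outline, that subroutine does something like this (a heavily simplified sketch--the real composeRegex handles more options and does its own escaping rather than using quotemeta):

      sub composeRegex_sketch {
          my ($use_perl_regex, $query, %opt) = @_;
          return undef unless defined $query && length $query;

          my $pattern;
          if ($use_perl_regex) {
              $pattern = $query;                    # user supplies a raw regex
          }
          else {
              $pattern = quotemeta $query;          # simple keyword search
              $pattern = "\\b$pattern\\b" if $opt{whole_word};
              $pattern = "^$pattern"      if $opt{at_start};
              $pattern = "$pattern\$"     if $opt{at_end};
          }
          return $opt{case_sensitive} ? qr/$pattern/ : qr/$pattern/i;
      }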

      I will try out your code when I have a chance--probably won't be for another couple of days until my next window of opportunity. I very much appreciate your effort.

      By the way, I didn't see much, if any, improvement with the addition of the "o" (m//o) for matching. I think this might be because the $regex is already in qr// form--but perhaps I'm simply not aware of how that affects things.

      P.S. Oh, and by the way, I'm developing on Perl 5.12.4.

      Blessings,

      ~Polyglot~

        Thank you for your kind words. By the way, instead of "failings" (negative), think "opportunities for improvement" (positive).

        Take a look at "perlperf - Perl Performance and Optimization Techniques". There's a lot of information on benchmarking and profiling tools. Use these to determine what's fast and what's slow, where bottlenecks occur, and so on. This is a much better approach than going on gut-feeling, anecdotal evidence, and the like.

        My $work often involves dealing with biological data (tends to be measured in GB, rather than MB). Functions which return large datasets are a red-flag to me; references to such data are nearly always a better choice.

        I had thought that queries like "(hollow log)|(fence)" would result in regexes like "/(?:hollow log|fence)/". There was no indication that anything more complex was involved. Your new information indicates that's not the case. For your keyword searches, I'd still recommend index(); when using anchors (^, \b, etc.), and such like, regexes are probably the correct approach.

        I recommend you change "Use PERL regex" to "Use Perl regex": Perl is the language; perl is the program; PERL is not a thing. :-)

        Good luck with your continued optimisation efforts; and, of course, do ask if you need further help.

        — Ken

Re: Efficient regex search on array table
by dorko (Prior) on Dec 14, 2022 at 21:14 UTC
    Hello Polyglot,

    I'm not sure if this will buy you any improvements, but you might try the qr operator. You might even already be using qr depending on what &composeRegex returns.

    Quoting from the documentation:

    qr/STRING/msixpodualn

    This operator quotes (and possibly compiles) its STRING as a regular expression. STRING is interpolated the same way as PATTERN in m/PATTERN/.

    I think this example (also from the documentation) gives a good idea of what is possible.

    my $sentence_rx = qr{
        (?: (?<= ^ ) | (?<= \s ) )  # after start-of-string or
                                    # whitespace
        \p{Lu}                      # capital letter
        .*?                         # a bunch of anything
        (?<= \S )                   # that ends in non-
                                    # whitespace
        (?<! \b [DMS]r  )           # but isn't a common abbr.
        (?<! \b Mrs )
        (?<! \b Sra )
        (?<! \b St  )
        [.?!]                       # followed by a sentence
                                    # ender
        (?= $ | \s )                # in front of end-of-string
                                    # or whitespace
    }sx;

    local $/ = "";
    while (my $paragraph = <>) {
        say "NEW PARAGRAPH";
        my $count = 0;
        while ($paragraph =~ /($sentence_rx)/g) {
            printf "\tgot sentence %d: <%s>\n", ++$count, $1;
        }
    }

    I hope this helps, even if only a little.

    Cheers,

    Brent

    -- Yeah, I'm a Delt.
      Well, here are the almost-humorous results.

      I was able to use the qr// on the output of the composeRegex subroutine--so the implementation was a cinch. After doing so, I found that my return times hovered around 11 or 12 seconds. I increased the regex complexity--almost no difference to the time. I began to have hope that something had really improved. To see how much difference it had made, I again removed the qr// from the subroutine's output and tried the identical query again....11 or 12 seconds--the same as before.

      It was at this moment that I recalled having discovered a "wine" process running rampant and using cpu resources at full throttle earlier this morning, which I had killed, of course. Hah! Sluggish response just might have something to do with who else is hogging the cpu.

      And as for MacOSX having a penchant for dependency on "wine," I say it's better to stay away from the inebriating stuff.

      Unfortunately, in terms of breaking the 11-second barrier, that has not yet happened.

      Blessings,

      ~Polyglot~

        G'day Polyglot,

        Take a look at the benchmark section (at end, in spoiler) of my "Regex /o modifier: what bugs?" post from earlier today. It may give you some insight into why qr// is not providing any efficiency improvement.

        I suggest writing your own benchmark with $str and $re more closely reflecting your real code and data. use_m() & as_qr() would likely be useful for you; use_o() may be helpful but that depends on parts of your code that you haven't shown; I don't think raw_m() or raw_o() are relevant for you.
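        If it helps, a bare-bones skeleton for that kind of comparison could look like this (generic Benchmark usage--substitute your own data and patterns):

        use strict;
        use warnings;
        use Benchmark 'cmpthese';

        my $str = 'The red fox jumped over the hollow log. ' x 50;   # stand-in data
        my $pat = '(hollow log)|(fence)';
        my $re  = qr/$pat/;

        cmpthese(-3, {
            string_pat => sub { my $n = () = $str =~ /$pat/g },   # pattern interpolated from a string
            qr_pat     => sub { my $n = () = $str =~ /$re/g  },   # precompiled qr// object
        });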

        By the way, I had mentioned that I was using Perl v5.36 which could be providing optimisations; what version of Perl are you using?

        — Ken

      Thank you for this advice. I was almost sure I was using qr// somewhere in the code...I even remember making some recent use of it. But I went back and looked at that subroutine, and then searched through all parts of the script, and found that I had made very little use of qr//, and none in portions I had recently worked with. What the composeRegex subroutine is doing is basically my own version of quotemeta. As it was some time back that I labored through that portion, I know I had tried quotemeta and had some issues with it, which is why I ended up doing my own thing--but I don't remember now what the issues were. It probably had something to do with characters that I did not want to have escaped that quotemeta would have escaped.
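      The selective escaping amounts to something along these lines (an illustrative sketch, not the actual routine):

      my $query = 'red fox. (jumped)|(leaped)';                  # hypothetical user input
      my $keep  = '|()';                                         # metacharacters deliberately left unescaped
      (my $escaped = $query) =~ s/([^\w\s\Q$keep\E])/\\$1/g;     # escape everything else
      print "$escaped\n";    # red fox\. (jumped)|(leaped)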

      I will consider rewriting that portion using qr//. However, I do wonder if that will gain much on the time, as it only executes once per regex, maximum of five times for the entire script run. The time savings would have to be in how the regex was applied during the matches.

      I will give it a try--but that part is a bit complex, so I'm not sure yet how it will work out.

      Blessings,

      ~Polyglot~

Re: Efficient regex search on array table
by GrandFather (Saint) on Dec 18, 2022 at 20:27 UTC

    These matches look like they could be performed by a database select, which could be much faster than pulling a bunch of stuff from the database into memory and then searching it.
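    For instance, MySQL/MariaDB can apply the pattern server-side (a rough sketch with placeholder table and column names, assuming an existing $dbh handle):

    my $sth = $dbh->prepare(
        'SELECT main.line_text
           FROM main
           JOIN comp1 USING (line_id)
           JOIN comp2 USING (line_id)
          WHERE main.line_text REGEXP ?
            AND comp1.line_text REGEXP ?
            AND comp2.line_text NOT REGEXP ?'
    );
    $sth->execute('hollow log|fence', 'jumped', 'vixen|animals');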

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
      Most people would agree with you. There are several reasons I do not prefer this route:
      1. I will be using, or attempting to use, the full potential of a Perl regex for the query, and converting this to an SQL query would be quite complicated, perhaps impossible (at least for me).
      2. Because users will be allowed to enter their own regex for the query, not running their regex on the SQL database gives me a Taint-like layer of protection, as there is zero chance of database manipulation, and nothing will be done/executed on the server with the query output.
      3. I trust Perl over MySQL/MariaDB to securely, reliably, and efficiently meet my needs.

      Blessings,

      ~Polyglot~
