Unefficient Regexp 'Matching this or that'

pelagic has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks!

I got this case of an unefficient regexp handling when matching strings in a large file:

To look for ONE string takes 2 seconds
while looking for TWO strings takes 79 seconds. Here is the code:

use strict;
use Benchmark;

my $file = shift || 'no_file';

timethese(
    1,
    {
        'one_string'  => sub { one_string() },
        'two_string'  => sub { two_string() },
    }
);

sub one_string {
    my $filter = '00901808';
    my $re = qr/$filter/o;
    my @matched;
    open (my $FH, "<$file");
    while (my $rec = <$FH>) {
    if ( $rec =~ $re) {
            push @matched, $rec;
    }
    }
    close $FH;
}

sub two_string {
    my $filter = '00901808|87654321';
    my $re = qr/$filter/o;
    my @matched;
    open (my $FH, "<$file");
    while (my $rec = <$FH>) {
    if ( $rec =~ $re) {
            push @matched, $rec;
    }
    }
    close $FH;
}

__END__
[download]

The result says:

# perl bench_regexp 100000lines.92MB.file 
Benchmark: timing 1 iterations of one_string, two_string...
one_string:  2 wallclock secs ( 1.68 usr +  0.42 sys =  2.10 CPU) @  0
+.48/s (n=1)
            (warning: too few iterations for a reliable count)
two_string: 77 wallclock secs (76.13 usr +  0.59 sys = 76.72 CPU) @  0
+.01/s (n=1)
            (warning: too few iterations for a reliable count)
[download]

pelagic

Comment on Unefficient Regexp 'Matching this or that' Select or Download Code

Replies are listed 'Best First'.
Re: Unefficient Regexp 'Matching this or that' by bart (Canon) on May 19, 2010 at 08:55 UTC
Alternatives are slow. Jeffrey Friedl wrote about it in his book about regexes, and your benchmark proves it. I expect an improvement if you just add a common lookahead: `$filter = qr/(?=[08])(?:00901808\|87654321)/;` [download] but I have no way to test it without your data file. See also the modules Regex::PreSuf, Regexp::List and Regexp::Assemble for simpler solutions, as they can build a (hopefully) more efficient regex for you. p.s. I just found this blog post about just this situation. Alternation is usually slow in Perl because the engine has to backtrack when trying each alternative. It's much faster to give perl a character sieve up front, e.g. (?=cdb) and then factor out common prefixes and suffixes. The problem is that when you have a ton of alternatives, doing all this is a pain and it decreases readability to almost zero. Which is why I had avoided it to date...	[reply] [d/l]
Re^2: Unefficient Regexp 'Matching this or that' by pelagic (Priest) on May 19, 2010 at 09:33 UTC
Here goes my test for your suggestion: `Benchmark: timing 3 iterations of 20.two_string, 30.two_string_lookahe +ad... 20.two_string: 236 wallclock secs (234.31 usr + 1.62 sys = 235.93 CPU +) @ 0.01/s (n=3) 30.two_string_lookahead: 178 wallclock secs (176.23 usr + 1.52 sys = +177.75 CPU) @ 0.02/s (n=3)` [download] So it's better, but still not linear ... pelagic	[reply] [d/l]
Re: Unefficient Regexp 'Matching this or that' by moritz (Cardinal) on May 19, 2010 at 08:35 UTC
Try also to run them in reversed order - maybe reading the file is slow the first time, and much faster then second time because the file is then in some cache from the operating system. Also please compare the number of matches found in both cases - is it comparable? Perl 6 - links to (nearly) everything that is Perl 6.	[reply]
Re^2: Unefficient Regexp 'Matching this or that' by pelagic (Priest) on May 19, 2010 at 08:47 UTC
I actually tried it in reverse order also by calling the tests: `01_two_string 02_one_string` [download] And the number of matches is the same, I just stripped this away to make the code smaller. pelagic	[reply] [d/l]
Re^3: Unefficient Regexp 'Matching this or that' by moritz (Cardinal) on May 19, 2010 at 08:58 UTC
Another idea: The regex engine has a heavy optimization for looking for literal substrings, using the same algorithm that index uses. In the case of one literal that can be used, but I don't think it's used in alternations. You can test this hypothesis by comparing regexes of two and three alternatives - if I'm right, then both should be roughly equal in speed, and much slower than a single, literal string.	[reply]
Re^4: Unefficient Regexp 'Matching this or that' by pelagic (Priest) on May 19, 2010 at 09:12 UTC
Re: Unefficient Regexp 'Matching this or that' by BrowserUk (Patriarch) on May 19, 2010 at 10:11 UTC
You might consider using separate passes for each string to be found (or a loop over them). The following, with values suited to a large file I had kicking around, ran substantially faster than the alernation regex: `sub two_pass { my @matched; open (my $FH, "<$file"); while (my $rec = <$FH>) { if ( $rec =~ '500_000' ) { push @matched, $rec; } if ( $rec =~ '500_001' ) { push @matched, $rec; } } print "two pass: " . @matched; close $FH; }` [download] Actually, it consistently runs faster than the single string sub, even if I force it to run first, last or in the middle. Whatever position, the timings of all three methods remain very consistent I don't have an explanation for that? I've noticed in the past that with the 5.10+ trie optimisation, alternation regexes are sometimes slower than they are under 5.8. There seems to be a "sweet spot" where it comes into its own: neither too few nor too many alternations and it flies, but below or above those limits and it actually takes longer. For example, on 5.10, I get these (quite surprising) results consistently: c:\test>junk31 836952.log Benchmark: timing 1 iterations of AAtwo_pass, BBtwo_string, CCone_string ... two pass: 10 AAtwo_pass: 6 wallclock secs ( 5.69 usr + 0.78 sys = 6.47 CPU) @ 0 +.15/s (n=1) (warning: too few iterations for a reliable count) two string: 10 BBtwo_string: 15 wallclock secs (14.34 usr + 0.73 sys = 15.07 CPU) @ + 0.07/s (n=1) (warning: too few iterations for a reliable count) one string: 7 CCone_string: 15 wallclock secs (14.76 usr + 0.76 sys = 15.52 CPU) @ + 0.06/s (n=1) (warning: too few iterations for a reliable count) [download] But on 5.8 I get these rather more understandable results: c:\test>\perl32\bin\perl junk31.pl 836952.log Benchmark: timing 1 iterations of AAtwo_pass, BBtwo_string, CCone_string ... two pass: 10 AAtwo_pass: 8 wallclock secs ( 6.60 usr + 0.69 sys = 7.29 CPU) @ 0 +.14/s (n=1) (warning: too few iterations for a reliable count) two string: 10 BBtwo_string: 53 wallclock secs (52.99 usr + 0.70 sys = 53.70 CPU) @ + 0.02/s (n=1) (warning: too few iterations for a reliable count) one string: 7 CCone_string: 7 wallclock secs ( 5.60 usr + 0.92 sys = 6.52 CPU) @ + 0.15/s (n=1) (warning: too few iterations for a reliable count) [download] Maybe it shoudl be possible to disable the trie optimisation for thos cases of alternation where it doesn't benefit. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. RIP an inspiration; A true Folk's Guy	[reply] [d/l] [select]
Re^2: Unefficient Regexp 'Matching this or that' by moritz (Cardinal) on May 19, 2010 at 10:28 UTC
Maybe it shoudl be possible to disable the trie optimisation for thos cases of alternation where it doesn't benefit. Maybe you can disable trie optimizations by setting ${^RE_TRIE_MAXBUF} = 0? Perl 6 - links to (nearly) everything that is Perl 6.	[reply]
Re^3: Unefficient Regexp 'Matching this or that' by BrowserUk (Patriarch) on May 19, 2010 at 10:33 UTC
That seems to work. !5 seconds instead of 53 for the alternation is much more reasonable. Still doesn't beat the two pass method, but there is probably a break point somewhere. c:\test>junk31 836952.log Benchmark: timing 1 iterations of AAtwo_pass, BBtwo_string, CCone_string ... two pass: 10 AAtwo_pass: 7 wallclock secs ( 5.60 usr + 0.83 sys = 6.43 CPU) @ 0 +.16/s (n=1) (warning: too few iterations for a reliable count) two string: 10 BBtwo_string: 15 wallclock secs (14.35 usr + 0.94 sys = 15.29 CPU) @ + 0.07/s (n=1) (warning: too few iterations for a reliable count) one string: 7 CCone_string: 16 wallclock secs (14.76 usr + 0.73 sys = 15.49 CPU) @ + 0.06/s (n=1) (warning: too few iterations for a reliable count) [download] I'm still puzzled why the two pass takes less time than the one string, but the numbers are right and I can't anything major wrong with the benchmark?	[reply] [d/l]
Re: Unefficient Regexp 'Matching this or that' by rovf (Priest) on May 19, 2010 at 10:41 UTC
If you change in the one_string version the filter from `'00901808'` to `'87654321'`, does the execution time stay the same? Also, if you replace the usage of regexps by a call to the `index` function (which in your case should be the same), and then compare the one_string with the two_string version, how does the time increase in the second case? -- Ronald Fischer <ynnor@mm.st>	[reply] [d/l] [select]
Re: Unefficient Regexp 'Matching this or that' by JavaFan (Canon) on May 19, 2010 at 16:35 UTC
What version of perl are you using? I think this case got a heavy speed up in 5.10. How does your data look like? How many matches are there in each case? Can it be that the optimizer decides "no match" for the 00901808 case, while it cannot do that for 87654321? How does `/00901808/ \|\| /87654321/` perform?	[reply] [d/l]
Re^2: Unefficient Regexp 'Matching this or that' by BrowserUk (Patriarch) on May 19, 2010 at 18:01 UTC
I think this case got a heavy speed up in 5.10. Not in all cases. See Re: Unefficient Regexp 'Matching this or that'. For low and high numbers of alternations, it is actually a pessimisation.	[reply]

Back to Seekers of Perl Wisdom