Re: Inexplicably slow regex
by GrandFather (Saint) on Sep 12, 2006 at 21:36 UTC
use strict;
use warnings;
use Benchmark qw(cmpthese);
my $key = 'x';
my $str = ("\n The quick brown frog jumped over the lazy dog" x 10000) . "\n x";

cmpthese (-10,
    {
        sol => \&sol,
        nl  => \&nl,
        lb  => \&lb,
    }
);
sub sol {$str =~ /^ \s* $key /mx;}
sub nl {$str =~ /\n \s* $key /mx;}
sub lb {$str =~ /(?<![^\n]) \s* $key /mx;}
         Rate     sol      lb      nl
sol   0.316/s      --    -99%   -100%
lb     35.0/s  10997%      --    -92%
nl      441/s 139470%   1158%      --
DWIM is Perl's answer to Gödel
FWIW, I see the opposite behavior when $key = 'lazy'; (i.e., when $key matches often):
       Rate    lb    nl   sol
lb   10.3/s    --  -91%  -91%
nl    109/s  965%    --   -9%
sol   120/s 1066%    9%    --
I think the lookback is what is killing you here. The first two REs are anchored with a constant (^, \n), so they severely limit the possible starting points. The third has to look at every character in your huge file before it can decide whether each one is a valid starting point (if I understand how the Perl RE engine works).
So in the benchmark above, the first two only have to try matching about 10,000 times (since there are 10,000 valid starting points), while the last one has to consider about 480,000 possible starting points (since every character could be valid).
(For the interested, try looking at the output of -Mre=debug on each of these - it was quite interesting. Thanks to the kind monk on the CB who reminded me of it!)
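For anyone who wants to reproduce that, here is a minimal sketch using the re pragma (the sample string is my own, scaled well down from the benchmark above); the compile and match trace it writes to STDERR shows which starting points the engine actually tries:
use strict;
use warnings;
use re 'debug';    # same effect as running with -Mre=debug on the command line

my $key = 'x';
my $str = ("\n The quick brown frog jumped over the lazy dog" x 5) . "\n x";

$str =~ /^ \s* $key /mx;    # the slow "sol" form; swap in the other two regexes to compare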
I think the lookback is what is killing you here.
But the question is why the "sol" version is so much slower than even the "lb" version.
I suspect it all boils down to optimization. All three cases could, in theory, anchor to "\n" characters but, in practice, the optimizer may not be smart enough to realize this.
My interpretation (aka "wild guess") of GrandFather's numbers:
         Rate     sol      lb      nl
sol   0.316/s      --    -99%   -100%
lb     35.0/s  10997%      --    -92%
nl      441/s 139470%   1158%      --
is that the "lb" regex is a bit more complex and so runs a bit slower, while the "sol" regex runs so much slower that I'd expect it to be the one that is hitting way too many possible starting points rather than jumping to key spots such as "\n" (even the extra speed from Boyer-Moore probably doesn't apply here, since I don't think any of these regexes are simple enough).
Yes, I've been hoping someone would use -Mre=debug and summarize what it reported on a system that saw the "sol" regex being especially slow. I think that is the most likely route to explaining the "problem". Then it would be interesting to compare that against what it reports for systems that don't see "sol" being so slow.
Re: Inexplicably slow regex
by Skeeve (Parson) on Sep 12, 2006 at 18:59 UTC
Can't confirm your observations:
use Benchmark qw( cmpthese );
$_ = '
df asdf sadf s f
df asdf sadf s f
df asdf sadf s f
df asdf sadf s f
df asdf sadf s f
df asdf sadf s f
df asdf sadf s f
df asdf sadf s f
';
our $KEY="df";
cmpthese(-3, {
    caret => '/^ \s* $KEY /mx',
    cr    => '/\n \s* $KEY /mx',
});
Result
            Rate caret    cr
cr     1118569/s    --   -1%
caret  1133366/s    1%    --
s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
Even with 0.5MB: Still can't confirm!
use Benchmark qw( cmpthese );

our $KEY = "df";

{
    local $/;
    open my $hosts, '<', '/etc/hosts' or die "Cannot open /etc/hosts: $!";
    $_ = <$hosts>;
    close $hosts;
}

$_ = $_ x (0.5 * 1024 * 1024 / length $_);
s/$KEY/XX/g;           # remove any existing matches
$_ .= "\n df ";        # add a single match at the very end
print length $_, "\n";

cmpthese(-3, {
    caret => '/^ \s* $KEY /mx',
    cr    => '/\n \s* $KEY /mx',
});
Result
524114
            Rate    cr caret
cr     1108181/s    --   -1%
caret  1118184/s    1%    --
s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
It looks like it takes many lines with a partial match to slow things down. Try executing this line before adding the matching line:
s/^/ /mg;
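For instance, a sketch of the earlier benchmark with that modification folded in might look like this (the filler line and repeat count are my own choices, picked so that every line starts with whitespace but contains no match until the very end):
use Benchmark qw( cmpthese );

our $KEY = "df";

# Many lines that start with whitespace but contain no "df" ...
$_ = " no match on this line\n" x 10_000;
s/^/ /mg;          # indent every line so ^ \s* has real work to do
$_ .= "\n df ";    # ... and a single real match at the very end

cmpthese(-3, {
    caret => '/^ \s* $KEY /mx',
    cr    => '/\n \s* $KEY /mx',
});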
Perhaps the difference is that the text I'm searching (read from a file) is 0.5MB. Also, I only expect one match in the entire file so your example may not show a difference since a match is found immediately.
Re: Inexplicably slow regex
by gaspodethewonderdog (Monk) on Sep 12, 2006 at 19:00 UTC
I think you've left out some data... like what the contents of your variable before the =~ are. The /m is clearly changing the behavior of the ^; I'm assuming you're using it because you're searching through an entire file. Given that /m changes the behavior of ^, it likely compiles into something much more complicated than a plain \n. I know the docs state that ^ and $ become \n or equivalent, but it may evaluate to something much more involved. Finally, you're not really providing context for how you're actually executing this code. I can imagine that if this is within an eval or something, it would take a hit if the regex has to be reinterpreted every time it's eval'd... the second regex has much less processing to do than the first one.
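As a small illustration of what /m does to ^ (the strings below are made up for the example):
my $text = "first line\nsecond line\n";

# Without /m, ^ anchors only at the very start of the string:
my @no_m   = $text =~ /^(\w+)/g;     # ("first")

# With /m, ^ also anchors right after every "\n":
my @with_m = $text =~ /^(\w+)/mg;    # ("first", "second")

print "without /m: @no_m\nwith /m: @with_m\n";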
The data I'm searching is a 0.5MB text file. The search is performed only once, so evaluating the regex shouldn't be an issue. I'm sure Perl is doing something complicated internally with the ^; I'd like to know what. I'd also like a simpler alternative to my negative look-behind assertion on a negated character class to replace the ^ and still get decent performance.
Re: Inexplicably slow regex
by duff (Parson) on Sep 12, 2006 at 18:15 UTC
I don't have an answer to your question but could you show your benchmarks for those regular expressions?
I'd also be interested in seeing the benchmarks. I can add one thing: the last regex isn't the same as the others because the [^\n] character class is (accidentally) a negated character class. Rearranging it to [\n^] should fix that.
Hays
Nothing accidental about the negation of that character class. I'm using it with a negative look-behind assertion. It's supposed to mean: "it is not true that the preceding character is not a line break". So it's either a line break, or there's nothing there at all (the beginning of the string).
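A quick way to check that reading (the test strings and key word here are mine, just for illustration):
use strict;
use warnings;

# (?<![^\n]) succeeds when the character before the current position is not
# a non-newline: i.e. it is a newline, or there is no preceding character at all.
my %cases = (
    'start of string' => "key first",
    'after a newline' => "line one\nkey next",
    'mid-line'        => "not a key boundary",
);

for my $name (sort keys %cases) {
    my $hit = $cases{$name} =~ /(?<![^\n]) \s* key /x ? "matches" : "no match";
    print "$name: $hit\n";
}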
Processing a file approximately 0.5MB in size, using gettimeofday for timing, I get
First version: 0.7 seconds
Second version: 0.003 seconds
Third version: 0.03 seconds
My actual regexes are slightly more complicated than the examples given so I see little speed difference between #2 and #3.
Problem Identified
by Anonymous Monk on Sep 13, 2006 at 17:50 UTC
Thanks to Zigon for suggesting the -Mre=debug switch, which helped explain what's going on. And thanks to GrandFather and duckyd for providing insightful examples to help isolate the problem. And thanks to everyone else who spent any effort trying to help.
Here's what it appears is happening in the ^ case:
1. The regex engine looks for a match for the fixed pattern $KEY starting from the current position.
2. If that pattern does not occur in the string then there can be no match and the regex fails.
3. If that pattern does match, the engine backs up to the current position and starts matching the pattern from its beginning.
4. When #3 fails, the engine advances the current position to the next line and goes back to #1, instead of #2 as might be expected.
So the problem is that the engine performs a scan of the input for the fixed pattern $KEY 10,000 times - the first time starting on line 1, the next starting on line 2, etc. This is an O(n²) operation, where n is the length of the string. For some reason the other two versions don't repeat step 1, so they are much faster for large input.
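To make the cost concrete, here is a rough model of that behavior in plain Perl - it is only a sketch of the steps described above (the filler line and counts are mine), not what the engine actually executes:
# Model only: from each candidate line start, rescan forward for the fixed
# string $KEY, fail the anchored match, then move on to the next line.
my $KEY = 'x';
my $str = ("\n line without the key here" x 10_000) . "\n x";

my ($pos, $chars_scanned) = (0, 0);
while ($pos < length $str) {
    my $found = index $str, $KEY, $pos;    # steps 1-2: scan for the fixed pattern
    last if $found < 0;
    $chars_scanned += $found - $pos;       # cost of this scan
    # Step 3 would now try "^ \s* $KEY" at $pos and fail on these filler lines,
    # so step 4 advances to the next line and the scan starts over (step 1).
    my $nl = index $str, "\n", $pos + 1;
    last if $nl < 0;
    $pos = $nl;
}
print "characters examined: $chars_scanned\n";   # grows roughly as n**2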
Although there may be design issues requiring this behavior, from my perspective it appears to be a bug. Does anyone know where I should report this? Or whether it's a known shortcoming of the regex engine design and I shouldn't bother?