After a long time on a fairly ancient Perl, I'm now able to use a more modern build. But I'm running into a fairly severe regex performance regression, and I'm not sure where to start looking...
I've reduced it to this small test case, where the sub below is called with a large input file (118 MB) that's been slurped into a string:
sub parse_foo {
    my ($text) = @_;
    my $name;
    {
        last if $text =~ /\G \s* \Z/gcmsx;
        if ($text =~ /\G \s* ^ \s* begfoo \s+ (\S+?) \s* \( \s* (.*?) \s* \) \s* ;/gcmsx) { $name = $1 }
        elsif ($text =~ /\G \s* ^ \s* endfoo /gcmsx) { }
        elsif ($text =~ /\G \s* ^ \s* \S+ \s+ .*? \s* ;/gcmsx) { }
        else { die "ERROR: unknown syntax\n" }
        redo;
    }
    print "LAST FOO: $name\n";
}
Using 5.8.8, it runs in about 5 seconds. Using 5.30.0, it takes about 105 seconds. (And it's the same story when I try it on the latest stable release, 5.32.1). I ran NYTProf on both 5.8.8 and 5.30.0, and it boils down to the difference in these two lines:
5.8.8
last if $text =~ /\G \s* \Z/gcmsx;
# spent 181ms making 866465 calls to main::CORE:match, avg 208ns/call
if ($text =~ /\G \s* ^ \s* begfoo \s+ (\S+?) \s* \( \s* (.*?) \s* \) \s* ;/gcmsx) { $name = $1 }
# spent 3.74s making 2547279 calls to main::CORE:match, avg 1µs/call
5.30.0
last if $text =~ /\G \s* \Z/gcmsx;
# spent 289ms making 866465 calls to main::CORE:match, avg 334ns/call
if ($text =~ /\G \s* ^ \s* begfoo \s+ (\S+?) \s* \( \s* (.*?) \s* \) \s* ;/gcmsx) { $name = $1 }
# spent 103s making 2547279 calls to main::CORE:match, avg 41µs/call
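To check whether the slowdown reproduces outside NYTProf, a minimal sketch along these lines could time only the begfoo match under each Perl. The helper names make_input and time_begfoo and the synthetic record layout are assumptions made up for illustration; the real 118 MB file is not shown here and may behave differently.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);   # sub-second wall-clock timer (core module)

# Hypothetical stand-in for the real input: this record layout is an
# assumption made up for the sketch, not the format of the 118 MB file.
sub make_input {
    my ($records) = @_;
    my $text = '';
    for my $i (1 .. $records) {
        $text .= "begfoo name$i (a, b, c);\n";
        $text .= "  stmt x$i y$i;\n" for 1 .. 2;
        $text .= "endfoo\n";
    }
    return $text;
}

# Same nibbling loop as parse_foo, but timing only the begfoo match.
sub time_begfoo {
    my ($text) = @_;
    my ($name, $attempts, $seconds) = (undef, 0, 0);
    {
        last if $text =~ /\G \s* \Z/gcmsx;
        my $t0  = time();
        my $hit = $text =~ /\G \s* ^ \s* begfoo \s+ (\S+?) \s* \( \s* (.*?) \s* \) \s* ;/gcmsx;
        $seconds += time() - $t0;
        $attempts++;
        if    ($hit) { $name = $1 }
        elsif ($text =~ /\G \s* ^ \s* endfoo /gcmsx) { }
        elsif ($text =~ /\G \s* ^ \s* \S+ \s+ .*? \s* ;/gcmsx) { }
        else  { die "ERROR: unknown syntax\n" }
        redo;
    }
    return ($name, $attempts, $seconds);
}

my ($name, $attempts, $seconds) = time_begfoo(make_input(2000));
printf "LAST FOO: %s\n", $name;
printf "%d begfoo attempts, %.3fs total, perl %vd\n", $attempts, $seconds, $^V;
```

Running the same script under both interpreters and increasing the record count would show whether the per-call cost of the begfoo match grows with the size of the string, which is what the 1µs versus 41µs averages above hint at.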
Am I unwittingly doing something in my code that has since been deprecated, and is there now a better way? (I think I picked up this parsing/nibbling regex style, with a bare block for looping, from the original Effective Perl Programming.)
Thanks!