Parser Performance Question

songmaster has asked for the wisdom of the Perl Monks concerning the following question:

I have a Perl parser for a technical language (details of which are unimportant here). The parser is handed a text file which is of the order of 400KB or larger, read in as a single scalar (which takes a fraction of a second). The parser puts the string into $_ and then uses a series of constructs like those below:

    if (m/\G menu \s* \( \s* $RXstr \s* \) \s* \{/oxgc) {
        my $name = $1;
        parse_menu($name);
    }
    elsif (m/\G driver \s* \( \s* $RXstr \s* \)/oxgc) {
        my $name = $1;
        parse_driver($name);
    }
[download]

The $RXstr used above is defined as:

our $RXname =  qr/ [a-zA-Z0-9_\-:.\[\]<>;]+ /x;
our $RXhex =   qr/ (?: 0 [xX] [0-9A-Fa-f]+ ) /x;
our $RXoct =   qr/ 0 [0-7]* /x;
our $RXuint =  qr/ [0-9]+ /x;
our $RXint =   qr/ -? $RXuint /ox;
our $RXuintx = qr/ ( $RXhex | $RXoct | $RXuint ) /ox;
our $RXintx =  qr/ ( $RXhex | $RXoct | $RXint ) /ox;
our $RXnum =   qr/ -? (?: [0-9]+ | [0-9]* \. [0-9]+ ) (?: [eE] [-+]? [
+0-9]+ )? /x;
our $RXdqs =   qr/ " (?: [^"] | \\" )* " /x;
our $RXstr =   qr/ ( $RXname | $RXnum | $RXdqs ) /ox;
[download]

The individual parse_menu() and parse_driver() routines called in the first code segment above continue parsing from where the previous match succeeded using similar constructs.

This works fine and performs well on Perl versions up until Perl 5.20. Here are some results from running this program under 3 different versions of Perl, measured on MacOS but the regression has been reported on Debian and Ubuntu:

woz$ perlbrew use 5.18.0
woz$ time perl -CSD registerRecordDeviceDriver.pl softIoc.dbd

real    0m0.461s
user    0m0.380s
sys     0m0.020s
woz$ perlbrew use 5.20.0
woz$ time perl -CSD registerRecordDeviceDriver.pl softIoc.dbd

real    0m14.656s
user    0m13.548s
sys     0m0.075s
woz$ perlbrew use 5.24.1
woz$ time perl -CSD registerRecordDeviceDriver.pl softIoc.dbd

real    0m9.518s
user    0m8.977s
sys     0m0.044s
[download]

Using NYTProf I have profiled the code and the additional time in the later Perl versions is all attributable to the Parser::CORE:match (opcode). It calculates there are 99062 calls to that opcode in that time period for this particular 406KB input file, spread across 9 separate routines in the parser.

This is obviously a bad regression.

Can anyone advise me how to modify my parser code so it performs well on all versions of Perl? There are other programmers on this project who would love to replace the Perl code with Python, which I really don't think we should need to do, but this level of a performance regression is a problem.

Thanks for any advice...

Comment on Parser Performance Question Select or Download Code

Replies are listed 'Best First'.
Re: Parser Performance Question by kcott (Archbishop) on Oct 05, 2017 at 00:48 UTC
G'day songmaster, As LanX alluded to in his update, I suspect the `/o` modifier may be the issue: it certainly stood out as I read through the code fragments you provided. In "perlre: Modifiers" you'll see: "`o - pretend to optimize your code, but actually introduce bugs`" That provides a link to further information in "perlop: Regexp Quote-Like Operators" but the fragment identifier (`#s%2fPATTERN%2fREPLACEMENT%2fmsixpodualngcer`) is wrong. The closest to that is probably #s/_PATTERN_/_REPLACEMENT_/msixpodualngcer; however, the one with the most information about `/o`, and probably more appropriate given the code you've shown, is #m/_PATTERN_/msixpodualngc, which culminates in: "The bottom line is that using `/o` is almost never a good idea." I probably would have created all of those regexes at compile time, and I would have used `my` instead of `our` variables. A dispatch table with actions based on matches may also be appropriate. You don't show sufficient code to make any direct modification recommendations. The following script simply suggests a technique you could adapt to your needs. #!/usr/bin/env perl use strict; use warnings; my %capture; BEGIN { my $RXone = qr{(?x: 1 )}; my $RXthree = qr{(?x: 3 )}; my $RXnum = qr{(?x: $RXone \| $RXthree )}; my $RXstr = qr{(?x: ( [a-z]+ $RXnum ) )}; %capture = ( menu => { regexp => qr{(?x: ^ menu \s+ $RXstr $ )}, action => sub { parse_menu(@_) }, }, driver => { regexp => qr{(?x: ^ driver \s+ $RXstr $ )}, action => sub { parse_driver(@_) }, }, ); } my @capture_keys = keys %capture; while (<DATA>) { for my $capture_key (@capture_keys) { if (/$capture{$capture_key}{regexp}/) { $capture{$capture_key}{action}->($1); last; } } } sub parse_menu { print "MENU: @_\n" } sub parse_driver { print "DRIVER: @_\n" } __DATA__ menu menu1 driver driver1 other other1 menu menu2 driver driver2 other other2 menu menu3 driver driver3 other other3 [download] You may have sufficient, up-front knowledge about those "capture keys" to predefine an ordered `@capture_keys` rather than relying on the random list returned by `keys`. Output from a sample run of that script: `MENU: menu1 DRIVER: driver1 MENU: menu3 DRIVER: driver3` [download] Update (minor code alteration): My original code had `$capure_key` (missing "t") throughout. I've changed that to `$capture_key` globally; retested; output unchanged. — Ken	[reply] [d/l] [select]
Re^2: Parser Performance Question by Anonymous Monk on Oct 05, 2017 at 15:05 UTC
It turns out that `qr//` is actually the worst way to build regexes, but adding `/o` mostly fixes the problem. use Benchmark qw( cmpthese ); print $], "\n"; open my $W, '<', '/usr/share/dict/words' or die; my @words = <$W>; close $W; my $s = '(?<![cC])[eE][iI]'; my $re = qr/(?<![cC])[eE][iI]/; cmpthese(-5, { re => sub { grep /(?<![cC])[eE][iI]/, @words }, qr => sub { grep /$re/, @words }, qro => sub { grep /$re/o, @words }, s => sub { grep /$s/, @words }, so => sub { grep /$s/o, @words }, }); __END__ 5.026001 Rate qr s qro so re qr 7.14/s -- -57% -67% -68% -69% s 16.8/s 135% -- -22% -26% -26% qro 21.4/s 200% 28% -- -5% -6% so 22.5/s 215% 34% 5% -- -1% re 22.7/s 218% 36% 6% 1% -- [download]	[reply] [d/l]
Re^3: Parser Performance Question by LanX (Saint) on Oct 05, 2017 at 15:18 UTC
Update: sorry please ignore, misread benchmark ~~Well maybe you used the worst way to ask the question.~~ `qr` is meant to precompile, so why should it be used in a loop? The OP is building a parser, his grammar doesn't change in the fly. Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!}	[reply]
Re: Parser Performance Question (updated /o) by LanX (Saint) on Oct 04, 2017 at 21:59 UTC
Just some general advice: It's probably an optimization or bug fix having bad side-effects. If I were you I'd try to see if the regex-op-trees stay the same with `use re 'debug';` see `re` for details. You could also have a look into `perldelta 5.20`, to see what changed with 5.20. edit And you should provide a minimal example reproducing the problem, I suppose your benchmark is only reflecting the "real" program we can't probably know. Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!} update A guess: The `/o` modifier is nowadays mostly useless (IIRC)! Try to measure if it's probably causing the problem.	[reply] [d/l] [select]
Re: Parser Performance Question by songmaster (Beadle) on Oct 05, 2017 at 09:02 UTC
Thanks for the replies so far. Taking out all the `/o` flags (which were supposed to speed up regexes back when we actually used Perl 5.6, yes this project is that old although not this code) helps a bit, now under Perl 5.24.1 the timing is: woz$ perlbrew use 5.18.0 woz$ time perl -CSD registerRecordDeviceDriver.pl softIoc.dbd real 0m0.417s user 0m0.377s sys 0m0.020s woz$ perlbrew use 5.24.1 woz$ time perl -CSD registerRecordDeviceDriver.pl softIoc.dbd real 0m7.549s user 0m7.215s sys 0m0.077s So that's another 2 seconds saved, but it still takes 7 seconds longer than it does under Perl 5.18.0. @Ken: I'm using `our` variables because they are actually set in another module. Given that the profiler doesn't show any significant amount of time spent in the (presumably related, but I don't know the internals) `Parser::CORE:regcomp` opcode I don't think pre-compiling the regexp's will make any difference. Looking through the individual regexp profiles again, I now see that there is one for detecting Perl POD which is taking up almost all of that 7 seconds: `if (m/\G ( = [a-zA-Z] .* ) \n/xgc) { $obj->add_pod($1, parsePod()); }` [download] Any ideas why this specific regexp is so slow in Perl >= 5.20? It's probably the only one that uses `.` to match to the end of a line. I tried adding `use re "debug";` and it's outputting lots* of lines like this, which given the reference to an anchored substr "=" is probably the above match: doing 'check' fbm scan, [345261..414818] gave 345300 Found floating substr "%n" at offset 345300 (rx_origin now 345259).. +. doing 'other' fbm scan, [345259..345299] gave -1 Contradicts anchored substr "="; about to retry anchored at offset 3 +45301 (rx_origin now 345299)... doing 'check' fbm scan, [345301..414818] gave 345317 Found floating substr "%n" at offset 345317 (rx_origin now 345299).. +. doing 'other' fbm scan, [345299..345316] gave -1 Contradicts anchored substr "="; about to retry anchored at offset 3 +45318 (rx_origin now 345316)... [download] Further hints on understanding what those messages mean and how to clean this up would be most appreciated. - Andrew	[reply] [d/l] [select]
Re^2: Parser Performance Question (anchors) by LanX (Saint) on Oct 05, 2017 at 09:51 UTC
> there is one for detecting Perl POD which is taking up almost all of that 7 seconds: POD has to start at the beginning of a line, try to anchor your = there. Seems like your regex is backtracking from the end after anchoring at \n, Perl has heuristics to decide if it starts searching from the back or from the start. And if it's not parsing line by line but the whole file, you might end up with an exponential growth by file size (try benchmarking other file sizes) I have to say some of your regexes so far look broken and badly tested, it's not Perl's fault if it's not performing optimal. And PLEASE try to provide a SSCCE including input data to facilitate us helping instead of speculating. Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!}	[reply]
Re^2: Parser Performance Question by dave_the_m (Monsignor) on Oct 06, 2017 at 07:42 UTC
I agree with LanX: we need a short, self-contained example. For example I tried with this, but it doesn't reproduce the problem: `$n = 'x' x 50 . "\n"; $p = "=foo $n"; $np = ($n x 50) . $p; $_= $np x 100_000; 1 while m/\G ( = [a-zA-Z] .* ) \n/xgc;` [download] In fact for me, 5.20 is 3 times faster than 5.18 with that example. Since for 5.20.0 I heavily reworked the part of the regex engine which is giving those debugging messages you show, I'd be very interested to have access to real working examples of where my changes made things go slower rather than faster. Dave.	[reply] [d/l]
Re^3: Parser Performance Question by songmaster (Beadle) on Oct 19, 2017 at 19:23 UTC
Sorry for the delay in responding further to this, and thanks to everyone for their input. The fix I have committed for now was to move the `.` match out into a separate regex from the `= [a-zA-Z]` part and this works okay, but I would prefer something slightly less ugly. Here is some stand-alone code that demonstrates the regression, although it doesn't show quite as dramatic a slow-up as my original: `#!env perl $l = 'x' x 50 . "\n"; $x = $l x 50; $p = "=foo bar\n"; $_= ($x . $p) x 500 . $x; $nx = 0; while (1) { if (m/\G ( = [a-zA-Z] . ) \n/xgc) { $pod .= $1; } elsif (m/\G x+ \n/xgc) { # match xxx lines $nx++; } else { last; } }` [download] My results show this taking 3-4 times as long under 5.20.0 as under 5.18.0: `woz$ perlbrew use 5.18.0 woz$ time perl re.pl real 0m0.035s user 0m0.026s sys 0m0.004s woz$ perlbrew use 5.20.0 woz$ time perl re.pl real 0m0.128s user 0m0.120s sys 0m0.005s` [download] - Andrew	[reply] [d/l] [select]
Re: Parser Performance Question by Anonymous Monk on Oct 04, 2017 at 22:46 UTC
Hi, don't use the /o flag , /o has been useless since about 5.6 when qr replaced /o, the docs really ought to read o - pretend to optimize your code, but actually introduce bugs and perl ought to reject /o outright :)	[reply]
Re: Parser Performance Question by KurtZ (Friar) on Oct 05, 2017 at 12:39 UTC
Would love to replace the Perl code with Python,� Good luck, python has a compatible regexp engine. I don't think it can solve conceptional problems in your code. :)	[reply]
Re^2: Parser Performance Question by QM (Parson) on Oct 20, 2017 at 09:17 UTC
Python's regexes are not as fully featured as Perl's. For instance, `/i` isn't scoped, it's either enabled or disable for the whole match, IIRC. I would be really really surprised if Perl's regex performance was worse than Python's on a fair comparison. -QM -- Quantum Mechanics: The dreams stuff is made of	[reply] [d/l]
Re: Parser Performance Question by Anonymous Monk on Oct 05, 2017 at 13:26 UTC
Almost everyone seems to think that it is best to "start over" in the language that they prefer . . . until they actually start trying to do it!	[reply]
Re: Parser Performance Question by LanX (Saint) on Oct 05, 2017 at 11:44 UTC
Unrelated to your performance question, but this > `qr/ " (?: [^"] \| \\" ) " /x;`* is broken code. Try to guess what it's supposed to do, and then try to test what it really does, and you'll see why testing is important. Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!}	[reply] [d/l]
Re^2: Parser Performance Question by choroba (Cardinal) on Oct 05, 2017 at 11:57 UTC
Please, stop talking in riddles. Do you mean `"a\\"b"`? ($q=q:Sq=~/;[c](.)(.)/;chr(-\|\|-\|5+lengthSq)`"S\|oS2"`map{chr \|+ord }map{substrSq`S_+\|`\|}3E\|-\|`7**2-3:)=~y+S\|`+$1,++print+eval$q,q,a, [download]	[reply] [d/l] [select]
Re^3: Parser Performance Question (updated) by LanX (Saint) on Oct 05, 2017 at 13:48 UTC
> Please, stop talking in riddles. Riddles are an efficient instrument for motivation in didactics. =) For instance a friend of mine spends loads of time in hacker competitions... ;-) > Do you mean "a\\"b"? What do you think? xD Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!} update `DB<103> '"a\\"b"' =~ / " (?: [^"] \| \\" )* " /x ; print $& # + nonsense "a\" DB<104> '"a\\"b"' =~ / " (?: \\" \| [^"] )* " /x ; print $& # + better "a\"b" DB<105> '"a\\"b"' =~ / " (?: \\\\ \| \\" \| [^"] )* " /x ; print $& # + best "a\"b"` [download] Why #nonsense? Because the or-branch will never match, no matter which input.	[reply] [d/l]
Re^4: Parser Performance Question (updated) by Eily (Monsignor) on Oct 06, 2017 at 01:44 UTC
Re^5: Parser Performance Question (updated) by LanX (Saint) on Oct 06, 2017 at 02:08 UTC
Some notes below your chosen depth have not been shown here
Re: Parser Performance Question by ikegami (Patriarch) on Oct 07, 2017 at 01:32 UTC
Nearly every if not every "`*`", "`+`" and "`?`" in your code should be followed by a "`+`" to prevent unnecessary backtracking. This might side-step the issue?	[reply] [d/l] [select]

Parser Performance Question

edit

update

update