Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Parser Performance Question

by songmaster (Beadle)
on Oct 04, 2017 at 21:33 UTC ( [id://1200691]=perlquestion: print w/replies, xml ) Need Help??

songmaster has asked for the wisdom of the Perl Monks concerning the following question:

I have a Perl parser for a technical language (details of which are unimportant here). The parser is handed a text file which is of the order of 400KB or larger, read in as a single scalar (which takes a fraction of a second). The parser puts the string into $_ and then uses a series of constructs like those below:

if (m/\G menu \s* \( \s* $RXstr \s* \) \s* \{/oxgc) { my $name = $1; parse_menu($name); } elsif (m/\G driver \s* \( \s* $RXstr \s* \)/oxgc) { my $name = $1; parse_driver($name); }

The $RXstr used above is defined as:

our $RXname = qr/ [a-zA-Z0-9_\-:.\[\]<>;]+ /x; our $RXhex = qr/ (?: 0 [xX] [0-9A-Fa-f]+ ) /x; our $RXoct = qr/ 0 [0-7]* /x; our $RXuint = qr/ [0-9]+ /x; our $RXint = qr/ -? $RXuint /ox; our $RXuintx = qr/ ( $RXhex | $RXoct | $RXuint ) /ox; our $RXintx = qr/ ( $RXhex | $RXoct | $RXint ) /ox; our $RXnum = qr/ -? (?: [0-9]+ | [0-9]* \. [0-9]+ ) (?: [eE] [-+]? [ +0-9]+ )? /x; our $RXdqs = qr/ " (?: [^"] | \\" )* " /x; our $RXstr = qr/ ( $RXname | $RXnum | $RXdqs ) /ox;

The individual parse_menu() and parse_driver() routines called in the first code segment above continue parsing from where the previous match succeeded using similar constructs.

This works fine and performs well on Perl versions up until Perl 5.20. Here are some results from running this program under 3 different versions of Perl, measured on MacOS but the regression has been reported on Debian and Ubuntu:

woz$ perlbrew use 5.18.0 woz$ time perl -CSD registerRecordDeviceDriver.pl softIoc.dbd real 0m0.461s user 0m0.380s sys 0m0.020s woz$ perlbrew use 5.20.0 woz$ time perl -CSD registerRecordDeviceDriver.pl softIoc.dbd real 0m14.656s user 0m13.548s sys 0m0.075s woz$ perlbrew use 5.24.1 woz$ time perl -CSD registerRecordDeviceDriver.pl softIoc.dbd real 0m9.518s user 0m8.977s sys 0m0.044s

Using NYTProf I have profiled the code and the additional time in the later Perl versions is all attributable to the Parser::CORE:match (opcode). It calculates there are 99062 calls to that opcode in that time period for this particular 406KB input file, spread across 9 separate routines in the parser.

This is obviously a bad regression.

Can anyone advise me how to modify my parser code so it performs well on all versions of Perl? There are other programmers on this project who would love to replace the Perl code with Python, which I really don't think we should need to do, but this level of a performance regression is a problem.

Thanks for any advice...

Replies are listed 'Best First'.
Re: Parser Performance Question
by kcott (Archbishop) on Oct 05, 2017 at 00:48 UTC

    G'day songmaster,

    As LanX alluded to in his update, I suspect the /o modifier may be the issue: it certainly stood out as I read through the code fragments you provided.

    In "perlre: Modifiers" you'll see:

    "o  - pretend to optimize your code, but actually introduce bugs"

    That provides a link to further information in "perlop: Regexp Quote-Like Operators" but the fragment identifier (#s%2fPATTERN%2fREPLACEMENT%2fmsixpodualngcer) is wrong. The closest to that is probably #s/_PATTERN_/_REPLACEMENT_/msixpodualngcer; however, the one with the most information about /o, and probably more appropriate given the code you've shown, is #m/_PATTERN_/msixpodualngc, which culminates in:

    "The bottom line is that using /o is almost never a good idea."

    I probably would have created all of those regexes at compile time, and I would have used my instead of our variables. A dispatch table with actions based on matches may also be appropriate.

    You don't show sufficient code to make any direct modification recommendations. The following script simply suggests a technique you could adapt to your needs.

    #!/usr/bin/env perl use strict; use warnings; my %capture; BEGIN { my $RXone = qr{(?x: 1 )}; my $RXthree = qr{(?x: 3 )}; my $RXnum = qr{(?x: $RXone | $RXthree )}; my $RXstr = qr{(?x: ( [a-z]+ $RXnum ) )}; %capture = ( menu => { regexp => qr{(?x: ^ menu \s+ $RXstr $ )}, action => sub { parse_menu(@_) }, }, driver => { regexp => qr{(?x: ^ driver \s+ $RXstr $ )}, action => sub { parse_driver(@_) }, }, ); } my @capture_keys = keys %capture; while (<DATA>) { for my $capture_key (@capture_keys) { if (/$capture{$capture_key}{regexp}/) { $capture{$capture_key}{action}->($1); last; } } } sub parse_menu { print "MENU: @_\n" } sub parse_driver { print "DRIVER: @_\n" } __DATA__ menu menu1 driver driver1 other other1 menu menu2 driver driver2 other other2 menu menu3 driver driver3 other other3

    You may have sufficient, up-front knowledge about those "capture keys" to predefine an ordered @capture_keys rather than relying on the random list returned by keys.

    Output from a sample run of that script:

    MENU: menu1 DRIVER: driver1 MENU: menu3 DRIVER: driver3

    Update (minor code alteration): My original code had $capure_key (missing "t") throughout. I've changed that to $capture_key globally; retested; output unchanged.

    — Ken

      It turns out that qr// is actually the worst way to build regexes, but adding /o mostly fixes the problem.
      use Benchmark qw( cmpthese ); print $], "\n"; open my $W, '<', '/usr/share/dict/words' or die; my @words = <$W>; close $W; my $s = '(?<![cC])[eE][iI]'; my $re = qr/(?<![cC])[eE][iI]/; cmpthese(-5, { re => sub { grep /(?<![cC])[eE][iI]/, @words }, qr => sub { grep /$re/, @words }, qro => sub { grep /$re/o, @words }, s => sub { grep /$s/, @words }, so => sub { grep /$s/o, @words }, }); __END__ 5.026001 Rate qr s qro so re qr 7.14/s -- -57% -67% -68% -69% s 16.8/s 135% -- -22% -26% -26% qro 21.4/s 200% 28% -- -5% -6% so 22.5/s 215% 34% 5% -- -1% re 22.7/s 218% 36% 6% 1% --
        Update: sorry please ignore, misread benchmark


        Well maybe you used the worst way to ask the question.

        qr is meant to precompile, so why should it be used in a loop?

        The OP is building a parser, his grammar doesn't change in the fly.

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

Re: Parser Performance Question (updated /o)
by LanX (Saint) on Oct 04, 2017 at 21:59 UTC
    Just some general advice:

    It's probably an optimization or bug fix having bad side-effects.

    If I were you I'd try to see if the regex-op-trees stay the same with use re 'debug';

    see re for details.

    You could also have a look into perldelta 5.20, to see what changed with 5.20.

    edit

    And you should provide a minimal example reproducing the problem, I suppose your benchmark is only reflecting the "real" program we can't probably know.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

    update

    A guess: The /o modifier is nowadays mostly useless (IIRC)!

    Try to measure if it's probably causing the problem.

Re: Parser Performance Question
by songmaster (Beadle) on Oct 05, 2017 at 09:02 UTC

    Thanks for the replies so far. Taking out all the /o flags (which were supposed to speed up regexes back when we actually used Perl 5.6, yes this project is that old although not this code) helps a bit, now under Perl 5.24.1 the timing is:

    woz$ perlbrew use 5.18.0
    woz$ time perl -CSD registerRecordDeviceDriver.pl softIoc.dbd
    
    real	0m0.417s
    user	0m0.377s
    sys	0m0.020s
    woz$ perlbrew use 5.24.1
    woz$ time perl -CSD registerRecordDeviceDriver.pl softIoc.dbd
    
    real	0m7.549s
    user	0m7.215s
    sys	0m0.077s

    So that's another 2 seconds saved, but it still takes 7 seconds longer than it does under Perl 5.18.0.

    @Ken: I'm using our variables because they are actually set in another module. Given that the profiler doesn't show any significant amount of time spent in the (presumably related, but I don't know the internals) Parser::CORE:regcomp opcode I don't think pre-compiling the regexp's will make any difference.

    Looking through the individual regexp profiles again, I now see that there is one for detecting Perl POD which is taking up almost all of that 7 seconds:

    if (m/\G ( = [a-zA-Z] .* ) \n/xgc) { $obj->add_pod($1, parsePod()); }

    Any ideas why this specific regexp is so slow in Perl >= 5.20? It's probably the only one that uses .* to match to the end of a line.

    I tried adding use re "debug"; and it's outputting lots of lines like this, which given the reference to an anchored substr "=" is probably the above match:

    doing 'check' fbm scan, [345261..414818] gave 345300 Found floating substr "%n" at offset 345300 (rx_origin now 345259).. +. doing 'other' fbm scan, [345259..345299] gave -1 Contradicts anchored substr "="; about to retry anchored at offset 3 +45301 (rx_origin now 345299)... doing 'check' fbm scan, [345301..414818] gave 345317 Found floating substr "%n" at offset 345317 (rx_origin now 345299).. +. doing 'other' fbm scan, [345299..345316] gave -1 Contradicts anchored substr "="; about to retry anchored at offset 3 +45318 (rx_origin now 345316)...

    Further hints on understanding what those messages mean and how to clean this up would be most appreciated.

    - Andrew
      > there is one for detecting Perl POD which is taking up almost all of that 7 seconds:

      POD has to start at the beginning of a line, try to anchor your = there.

      Seems like your regex is backtracking from the end after anchoring at \n, Perl has heuristics to decide if it starts searching from the back or from the start.

      And if it's not parsing line by line but the whole file, you might end up with an exponential growth by file size (try benchmarking other file sizes)

      I have to say some of your regexes so far look broken and badly tested, it's not Perl's fault if it's not performing optimal.

      And PLEASE try to provide a SSCCE including input data to facilitate us helping instead of speculating.

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

      I agree with LanX: we need a short, self-contained example. For example I tried with this, but it doesn't reproduce the problem:
      $n = 'x' x 50 . "\n"; $p = "=foo $n"; $np = ($n x 50) . $p; $_= $np x 100_000; 1 while m/\G ( = [a-zA-Z] .* ) \n/xgc;
      In fact for me, 5.20 is 3 times faster than 5.18 with that example. Since for 5.20.0 I heavily reworked the part of the regex engine which is giving those debugging messages you show, I'd be very interested to have access to real working examples of where my changes made things go slower rather than faster.

      Dave.

        Sorry for the delay in responding further to this, and thanks to everyone for their input. The fix I have committed for now was to move the .* match out into a separate regex from the = [a-zA-Z] part and this works okay, but I would prefer something slightly less ugly.

        Here is some stand-alone code that demonstrates the regression, although it doesn't show quite as dramatic a slow-up as my original:

        #!env perl $l = 'x' x 50 . "\n"; $x = $l x 50; $p = "=foo bar\n"; $_= ($x . $p) x 500 . $x; $nx = 0; while (1) { if (m/\G ( = [a-zA-Z] .* ) \n/xgc) { $pod .= $1; } elsif (m/\G x+ \n/xgc) { # match xxx lines $nx++; } else { last; } }

        My results show this taking 3-4 times as long under 5.20.0 as under 5.18.0:

        woz$ perlbrew use 5.18.0 woz$ time perl re.pl real 0m0.035s user 0m0.026s sys 0m0.004s woz$ perlbrew use 5.20.0 woz$ time perl re.pl real 0m0.128s user 0m0.120s sys 0m0.005s
        - Andrew
Re: Parser Performance Question
by Anonymous Monk on Oct 04, 2017 at 22:46 UTC
    Hi, don't use the /o flag , /o has been useless since about 5.6 when qr replaced /o, the docs really ought to read o - pretend to optimize your code, but actually introduce bugs and perl ought to reject /o outright :)
Re: Parser Performance Question
by KurtZ (Friar) on Oct 05, 2017 at 12:39 UTC
    Would love to replace the Perl code with Python, 
    Good luck, python has a compatible regexp engine. I don't think it can solve conceptional problems in your code. :)
      Python's regexes are not as fully featured as Perl's. For instance, /i isn't scoped, it's either enabled or disable for the whole match, IIRC.

      I would be really really surprised if Perl's regex performance was worse than Python's on a fair comparison.

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

Re: Parser Performance Question
by Anonymous Monk on Oct 05, 2017 at 13:26 UTC
    Almost everyone seems to think that it is best to "start over" in the language that they prefer . . . until they actually start trying to do it!
Re: Parser Performance Question
by LanX (Saint) on Oct 05, 2017 at 11:44 UTC
    Unrelated to your performance question, but this

    >  qr/ " (?: [^"] | \\" )* " /x;

    is broken code.

    Try to guess what it's supposed to do, and then try to test what it really does, and you'll see why testing is important.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      Please, stop talking in riddles. Do you mean "a\\"b"?
      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
        > Please, stop talking in riddles.

        Riddles are an efficient instrument for motivation in didactics. =)

        For instance a friend of mine spends loads of time in hacker competitions... ;-)

        > Do you mean "a\\"b"?

        What do you think? xD

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

        update

        DB<103> '"a\\"b"' =~ / " (?: [^"] | \\" )* " /x ; print $& # + nonsense "a\" DB<104> '"a\\"b"' =~ / " (?: \\" | [^"] )* " /x ; print $& # + better "a\"b" DB<105> '"a\\"b"' =~ / " (?: \\\\ | \\" | [^"] )* " /x ; print $& # + best "a\"b"

        Why #nonsense? Because the or-branch will never match, no matter which input.

Re: Parser Performance Question
by ikegami (Patriarch) on Oct 07, 2017 at 01:32 UTC

    Nearly every if not every "*", "+" and "?" in your code should be followed by a "+" to prevent unnecessary backtracking. This might side-step the issue?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1200691]
Approved by kcott
Front-paged by haukex
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others contemplating the Monastery: (6)
As of 2024-04-19 13:21 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found