Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

Re: Parser Performance Question

by songmaster (Beadle)
on Oct 05, 2017 at 09:02 UTC ( [id://1200717]=note: print w/replies, xml ) Need Help??


in reply to Parser Performance Question

Thanks for the replies so far. Taking out all the /o flags (which were supposed to speed up regexes back when we actually used Perl 5.6, yes this project is that old although not this code) helps a bit, now under Perl 5.24.1 the timing is:

woz$ perlbrew use 5.18.0
woz$ time perl -CSD registerRecordDeviceDriver.pl softIoc.dbd

real	0m0.417s
user	0m0.377s
sys	0m0.020s
woz$ perlbrew use 5.24.1
woz$ time perl -CSD registerRecordDeviceDriver.pl softIoc.dbd

real	0m7.549s
user	0m7.215s
sys	0m0.077s

So that's another 2 seconds saved, but it still takes 7 seconds longer than it does under Perl 5.18.0.

@Ken: I'm using our variables because they are actually set in another module. Given that the profiler doesn't show any significant amount of time spent in the (presumably related, but I don't know the internals) Parser::CORE:regcomp opcode I don't think pre-compiling the regexp's will make any difference.

Looking through the individual regexp profiles again, I now see that there is one for detecting Perl POD which is taking up almost all of that 7 seconds:

if (m/\G ( = [a-zA-Z] .* ) \n/xgc) { $obj->add_pod($1, parsePod()); }

Any ideas why this specific regexp is so slow in Perl >= 5.20? It's probably the only one that uses .* to match to the end of a line.

I tried adding use re "debug"; and it's outputting lots of lines like this, which given the reference to an anchored substr "=" is probably the above match:

doing 'check' fbm scan, [345261..414818] gave 345300 Found floating substr "%n" at offset 345300 (rx_origin now 345259).. +. doing 'other' fbm scan, [345259..345299] gave -1 Contradicts anchored substr "="; about to retry anchored at offset 3 +45301 (rx_origin now 345299)... doing 'check' fbm scan, [345301..414818] gave 345317 Found floating substr "%n" at offset 345317 (rx_origin now 345299).. +. doing 'other' fbm scan, [345299..345316] gave -1 Contradicts anchored substr "="; about to retry anchored at offset 3 +45318 (rx_origin now 345316)...

Further hints on understanding what those messages mean and how to clean this up would be most appreciated.

- Andrew

Replies are listed 'Best First'.
Re^2: Parser Performance Question (anchors)
by LanX (Saint) on Oct 05, 2017 at 09:51 UTC
    > there is one for detecting Perl POD which is taking up almost all of that 7 seconds:

    POD has to start at the beginning of a line, try to anchor your = there.

    Seems like your regex is backtracking from the end after anchoring at \n, Perl has heuristics to decide if it starts searching from the back or from the start.

    And if it's not parsing line by line but the whole file, you might end up with an exponential growth by file size (try benchmarking other file sizes)

    I have to say some of your regexes so far look broken and badly tested, it's not Perl's fault if it's not performing optimal.

    And PLEASE try to provide a SSCCE including input data to facilitate us helping instead of speculating.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

Re^2: Parser Performance Question
by dave_the_m (Monsignor) on Oct 06, 2017 at 07:42 UTC
    I agree with LanX: we need a short, self-contained example. For example I tried with this, but it doesn't reproduce the problem:
    $n = 'x' x 50 . "\n"; $p = "=foo $n"; $np = ($n x 50) . $p; $_= $np x 100_000; 1 while m/\G ( = [a-zA-Z] .* ) \n/xgc;
    In fact for me, 5.20 is 3 times faster than 5.18 with that example. Since for 5.20.0 I heavily reworked the part of the regex engine which is giving those debugging messages you show, I'd be very interested to have access to real working examples of where my changes made things go slower rather than faster.

    Dave.

      Sorry for the delay in responding further to this, and thanks to everyone for their input. The fix I have committed for now was to move the .* match out into a separate regex from the = [a-zA-Z] part and this works okay, but I would prefer something slightly less ugly.

      Here is some stand-alone code that demonstrates the regression, although it doesn't show quite as dramatic a slow-up as my original:

      #!env perl $l = 'x' x 50 . "\n"; $x = $l x 50; $p = "=foo bar\n"; $_= ($x . $p) x 500 . $x; $nx = 0; while (1) { if (m/\G ( = [a-zA-Z] .* ) \n/xgc) { $pod .= $1; } elsif (m/\G x+ \n/xgc) { # match xxx lines $nx++; } else { last; } }

      My results show this taking 3-4 times as long under 5.20.0 as under 5.18.0:

      woz$ perlbrew use 5.18.0 woz$ time perl re.pl real 0m0.035s user 0m0.026s sys 0m0.004s woz$ perlbrew use 5.20.0 woz$ time perl re.pl real 0m0.128s user 0m0.120s sys 0m0.005s
      - Andrew

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1200717]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having an uproarious good time at the Monastery: (4)
As of 2024-04-25 07:06 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found