Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW

Re: Inexplicably slow regex

by GrandFather (Sage)
on Sep 12, 2006 at 21:36 UTC ( #572633=note: print w/replies, xml ) Need Help??

in reply to Inexplicably slow regex

A simple benchmark which shows somethng like OP's result is:

use strict; use warnings; use Benchmark qw(cmpthese); my $key = 'x'; my $str = ("\n The quick brown frog jumped over the lazy dog" x 10000 +) . "\n x"; cmpthese (-10, { sol => \&sol, nl => \&nl, lb => \&lb, } ); sub sol {$str =~ /^ \s* $key /mx;} sub nl {$str =~ /\n \s* $key /mx;} sub lb {$str =~ /(?<![^\n]) \s* $key /mx;} Rate sol lb nl sol 0.316/s -- -99% -100% lb 35.0/s 10997% -- -92% nl 441/s 139470% 1158% --

DWIM is Perl's answer to Gödel

Replies are listed 'Best First'.
Re^2: Inexplicably slow regex
by duckyd (Hermit) on Sep 12, 2006 at 23:37 UTC
    FWIW, I see the opposite behavior when $key = 'lazy'; (I.E. when $key matches often):
    Rate lb nl sol lb 10.3/s -- -91% -91% nl 109/s 965% -- -9% sol 120/s 1066% 9% --
Re^2: Inexplicably slow regex
by zigdon (Deacon) on Sep 13, 2006 at 15:51 UTC

    I think the lookback is what is killing you here. The first two REs are anchored with a constant (^, \n), so they severely limit the possible starting point. The third has to look at every character in your huge file before it can decide if it's a valid starting point (if I understand how the perl RE engine works).

    So in the benchmark above, the first two have to try matching about 10,000 times (since there are 10,000 valid starting points), while the last one has to consider 480,000 possible REs (since every character could be valid).

    (For the interested, try looking at the output of -Mre=debug on each of these - it was quite interesting. Thanks to the kind monk on the CB that reminded me of it!)

    -- zigdon

      I think the lookback is what is killing you here.

      But the question is why is the "sol" version so much slower than even the "lb" version.

      I suspect it all boils down to optimization. All three cases could, in theory, anchor to "\n" characters but, in practice, the optimizer may not be smart enough to realize this.

      My interpretation (aka "wild guess") of GrandFather's numbers:

      Rate sol lb nl sol 0.316/s -- -99% -100% lb 35.0/s 10997% -- -92% nl 441/s 139470% 1158% --

      is that the "lb" regex is a bit more complex and so runs a bit slower while the "sol" regex runs so much slower that I'd expect it to be the one which is hitting way too many possible starting points rather than jumping to key spots such as "\n" (even more speed from Boyer-Moore probably doesn't apply here since I don't think any of these regexes are simple enough).

      Yes, I've been hoping someone would use -Mre=debug and summarize what it reported on a system that saw the "sol" regex being especially slow. I think that is the most likely route at explaining the "problem". Then it would be interesting to compare that against what it reports for systems that don't see "sol" being so slow.

      - tye        

        Hmm. Comparing the sol to nl (since they seem that they should behave very similarly) shows this in the re=debug dump:

        Unless I'm misreading it, it looks that the sol version has to rescan the string each time for a potential start point, and for the location of the 'x'. The nl version doesn't seem to need to redo all that work every round. My (wild) guess is that the ANYOF class that is implied by the '^/m' makes the engine think it cannot keep as much state between matches.

        But this is really getting much deeper into the guts of perl than I'm really feeling comfortable guessing on.Maybe someone who's more familiar under the hood would be able to comment more?

        As an aside, if you study $str before starting (takes a split second), the benchmark changes significantly, and seems much less insane:

        Rate lb sol nl lb 162/s -- -87% -93% sol 1257/s 674% -- -45% nl 2272/s 1300% 81% --

        -- zigdon

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://572633]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (6)
As of 2020-08-07 10:18 GMT
Find Nodes?
    Voting Booth?
    Which rocket would you take to Mars?

    Results (44 votes). Check out past polls.