PerlMonks  

Re^2: Inexplicably slow regex

by zigdon (Deacon)
on Sep 13, 2006 at 15:51 UTC (#572765)


in reply to Re: Inexplicably slow regex
in thread Inexplicably slow regex

I think the lookback is what is killing you here. The first two REs are anchored with a constant (^, \n), so they severely limit the possible starting points. The third has to look at every character in your huge file before it can decide whether it's a valid starting point (if I understand how the Perl RE engine works).

So in the benchmark above, the first two have to try matching about 10,000 times (since there are 10,000 valid starting points), while the last one has to consider 480,000 possible starting points (since every character could be one).
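
The arithmetic behind those two figures can be checked with a minimal sketch. The test string is an assumption: 10,000 lines of 48 characters each (newline included), chosen only because it reproduces the 10,000 and 480,000 figures. The real benchmark data is not shown in this node.

```perl
use strict;
use warnings;

# Assumed stand-in for the benchmark data: 10,000 lines of 48 characters
# (47 letters plus the newline). The real test string is not shown here.
my $str = join '', map { ('a' x 47) . "\n" } 1 .. 10_000;

# Every character is a position an unanchored match may have to try.
my $chars = length $str;

# Each newline marks one line boundary, i.e. one anchored candidate.
my $newlines = ($str =~ tr/\n//);

print "characters:      $chars\n";
print "line boundaries: $newlines\n";
```

With these assumed dimensions, the unanchored case has 48 times as many candidate positions as the anchored one.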

(For the interested, try looking at the output of -Mre=debug on each of these - it was quite interesting. Thanks to the kind monk on the CB that reminded me of it!)
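
For anyone who wants to try this, here is a minimal sketch. `use re 'debug'` is the in-script form of the `-Mre=debug` switch; it writes the compiled regex program and the match trace to STDERR. The tiny string and the pattern `/^x/m` are placeholders, not the regexes from this thread.

```perl
use strict;
use warnings;

# Placeholder data; substitute the real string and patterns under test.
my $str = "abc\nxyz\n";
my $hit;

{
    # Lexically scoped: only regexes inside this block are traced.
    use re 'debug';
    $hit = $str =~ /^x/m;
}

print $hit ? "matched\n" : "no match\n";
```

Redirecting STDERR to a file per regex (for example `2> sol.txt`) makes the dumps easier to compare side by side.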

-- zigdon

Replies are listed 'Best First'.
Re^3: Inexplicably slow regex (optimizer)
by tye (Sage) on Sep 13, 2006 at 18:10 UTC
    I think the lookback is what is killing you here.

    But the question is why is the "sol" version so much slower than even the "lb" version.

    I suspect it all boils down to optimization. All three cases could, in theory, anchor to "\n" characters but, in practice, the optimizer may not be smart enough to realize this.

    My interpretation (aka "wild guess") of GrandFather's numbers:

           Rate     sol    lb    nl
    sol 0.316/s      --  -99% -100%
    lb   35.0/s  10997%    --  -92%
    nl    441/s 139470% 1158%    --

    is that the "lb" regex is a bit more complex and so runs a bit slower, while the "sol" regex runs so much slower that I'd expect it to be the one that is trying far too many possible starting points rather than jumping to key spots such as "\n". (The extra speed of a Boyer-Moore search probably doesn't apply here, since I don't think any of these regexes are simple enough.)

    Yes, I've been hoping someone would use -Mre=debug and summarize what it reported on a system that saw the "sol" regex being especially slow. I think that is the most likely route to explaining the "problem". Then it would be interesting to compare that against what it reports on systems that don't see "sol" being so slow.

    - tye        

      Hmm. Comparing sol to nl (since it seems they should behave very similarly) shows this in the re=debug dump:

      Unless I'm misreading it, it looks like the sol version has to rescan the string each time for a potential start point, and for the location of the 'x'. The nl version doesn't seem to need to redo all that work every round. My (wild) guess is that the ANYOF class that is implied by the '^/m' makes the engine think it cannot keep as much state between matches.

      But this is getting much deeper into the guts of perl than I'm comfortable guessing about. Maybe someone who's more familiar with what goes on under the hood would be able to comment more?

      As an aside, if you study $str before starting (takes a split second), the benchmark changes significantly, and seems much less insane:

            Rate    lb  sol   nl
      lb   162/s    -- -87% -93%
      sol 1257/s  674%   -- -45%
      nl  2272/s 1300%  81%   --

      -- zigdon
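
The study comparison described above could be re-run with something like the sketch below. The test string and all three patterns are hypothetical stand-ins named after the "sol", "lb" and "nl" labels; the actual regexes and data are not shown in this node. Note also that `study` has been a no-op since Perl 5.16, so it can only change the timings on an older perl.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed test data: 10,000 lines, each 46 'a's, an 'x', and a newline.
my $str = join '', map { ('a' x 46) . "x\n" } 1 .. 10_000;

# Hypothetical stand-ins for the three regexes; each sub returns its match count.
my %tests = (
    sol => sub { my $n = () = $str =~ /^a+x/mg;      $n },  # start of line via ^ and /m
    lb  => sub { my $n = () = $str =~ /(?<=\n)a+x/g; $n },  # lookbehind for a newline
    nl  => sub { my $n = () = $str =~ /\na+x/g;      $n },  # literal newline
);

# study is a core builtin; since Perl 5.16 it does nothing, so it only
# affects the timings on an older perl. Comment it out for the baseline run.
study $str;

# Run each sub for at least 1 CPU second and print the comparison table.
cmpthese(-1, \%tests);
```

Running it once with the `study` line and once without gives two tables in the same format as those quoted in the thread.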
