Re^4: Perl regexp matching is slow??

by demerphq (Chancellor)
on Feb 02, 2007 at 14:06 UTC ( id://597941 )


in reply to Re^3: Perl regexp matching is slow??
in thread Perl regexp matching is slow??

If you look at the big graph near the end of the article, you'll see that the nfa.c in the article has cheaper construction cost than all of the fancier implementations, by about a factor of 4. And I wasn't really trying.

Er, am I looking at the right thing? I don't see a graph covering construction time; I see a graph covering total run time, but not one covering construction alone.

I'd be willing to look into hacking your code into the regex engine as an experiment. You'd have to release the code under the PAL, though.

---
$world=~s/war/peace/g

Replies are listed 'Best First'.
Re^5: Perl regexp matching is slow??
by rsc (Acolyte) on Feb 04, 2007 at 02:49 UTC
    Yes, you're looking at the right graph. If you look at the left hand side, that's running a match on a single-byte string, so essentially all the time is in construction. Of course, I suppose that it's using a fairly small (a?a) regex, so it's not as illustrative as I had hoped.

    What I said before is still true -- you're only parsing the regex, which is all that you do for a backtracking implementation too.

    I'd be perfectly happy to rerelease the code under an appropriate license, but I doubt you'd want it -- it's nowhere near parsing Perl's regexes. It doesn't even parse full egrep regexes. It's really only meant to be a proof of concept.

      Of course, I suppose that it's using a fairly small (a?a) regex, so it's not as illustrative as I had hoped.

      Well, the scaling issue I was thinking of would be for /a{10000}/ and things like it. You said it's not necessary to unroll such a construct, but I'm not really seeing how.

      I'd be perfectly happy to rerelease the code under an appropriate license, but I doubt you'd want it

      Thanks, that's really great. I've already started work on beefing it up some to handle the Perl syntax, and I've completed some changes that turn it into a floating matcher, which I include below.

      Regarding whether we want it, of course we do. :-) For our purposes of trying to shoehorn this stuff into the existing engine, it's easier to start with a minimal proof of concept than a full-blown engine. Somebody who wants that should do it as a plug-in engine replacement. Me, I'm more concerned with making the out-of-the-box engine better, and for that a less polished and simpler framework is in many respects easier to deal with.

      I'd really like to be able to swap it in on the fly. Although TimToady pointed out something that gives me some pause, which is that normal Perl alternation is leftmost-first -- it takes the first alternative that can match -- rather than longest-token. That effectively means that without extra syntax (which itself would mean changes to the normal engine as well) we can't swap it in when there is alternation in place. And making the backtracking engine do longest-token would mean trying every alternative, so adding syntax like (?|...|...) for longest-token alternation would mean that if Thompson's algorithm was /not/ swapped in, the resulting regex would be slower. (Of course, if we could detect when the DFA and NFA would match different things then this wouldn't be an issue, so maybe that's the right approach.)
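
      To make the difference concrete: a POSIX-style engine is specified to return the longest leftmost match, while Perl takes the first alternative that succeeds. A tiny C sketch against <regex.h> (assuming a POSIX-conformant regexec) shows the contrast; Perl's "foobar" =~ /foo|foobar/ stops at "foo", while a conformant regexec should report all six characters:

      /* Sketch: leftmost-first vs. longest-match alternation.  Perl's
       * backtracking engine tries the alternatives in order, so
       * "foobar" =~ /foo|foobar/ matches just "foo".  POSIX requires the
       * longest leftmost match, so a conformant regexec() should report
       * offsets 0..6 for the same pattern and string. */
      #include <regex.h>
      #include <stdio.h>

      int main(void)
      {
          regex_t re;
          regmatch_t m;

          if (regcomp(&re, "foo|foobar", REG_EXTENDED) != 0)
              return 1;
          if (regexec(&re, "foobar", 1, &m, 0) == 0)
              printf("matched %d..%d\n", (int)m.rm_so, (int)m.rm_eo);
          regfree(&re);
          return 0;
      }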

      Anyway, I'm still thinking about how to tackle Perl's mixed UTF-8/byte-sequence encoding in a reasonable way. Probably by converting high-bit non-UTF-8 data to UTF-8 and then treating the UTF-8 as a byte stream. This would also facilitate case-insensitive matches. And BTW, unless I have misread it, the Plan 9 engine didn't seem to have very comprehensive Unicode matching support; I wouldn't use it as a basis for discussing Unicode regex matching, as it doesn't seem to support any of the hairy cases. Unicode matching is /not/ as simple as saying "ok, we have a huge character set now"; it also means dealing with combining characters and rules like those required for characters like the German sharp s or the Greek sigma.
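
      For the byte-upgrading step, I'm thinking of something along these lines -- just a sketch; the function name is made up, and 'out' is assumed to have room for twice the input length:

      /* Sketch: expand 8-bit "native" bytes into UTF-8 so the matcher can
       * treat everything as one UTF-8 byte stream.  Bytes below 0x80 pass
       * through; bytes 0x80..0xFF become the usual two-byte sequence. */
      #include <stddef.h>

      static size_t bytes_to_utf8(const unsigned char *in, size_t len,
                                  unsigned char *out)
      {
          size_t j = 0;
          for (size_t i = 0; i < len; i++) {
              if (in[i] < 0x80) {
                  out[j++] = in[i];
              } else {
                  out[j++] = 0xC0 | (in[i] >> 6);    /* 110000xx */
                  out[j++] = 0x80 | (in[i] & 0x3F);  /* 10xxxxxx */
              }
          }
          return j;                                  /* bytes written */
      }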

      Anyway, here is a modified match routine for dfa1.c that does floating matches. Note this requires extending the Dstate struct to know that it has accepted.

      Oh, and a little nit: you can't portably use 'char' to index an array of 256 elements. On some platforms char is signed, so you need to explicitly use unsigned char when doing such indexing.

      ---
      $world=~s/war/peace/g

        I changed the files posted on my web site to allow redistribution under the MIT license. Feel free to move the terms back to the top of the file if you prefer; I like being able to see the source code when I open a file instead of a license, but I know I'm in the minority.

        Well, the scaling issue I was thinking of would be for /a{10000}/ and things like it. You said it's not necessary to unroll such a construct, but I'm not really seeing how.

        I don't remember saying that. If I did, I misspoke. In general, you can't do any better than to unroll the repetition: a{10000} is going to result in a 10000-state NFA.
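
        For what it's worth, the reason is just counting: a finite automaton has no counter, so the only way to insist on exactly N a's is N states in a row. A stripped-down sketch (the State struct here is a stand-in, not the article's nfa.c one) of what compiling a{N} has to produce:

        /* Build a{n} as a chain of n one-character states plus an accepting
         * state -- n+1 allocations, which is where the construction cost of
         * /a{10000}/ goes. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        typedef struct State State;
        struct State {
            int c;          /* byte this state consumes; 0 marks the accept state */
            State *out;     /* next state in the chain */
        };

        static State *build_a_repeat(int n)
        {
            State *next = calloc(1, sizeof *next);   /* c == 0: accepting */
            for (int i = 0; i < n; i++) {
                State *s = malloc(sizeof *s);
                s->c = 'a';
                s->out = next;
                next = s;
            }
            return next;
        }

        /* Full-string match: every state must consume its byte, and the
         * accept state must coincide with the end of the input. */
        static int run(State *st, const char *s)
        {
            for (; st->c != 0; s++, st = st->out)
                if (*s != st->c)
                    return 0;
            return *s == '\0';
        }

        int main(void)
        {
            State *nfa = build_a_repeat(10000);
            static char buf[10002];
            memset(buf, 'a', 10000);                 /* buf holds 10000 a's */
            printf("%d %d\n", run(nfa, buf), run(nfa, buf + 1));  /* prints "1 0" */
            return 0;
        }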

        Anyway, I'm still thinking about how to tackle Perl's mixed UTF-8/byte-sequence encoding in a reasonable way. Probably by converting high-bit non-UTF-8 data to UTF-8 and then treating the UTF-8 as a byte stream. This would also facilitate case-insensitive matches. And BTW, unless I have misread it, the Plan 9 engine didn't seem to have very comprehensive Unicode matching support; I wouldn't use it as a basis for discussing Unicode regex matching, as it doesn't seem to support any of the hairy cases. Unicode matching is /not/ as simple as saying "ok, we have a huge character set now"; it also means dealing with combining characters and rules like those required for characters like the German sharp s or the Greek sigma.

        I mentioned the Plan 9 regexp library in connection with Unicode mainly because it was the first to support Unicode and UTF-8, and because it does so fairly simply. True, it does not implement things like the German sharp s rules. You could easily change the loop in regexec.c to take the next UTF-8 sequence if one is there or else take a single byte, if that's what Perl does. I don't know what Perl's semantics here are.
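
        Something like this for the decoding step -- a minimal standalone sketch, not the Plan 9 chartorune; overlong forms and out-of-range values aren't rejected, to keep it short:

        /* Decode a well-formed UTF-8 sequence starting at p if there is one,
         * otherwise consume a single byte as the code point.  Returns the
         * number of bytes consumed and stores the value in *cp. */
        #include <stdio.h>

        static int nextchar(const unsigned char *p, const unsigned char *end,
                            unsigned int *cp)
        {
            unsigned int c = p[0];
            int len, i;

            if (c < 0x80)                { *cp = c;    return 1; }
            else if ((c & 0xE0) == 0xC0) { len = 2; c &= 0x1F; }
            else if ((c & 0xF0) == 0xE0) { len = 3; c &= 0x0F; }
            else if ((c & 0xF8) == 0xF0) { len = 4; c &= 0x07; }
            else                         { *cp = p[0]; return 1; }  /* stray byte */

            if (end - p < len)           { *cp = p[0]; return 1; }  /* truncated */
            for (i = 1; i < len; i++) {
                if ((p[i] & 0xC0) != 0x80) { *cp = p[0]; return 1; } /* bad continuation */
                c = (c << 6) | (p[i] & 0x3F);
            }
            *cp = c;
            return len;
        }

        int main(void)
        {
            /* e-acute as UTF-8, then a stray 0xFF byte, then plain 'a' */
            const unsigned char s[] = { 0xC3, 0xA9, 0xFF, 'a' };
            const unsigned char *p = s, *end = s + sizeof s;
            unsigned int cp;

            while (p < end) {
                int n = nextchar(p, end, &cp);
                printf("U+%04X (%d byte%s)\n", cp, n, n == 1 ? "" : "s");
                p += n;
            }
            return 0;
        }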

        Anyway, here is a modified match routine for dfa1.c that does floating matches.

        It is not possible to find the boundaries of an unanchored match in a single pass with a fixed amount of storage (like, say, a single start pointer and a state pointer). If the underlying NFA has m states, then in general you need to track m possible start pointers. If the code you posted searches for aaab in the text aaaab, it will not find a match: it will read the aaaa and then start over at b.
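
        A concrete way to see the fix is to carry a start offset with every live state instead of one shared pointer. Here is a self-contained sketch for the literal pattern aaab (state s meaning "s pattern characters matched", state 4 accepting); it finds the match in aaaab that the single-pointer version misses:

        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
            const char *pat = "aaab", *text = "aaaab";
            enum { NSTATE = 5 };                /* states 0..4; state 4 accepts */
            /* starts[s]: leftmost text index at which some thread now in state s
             * began, or -1 if no thread is in state s. */
            int starts[NSTATE], next[NSTATE], n = (int)strlen(text);

            for (int s = 0; s < NSTATE; s++)
                starts[s] = -1;
            for (int i = 0; i <= n; i++) {
                if (starts[4] != -1) {          /* some thread has accepted */
                    printf("match at [%d,%d)\n", starts[4], i);
                    return 0;
                }
                if (i == n)
                    break;
                starts[0] = i;                  /* inject a thread that starts here */
                for (int s = 0; s < NSTATE; s++)
                    next[s] = -1;
                for (int s = 0; s + 1 < NSTATE; s++) {
                    if (starts[s] != -1 && text[i] == pat[s]) {
                        /* for a general NFA, keep the leftmost start on collisions */
                        if (next[s + 1] == -1 || starts[s] < next[s + 1])
                            next[s + 1] = starts[s];
                    }
                }
                memcpy(starts, next, sizeof starts);
            }
            printf("no match\n");
            return 0;
        }

        It prints "match at [1,5)", i.e. the aaab starting at offset 1.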

        Oh, and a little nit: you can't portably use 'char' to index an array of 256 elements. On some platforms char is signed, so you need to explicitly use unsigned char when doing such indexing.

        True, but you can portably index into it if you use char & 0xFF, which is what the code does.
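
        For anyone following along, the two portable spellings side by side (a trivial sketch; the variable names are just for illustration):

        /* Plain 'char' may be signed, so table[c] can index with a negative
         * value for any byte >= 0x80.  Cast or mask before indexing. */
        #include <stdio.h>

        int main(void)
        {
            static int table[256];
            char c = (char)0xE9;             /* e.g. a Latin-1 e-acute byte */

            table[(unsigned char)c] = 1;     /* portable: explicit cast */
            table[c & 0xFF] += 1;            /* portable: mask (what the code does) */
            /* table[c] would misbehave wherever char is signed */
            printf("%d\n", table[0xE9]);     /* prints 2 */
            return 0;
        }
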
      Yes, you're looking at the right graph. If you look at the left hand side, that's running a match on a single-byte string, so essentially all the time is in construction.

      Unless I missed something looking at the timing scripts (.tar.gz), you're comparing the run time of entire programs -- e.g. perl/ruby/etc. programs -- against your compiled C code. That doesn't seem like a useful comparison of regexp algorithms.

      Are you really measuring the load time of Perl, the compilation time of the perl regexp test program, and its execution, and comparing that to the run time of your NFA code, which is just a couple hundred lines of C?

      The other test programs all seem to be interpreted or to call sh to execute a command-line tool, whereas your README suggests that the comparable nfa test program is a compiled C executable being run directly.

      -xdg

      Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

        You're right. I am timing entire programs.

        But I am timing one program running enough iterations of the match to take 10 seconds' worth of CPU time, up to a max of 10,000 iterations. If you assume Perl starts in less than 10 ms (it loads in well under that on my machine), then the extra time added by startup is no more than 1 microsecond in the final numbers. Also, the point was really the difference in growth rates: the 60+ seconds for backtracking isn't being spent during program load. (And PCRE is a C program too.)
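
        For what it's worth, the same point can be checked by timing only the loop, so startup never enters the numbers at all; a rough sketch (strstr is just a stand-in here for whichever match routine is being compared):

        #include <stdio.h>
        #include <string.h>
        #include <time.h>

        int main(void)
        {
            const char *text = "aaaaaaaaaaaaaaaaaaab";
            const char *volatile sink;       /* keep the calls from being optimized away */
            long iters = 10000;

            clock_t t0 = clock();
            for (long i = 0; i < iters; i++)
                sink = strstr(text, "aab");  /* stand-in for the match under test */
            clock_t t1 = clock();
            (void)sink;

            printf("%.3f microseconds/iteration\n",
                   (double)(t1 - t0) * 1e6 / CLOCKS_PER_SEC / iters);
            return 0;
        }
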
      Please do release it (and I think we all understand what proof-of-concept code sometimes looks like, so don't be embarrassed). There's a good possibility someone else will take it and run with it.
