Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Re (tilly) 1: Research ideas

by tilly (Archbishop)
on Feb 20, 2001 at 04:49 UTC ( [id://59549]=note: print w/replies, xml ) Need Help??


in reply to Research ideas

Well I had an idea for how to optimize NFA RE engines that I think has potential, but I am not planning on doing any time soon. OK, it isn't directly Perl, but it does involve parsing and if it works well it would be appropriate for including in the RE engines of languages like Perl and Python. Take a look at this proposed optimization, and see if trying to implement and do basic tests on that interests you.

If it does, feel free to contact me privately. (Not that I have done anything with it other than some preliminary analysis...)

For the record the point is that, at the cost of extra work at compilation, it should allow any RE which can be solved by a DFA to be solved by an NFA without the exponential backtracking disasters that are traditionally problems for NFA engines.

Replies are listed 'Best First'.
Re: Re (tilly) 1: Research ideas
by gryng (Hermit) on Feb 20, 2001 at 05:05 UTC
    You mean solved by the DFA right? Like egrep?

    Sounds like a good idea :)

    Ciao,
    Gryn

      No, I meant solved by an NFA.

      Today NFAs can solve all problems that DFAs can solve, but it is possible to by accident write an NFA that will take several years to finish. My suggestion allows you to convert an NFA into an equivalent NFA which backtracks at fewer points. However it remains an NFA, and where an NFA would find a different match than a DFA, it will find the match that an NFA would find, and not the one that a DFA would. (Thereby allowing it to be used in an NFA engine without changing the behaviour.)

      Now if the RE is one that you could use a DFA to solve, then you can convert the NFA into an NFA with no backtracking anywhere. However it still retains NFA characteristics. For instance with alternation it will prefer to match the first alternative in the sequence, and won't match the second if that allows a longer overall match.

      Unfortunately optimizing the NFA that far may result in a combinatorial explosion from n states to n!, but since the optimization goes in steps, you can just optimize until you hit some threshold, and then stop. (Or you could run, profile, incrementally optimize, wash, rinse, and repeat.)

        Ok's, I've gotchas now tilly.

        I understand the difference between NFA's and DFA's, in fact I think it would be nice if you could have a DFA-preference modifier in Perl on your regex's so that if they are DFA-compatible, they would use the DFA instead of an NFA. (or even as egrep does, -pretend- they are DFA compatible, then check your answer with an NFA).

        I'm no wiz with regex's though, but what you propose seems very, very difficult! At least, it seems that if you told it to do even something simple like: /vr(o*)*m/ (which if I'm thinking correctly would cause a normal NFA to go ga-ga), you would be trying to get the regex engine to realize first that perhaps we should either ignore the double *, or (as it would be in more complicate situations) reverse the processing order of the *'s, or even something more 'odd'.

        And so my point is, even if you can get the regex engine to realize it can do something to avert catastrophe, couldn't many of the 'solutions' that the regex decides upon change which match the NFA chooses? That is, changed from what the original badly written form asked for? The whole intent on using an NFA is so you can choose optimized or taylored regex's that either give you speed or choice in your matches (and sometimes both :) ). But wouldn't this sort of optimization make which match the NFA picks sort of, non-obvious? (in a more non-obvious than normal for regex's way)?

        Again, I'm not an expert regex'er, but it seems using a DFA in these sorts of situations (even if a DFA can't be used directly), would be similar -- you would avoid the pitfalls of killing the performance, but you suffer in that maybe you're not getting the match you expected.

        Well, I'm probably brain-fried right now, so don't pay any attention to me :) .

        Ciao,
        Gryn

        p.s. for those that are in the dark the main difference between a NFA and a DFA is one of performance and control.

        A DFA takes almost always the same amount of time to make a match, and it will always make the same one for equivalent regex's (the longest of the leftmost (er, or the otherway around, sorry)).

        A NFA is the opposite, it may (often) be much fater than a DFA, but it also may get stuck in a rut and finish sometime after the Sun burns out. Besides the speed improvement, you often get a control advantage. That is, for equivalent regex's (ones that would match the same thing, but don't look exactly the same because of reordering or what-not) you can control which match you want by rearranging your regex (which can't be done for DFA's).

Re: Re (tilly) 1: Research ideas
by Anonymous Monk on Mar 21, 2002 at 05:24 UTC
    Are you sure you know the theory?
      I'm positive that I know the theory, and that my suggestion will work. Figuring things like this out is what I studied math for. :-)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://59549]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others admiring the Monastery: (3)
As of 2024-03-19 07:48 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found