in reply to Re (tilly) 1: Research ideas in thread Research ideas
You mean solved by the DFA right? Like egrep?
Sounds like a good idea :)
Ciao,
Gryn
Re (tilly) 3: Research ideas
by tilly (Archbishop) on Feb 20, 2001 at 05:23 UTC
|
No, I meant solved by an NFA.
Today NFAs can solve all problems that DFAs can solve, but
it is possible to by accident write an NFA that will take
several years to finish. My suggestion allows you to
convert an NFA into an equivalent NFA which backtracks at
fewer points. However it remains an NFA, and where an
NFA would find a different match than a DFA, it will find
the match that an NFA would find, and not the one that a
DFA would. (Thereby allowing it to be used in an NFA
engine without changing the behaviour.)
Now if the RE is one that you could use a DFA to solve,
then you can convert the NFA into an NFA with no
backtracking anywhere. However it still retains NFA
characteristics. For instance with alternation it will
prefer to match the first alternative in the sequence, and
won't match the second if that allows a longer overall
match.
Unfortunately optimizing the NFA that far may result in
a combinatorial explosion from n states to n!, but since
the optimization goes in steps, you can just optimize until
you hit some threshold, and then stop. (Or you could run,
profile, incrementally optimize, wash, rinse, and repeat.) | [reply] |
|
Ok's, I've gotchas now tilly.
I understand the difference between NFA's and DFA's, in fact I think
it would be nice if you could have a DFA-preference modifier in Perl on
your regex's so that if they are DFA-compatible, they would use the DFA
instead of an NFA. (or even as egrep does, -pretend- they are DFA
compatible, then check your answer with an NFA).
I'm no wiz with regex's though, but what you propose seems very,
very difficult! At least, it seems that if you told it to do even
something simple like: /vr(o*)*m/ (which if I'm thinking
correctly would cause a normal NFA to go ga-ga), you would be trying to
get the regex engine to realize first that perhaps we should either ignore
the double *, or (as it would be in more complicate situations) reverse
the processing order of the *'s, or even something more 'odd'.
And so my point is, even if you can get the regex engine to realize it
can do something to avert catastrophe, couldn't many of the 'solutions'
that the regex decides upon change which match the NFA chooses? That is,
changed from what the original badly written form asked for? The whole
intent on using an NFA is so you can choose optimized or taylored regex's
that either give you speed or choice in your matches (and sometimes both
:) ). But wouldn't this sort of optimization make which match the NFA
picks sort of, non-obvious? (in a more non-obvious than normal for regex's
way)?
Again, I'm not an expert regex'er, but it seems using a DFA in these
sorts of situations (even if a DFA can't be used directly), would be
similar -- you would avoid the pitfalls of killing the performance, but
you suffer in that maybe you're not getting the match you expected.
Well, I'm probably brain-fried right now, so don't pay any attention
to me :) .
Ciao,
Gryn
p.s. for those that are in the dark the main difference between a NFA
and a DFA is one of performance and control.
A DFA takes almost always the same amount of time to make a match, and
it will always make the same one for equivalent regex's (the longest of
the leftmost (er, or the otherway around, sorry)).
A NFA is the opposite, it may (often) be much fater than a DFA, but it
also may get stuck in a rut and finish sometime after the Sun burns out.
Besides the speed improvement, you often get a control advantage. That
is, for equivalent regex's (ones that would match the same thing, but
don't look exactly the same because of reordering or what-not) you can
control which match you want by rearranging your regex (which can't be
done for DFA's).
| [reply] [d/l] |
|
First of all you are right, the regex
/vr(o*)*m/ makes a naive NFA go gaga into an
infinite loop. In fact I believe that Perl used to break
on that exact case.
Contrary to a naive reading, this is not the same as the
match vr(o*)m, in fact it gives the same
result as /vro*()m/ (that is $1 winds up
matching an empty string before the m).
The solution currently used is to keep track of when you
start following the same logic during a zero-length match
and refuse to go down the same path a second time. So
before the m it will try to match another o, cannot, then
tries to go around the loop and match o*, succeeds, then
tries to go around the loop, detects recursion and fails
back to trying to match the m, and succeeds. (Thereby
finding your empty match.) A naive optimization will
get this wrong.
My proposed optimization has to be done carefully, but
will get it right. What I am suggesting is that you first
figure out all of the possible transitions depending on what
the next character is, in the order an NFA will visit them,
remove the duplicates, and then toss that into a
psuedo-state. In figuring out the possible transitions you
need to get the zero-length recursion right. However if
you do that correctly, then at no point does my optimization
change what match the RE engine will find. It just allows
you to "parallelize" the reasoning process and thereby
eliminate a possible spot to backtrack to.
Now for those who are horribly confused, grab Mastering Regular Expressions.
In a nutshell the basic problem is this. We have a
regular expression. We have a string. A DFA engine
(which stands for "discrete finite automata") essentially
walks the string once, keeping track on each character of
what all of the possible places it could be in the RE match.
This gives good guaranteed performance, but it makes it
hard to figure out what captured parentheses should be,
or to implement backreferences (which need to keep track of
more information than just where you are in the string and
the RE). An NFA engine just does a recursive brute-force
search of possible ways to try to match the string to the
RE. This allows for lots of features (for instance the
difference between .* and .*? is in whether you try to
match another . first or second) but you can get into
cases where there is an exponential explosion in how much
work needs to be done to recognize that a string does not
match. Perl, and most other scripting languages, use NFA
engines. A few use DFAs.
| [reply] [d/l] [select] |
|
|
|
|
A NFA is the opposite, it may (often) be much faster than a DFA,
I assume you mean this in the context of including init times? Excluding the DFA table build phase the DFA should be faster or equivelant to a NFA.
And I would absolutely welcome a DFA engine in addition to the existant NFA engine in perl. Especially if the DFA engine had a way of saving its state table. The possibilities are endless....
Yves / DeMerphq
---
Writing a good benchmark isnt as easy as it might look.
| [reply] |
|
|