First of all you are right, the regex
makes a naive NFA go gaga into an
infinite loop. In fact I believe that Perl used to break
on that exact case.
Contrary to a naive reading, this is not the same as the
match vr(o*)m, in fact it gives the same
result as /vro*()m/ (that is $1 winds up
matching an empty string before the m).
The solution currently used is to keep track of when you
start following the same logic during a zero-length match
and refuse to go down the same path a second time. So
before the m it will try to match another o, cannot, then
tries to go around the loop and match o*, succeeds, then
tries to go around the loop, detects recursion and fails
back to trying to match the m, and succeeds. (Thereby
finding your empty match.) A naive optimization will
get this wrong.
My proposed optimization has to be done carefully, but
will get it right. What I am suggesting is that you first
figure out all of the possible transitions depending on what
the next character is, in the order an NFA will visit them,
remove the duplicates, and then toss that into a
psuedo-state. In figuring out the possible transitions you
need to get the zero-length recursion right. However if
you do that correctly, then at no point does my optimization
change what match the RE engine will find. It just allows
you to "parallelize" the reasoning process and thereby
eliminate a possible spot to backtrack to.
Now for those who are horribly confused, grab Mastering Regular Expressions.
In a nutshell the basic problem is this. We have a
regular expression. We have a string. A DFA engine
(which stands for "discrete finite automata") essentially
walks the string once, keeping track on each character of
what all of the possible places it could be in the RE match.
This gives good guaranteed performance, but it makes it
hard to figure out what captured parentheses should be,
or to implement backreferences (which need to keep track of
more information than just where you are in the string and
the RE). An NFA engine just does a recursive brute-force
search of possible ways to try to match the string to the
RE. This allows for lots of features (for instance the
difference between .* and .*? is in whether you try to
match another . first or second) but you can get into
cases where there is an exponential explosion in how much
work needs to be done to recognize that a string does not
match. Perl, and most other scripting languages, use NFA
engines. A few use DFAs.