Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

RE: Regular Expression Optimizer

by BlaisePascal (Monk)
on Aug 31, 2000 at 01:23 UTC ( [id://30411]=note: print w/replies, xml ) Need Help??


in reply to Regular Expression Optimizer

First off, I would guess that for any regular language, there may be several "optimal" REs that recognise it, so if R is recognised by r1 and r2, the "optimised" r1' and r2' are not necessarily identical.

Second... Perl regexes are NP-Complete. Based on that, it is my -hunch- that optimization of Perl regexes is also hard. Note that Perl regexes are not regular.

Third (or, actually zeroth)... What do you mean by "optimal"? Shortest? Fastest? Easiest to understand?

Replies are listed 'Best First'.
RE (tilly) 2: Regular Expression Optimizer
by tilly (Archbishop) on Aug 31, 2000 at 02:09 UTC
    Actually I think it is NP-Hard. Determining that you have a match is polynomial. Determining that you have the match that you are looking for (which is the difference between .* and .*?) is another story entirely... :-)
      You may be right... I know that it's been shown that NP-Complete problems can be reduced (in P time) to a Perl regex match. But I don't think the proofs I've seen looked at .* vs. .*?. The constructions used backreferences.
RE: RE: Regular Expression Optimizer
by PsychoSpunk (Hermit) on Aug 31, 2000 at 02:59 UTC
    To address the zeroth point, if it were possible to have such a machine then optimal would be fastest. The idea is that I can write an easy to understand regex that isn't fast, and likewise I can write a short regex that isn't fast, but I want a regex that is fast whether or not it sacrifices length or understandability.

    For the first comment, I can see both sides equally. If the two suboptimal regexes are equivalent, then it should find two optimal regexes for each, since I figure this is a machine where I put one input in and get one output out. But if these two inputs are equivalent, then there should be a single "most optimal" regex and if there is a difference in the results from putting r1 and r2, then either one isn't fully optimized since there is a single "most optimal", or my argument has a flaw since I assume that there exists a "most optimal" state.

    Admittedly, some time has passed since I was knee deep in language theory, so I may have missed some very critical points that cause a regex optimizer to fail. So, here's a question: If heuristics are employed in the development of this machine (and mind you at this point, the REO is going to remain purely theoretical), does it doom the optimization from the point at which a heuristic is employed? Does this machine require that it can only operate on the input regex with algorithmic manipulations?

    And finally, for the second point, that just rules out the ability to do the problem within a certain context. ALL HAIL BRAK!!!

      I'm beginning to think that "fastest" may be an illusionary goal.

      The execution time of a RE is dependant on both the RE and on the string it is working on (I think... I might have to think some more...). As such, the measure we are interested in is not T(r), but T(r,s).

      I'll define an "optimal RE r for language L" to be an RE r that recognises L and T(r,s) <= T(r',s) for all r' recognising L and all s in L.

      It's possible that there is a language G, such that an r, r' recognising G and s,s' in G exist with the property that T(r,s) <= T(r'',s) and T(r',s') <= T(r'',s') for all r'' recognising G, but T(r,s)<T(r',s) and T(r',s')<T(r,s'). In otherwords, r is the fastest for s but not for s', and vice versa. Which is the "optimal" one for G?

      So "fastest" might not always be possible.

      As for the other point... Assuming that there is always an RE for a language matching my definition of "optimal", then why can't there be two? Why can't we have r != r' but T(r,s) = T(r',s) for all s? Then which should the optimizer return? Or does it matter? I could very easily see R, R' that optimize into r, r' respecively, for the same language. Your argument isn't flawed due to the believe in a "most optimal" RE, just that you believe there is a unique "most optimal" RE for any language.

      But it may be that in traditional REs (not Perl REs, which have...complications) the idea of a "fastest" RE may be nonsensical. The standard machines to match regular languages are DFAs, which have one state transition per input character. Therefore, the running time of any DFA is a function solely of the length of the input string, and not of the RE used to build it. "Optimal" DFA for a language usually refers to numbers of states, not speed.

      What do you mean by your question about the REO machine being able to make only algorithmic manipulations to the input RE? What other sort of manipulations were you thinking of?

        Actually it isn't so simple.

        First of all are we including compilation time in the performance? A DFA can be implemented with a complex analysis up front, but faster runtime, or by doing a simple analysis up front, and then interpret a bitmap of states at the end.

        The first can result in a compiled RE that is exponential in the size of the expression. The second cannot blow up at compile time that way, but the size of the RE definitely affects the runtime performance.

        In the real world, either can wind up being faster.

        Going beyond that, any smart RE engine (it would be safe to assume that Perl's is smart) can take advantage of the fact that long fixed substrings can be located and matched *faster* than you can scan a string. An RE engine that is able to notice then use such optimizations is very nice to have - and this is one of the reasons that simple Perl scripts can wind up beating custom C programs at text processing!

        (I would wonder if this is an example of a "non-algorithmic" improvement. :-)

        I was getting a bit disillusioned with the possibility of finding the fastest as well. However, as it's stated in Camel 2E, once the expression matches, the Perl Engine quits. So while in general, the execution of the RE is dependent on the RE and the string which it matches (For the sake of this argument, although not true in all cases in real life, I'm saying that anything that doesn't match takes up n amount of time. This is simply done since any failed string will take up the most time for any RE whether optimized or not), there will eventually be a regex such that, for every valid string in the language checked by the regex, it will perform optimally. It will not perform every execution in t seconds for every case, since it depends on the string; but of two equivalent REs (that share the exact same language), this optimized one will always execute faster than the suboptimal one. Fastest may be a pipe dream in all cases, but it could be possible to return the faster of all possibilities. At which point, we're now getting into the limitations of the implementation rather than anything else.

        I will concede that there may be a case where two equally optimal REs are available. This is entirely plausible, and I sometimes fall prey to a binary train of thought. Thus, while I've been reading replies to all of this, I've thought more as I go along, and since I did want to spur dialogue, I've let my thoughts wander down the ways everyone else is thinking to see how approachable this could be. Which of course leads me to your final two questions:

        Obviously, this whole discussion of language theory has mathematical rules under which it operates. I think everyone here is aware of what a regular expression optimizer runs up against simply by the fact that no one was able to say "Oh, you want to go look at x." So, my question was simply if heuristic manipulations would be allowed to the input RE. Basically, what if during implementation, a few "leap of faith" type heuristics were involved, but they also performed remarkably well. Does that affect the reception of my REO machine into the community since it doesn't always act as expected mathematically? ALL HAIL BRAK!!!

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://30411]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (4)
As of 2024-04-20 04:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found