|Problems? Is your data what you think it is?|
Re: Analysis of Regular Expressionsby rubasov (Friar)
|on Mar 26, 2010 at 03:21 UTC||Need Help??|
This post is a little late, but I've just came across Regexp::Compare and remembered your OP. Putting aside the theoretical complaints on complexity consider the following code:
In the output the most specific regex comes first and the most generic comes last. If you put more than one equivalent regexes into your input then you get a nice warning on loops:
I've used topological ordering because the more-specific relation on regexes is not a total ordering (I hope I'm using the correct math terms), however this still guarantees that a more specific item comes earlier therefore you won't get the "false" (more generic) match first.
As a sidenote: while playing with this, I've realized that your sample regexes are a little "dirty". Let's take \AHi and \b(Hi(ya)?|Hello|Greetings)\b for instance. None of these two is more specific than the other, because none of the sets of every possible match is a subset of the other. (Consider the strings His and Foo Hi.)
Of course Regexp::Compare uses heuristics, I suppose something along like this:
But I cannot tell how well it performs for real regexes.
I hope this is useful for you.