go ahead... be a heretic | |
PerlMonks |
Re: Analysis of Regular Expressionsby rubasov (Friar) |
on Mar 26, 2010 at 03:21 UTC ( [id://831020]=note: print w/replies, xml ) | Need Help?? |
This post is a little late, but I've just came across Regexp::Compare and remembered your OP. Putting aside the theoretical complaints on complexity consider the following code:
In the output the most specific regex comes first and the most generic comes last. If you put more than one equivalent regexes into your input then you get a nice warning on loops:
I've used topological ordering because the more-specific relation on regexes is not a total ordering (I hope I'm using the correct math terms), however this still guarantees that a more specific item comes earlier therefore you won't get the "false" (more generic) match first. As a sidenote: while playing with this, I've realized that your sample regexes are a little "dirty". Let's take \AHi and \b(Hi(ya)?|Hello|Greetings)\b for instance. None of these two is more specific than the other, because none of the sets of every possible match is a subset of the other. (Consider the strings His and Foo Hi.) Specifically GNU tsort is probably a better choice than Sort::Topological as it does not die on cyclic graphs. Of course Regexp::Compare uses heuristics, I suppose something along like this: But I cannot tell how well it performs for real regexes. I hope this is useful for you.
In Section
Seekers of Perl Wisdom
|
|