http://www.perlmonks.org?node_id=829117

PetaMem has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

this one is certainly a little difficult/hairy/tricky and I'm aware, that there probably can't be a definitive answer. In fact, I had a hard time of even deciding whether to put this into Seekers or Meditations.

I'll need a function that will return either a "genericity" - or - "specifity" of a regular expression. The idea is, that according to a measure of "specifity", the following regular expressions are sorted from the most specific to the most generic:

1) \AHi\z 2) \AHi 3) Hi 4) \b(Hi(ya)?|Hello|Greetings)\b 5) (Hi(ya)?|Hello|Greetings) 6) .*

So in other words, a regular expression is more specific than another, if it "matches less" than that other rx. (well - and there's the problem - it is questionable if 4) isn't more specific than 3) )

Any creative ideas how to achieve this? Oh - yes and the computation of the "genericity/specifity" should be fast. The best I could come up so far was to dissect each regular expression with other regular expressions and adding/subtracting "weight" at the occurence of different metacharacters with very limited success so far.

So if anyone has a good idea or maybe a pointer to something similar that has been done before, I'd love to see that.

Bye
 PetaMem
    All Perl:   MT, NLP, NLU

Replies are listed 'Best First'.
Re: Analysis of Regular Expressions
by Ratazong (Monsignor) on Mar 17, 2010 at 10:04 UTC

    Three possibilities on creating a regex-generality-rating come into my mind:

    1. generate all strings of a given length L and count how many are matched by the regex
    2. use the regex to calculate the number of all possible strings (up to a given length?!)
    3. do some inexact estimation

    1. would be easy to implement (generate all strings, match) - but is probably much too slow (just counting the letters and numbers would give you 26+26+10 different characters at each position of your string ...
    ... so this approach is not feasible :-(

    2. is more tricky - you need to analyze the regex, but this is possible (see the previous answers, or here for some tools for this.)

    I would then calculate two ratings for this and combine them in a suitable way (however suitable looks like ;-)

    • a freedom rating:
      • each time you encounter some freedom (e.g. a-z you multiply the rating by the # of possibilities (26 in this example)
      • this covers stutations like OR, ranges, numbers of occurences {2-5}, ? ...
    • a fixed rating:
      • this counts the number of fixed characters in the regex
      • I would use this to distinguish between "Hi" and "Hello" - having both the same "freedom", but "Hello" is more specific
    However I see some problems with this approach:
    • how to "tweak" the formula to identify that all the following are the same (and have the same rating)
      Hi = ^.*Hi = ^.*.?Hi = .*Hi.* = ^.*.?.*Hi.*$
      you might overcome this by adding "^.*" and ".*$" to your regexes before analysis ...
    • how will this work with complex regexes (e.g. lookaheads/lookbehinds or even (?{ code }) )
    • ...

    3. If I had the task you describe, I'd first check my possible input-regexes (maybe most of them are just simple?) and check if there is an issue if some are not ordered correctly. Probably it is possible to create a simplified formula that works well in most cases ...

    HTH, Rata (very interested on how you sort this problem out!)

      1. would be easy to implement (generate all strings, match) - but is probably much too slow (just counting the letters and numbers would give you 26+26+10 different characters at each position of your string ...

      Only 62 letters and numbers? That's so limiting.... That doesn't even cover most European languages. Currently, Perl has thousands of characters that are considered letters and digits.

        Or any of the official North American languages. It doesn't cover English (naïve), French (fête) or Spanish (señor).

      Rata,

      ok, it's clear that solution 1) has exponential runtime and space requirements (size of the charset doesn't even matter - (well until JavaFan points out that a charset of 1 char would...) )

      I basically went for solution 2) before, but only with some half-baked heuristic constructs, so probably a more high-level description of the problem is in place: I'm working on a chatbot demo which has to catch some phrases.

      Thats a simple pattern matching just before any parsing, and delivers decent speed. And the whole point of this discussion and my endeavour is, that the set of the regular expressions gets extended and it is not feasible (at least I haven't come up with a feasible way) to sort them by "do-what-i-mean" manually.

      Doing a trivial "sort by length" has actually worked very well for a set of up to 500-600 rules. But now I start getting false matches ONLY because some "more generic" regular expressions are handled before the more specific ones.

      Under the assumption, that (a|b|c) is more generic than (a), it is clear, that the simple "by length" ordering will result in the wrong order, because I want to process the "specific" rules first, and the "generic" last.

      What has helped:

      • removing e.g. named matches from length considerations, making e.g. (?<name>...) 9 chars shorter
      • Having an alternation like (a|b|c) being considered as just having length of 1, same for character classes
      • actually subtracting a length of 1 for every wildcard.

      Maybe the best solution could be to give every rule an identifier (which has happened anyway), and add some information "try this before ruleX". Which is starting to look like a parser. :-)

      Bye
       PetaMem
          All Perl:   MT, NLP, NLU

        Hi PetaMem!

        Thanks for that additional information! Chatbots are a really interesting topic!

        Looking at your specific problem (and not the general generalicity of regexes) makes me wonder if the empirical approach (do a check on how many possible strings are matched by the regex) really doesn't work. However with some modifications:

        • Your problem doesn't seem to be real-time - and the order of your rules seem to be static:
          therefore you could do the generalicity-rating in some offline preprocessing phase, e.g. by letting your computer work on it all night/weekend/holliday
        • Instead of looking at all possible input strings, just narrow it to the expected data. You will probably have tons of chat-logs which you can use
          So your rating could be: how many matches are found by a regex in a given set of logs

        Regarding the high number of possible characters (as also mentioned by JavaFan): Here you could do some preprocessing, e.g. by replacing the german A-Umlaut by ae. This will also help with some other "languages" like leetspeak ... your chatbot seems to get confused when greeted by a friendly "h3110" ;-)

        However I fear that any automatic ordering would just be one criteria for determining the order of the rules. You will probably add additionally some rating done by a human. And the idea of giving additional context to the rule (like you wrote: try this before ruleX ...) sounds great to me. Have you also experimented with randomness? (Using a random order in case of several rules have a similar rating.) That way the answers might not be so predictable...

        HTH, Rata

      > # use the regex to calculate the number of all possible strings (up to a given length?!)

      you know it's perfectly possible to construct valid regexes which won't terminate before all suns have burned out...

      So your empirical approach is only practicable by limiting the regexes to a reasonable subset.

      One thinkable way could be limiting to pure state machines and exploring feasible transits between states.

      But this is highly speculative without a detailed question from the OP, IMHO it's a case of XY!

      Cheers Rolf

Re: Analysis of Regular Expressions
by LanX (Saint) on Mar 17, 2010 at 09:50 UTC
    I have to admit that I don't fully understand what you want ... (I'm not even sure you do... ;-)

    This sounds like an automatic analysis of programs, which is often proved to be impossible.

    Anyway maybe the discussion in this thread "Regexp generating strings?" and this module "Regexp::Genex" can be of help for you.

    Cheers Rolf

Re: Analysis of Regular Expressions
by JavaFan (Canon) on Mar 17, 2010 at 10:56 UTC
    So in other words, a regular expression is more specific than another, if it "matches less" than that other rx.

    Well, apart from the fact that this question isn't even well formulated (for instance, all the given regexes, except the first one, match an countable, infinite set of strings - and you cannot order such sets on size), I doubt the question is even computable. I don't think even if the simpler question "given two (Perl) regexes, determine whether they match the same set of strings" is computable.

    Oh - yes and the computation of the "genericity/specifity" should be fast.
    Hahahahaha. It's a little early for April 1.
      > > Oh - yes and the computation of the "genericity/specifity" should be fast.

      > Hahahahaha. It's a little early for April 1.

      LOL, JavaFan++ =)

      Cheers Rolf

        Uh.. boys... (JavaFan and LanX) gimme a break...

        I mean yes, I got my master in CS 13 years ago and there's no need to remind me of Gödel, the halting problem and similar.

        And YES, you are theoretically right. And NO, you are practically (especially JavaFan) way off. There are no countable infinite sets on finite computers, and if you have a FINITE SET of input strings, the cardinality of the sets that represent the results of the presented regexs may very well differ and thus be sortable. So your ivory tower arguments are mostly for the dustbin...

        I know, that for a Java Fan it might be very hard to adopt the Perl-inherent DWIM scheme when interpreting questions, but why LanX jumps on that bandwaggon is beyond me.

        Fotunately I got already some interesting hints in this thread and am working on a SOLUTION. Yes - will be presented for discussion.

        Bye
         PetaMem
            All Perl:   MT, NLP, NLU

Re: Analysis of Regular Expressions
by Anonymous Monk on Mar 17, 2010 at 09:09 UTC
Re: Analysis of Regular Expressions
by jffry (Hermit) on Mar 17, 2010 at 20:24 UTC

    Assumptions:

    • You have access to a large subset of the files that these regular expressions will be used against.
    • OR
    • If these regular expressions are simply being ranked in the abstract, you have some means of constantly accessing a variety of real-life sample files.

    Write a daemon that continuously runs each regular expression against an ever growing list of files. The daemon updates a table, and each row contains 2 columns: the regular expression and the average line count.

    Your output program simply sorts the table on the average line count column. So it is quick in that regard. However, as the daemon runs each regular expression against more and more files, the ranking may change.

    Obviously, newly added regular expressions will have a more volatile rank compared to older ones that have been run against thousands of files. To combat this problem, you could determine a minimum file comparison quantity before the regular expression shows up in the table. For speed, you could have the daemon give priority to newly added regular expressions until their rank stabilizes. In fact, these two points should be configurable as tuning parameter of the daemon.

    What I like about this approach is that it throws all the theoretical junk out the door. Brute force can be ugly, but then again, the map is not the territory, and brute force reveals the territory.

Re: Analysis of Regular Expressions
by rubasov (Friar) on Mar 26, 2010 at 03:21 UTC
    This post is a little late, but I've just came across Regexp::Compare and remembered your OP. Putting aside the theoretical complaints on complexity consider the following code:
    use strict; use warnings; use IPC::Run3; use Regexp::Compare qw( is_less_or_equal ); my @re = ( qr/asdf/, qr/asd/, qr/as/, qr/a/, qr/./, qr/.*/, ); my $stdin = ''; my $stdout = ''; my $stderr = ''; for my $i ( 0 .. $#re ) { for my $j ( 0 .. $#re ) { $stdin .= "$i $j\n" if is_less_or_equal( $re[$i], $re[$j] ); } } # tsort (GNU coreutils) 8.4 run3 [ 'tsort' ], \$stdin, \$stdout, \$stderr; print $stderr; my @idx = split /\s+/, $stdout; print "$_\n" for @re[@idx];

    In the output the most specific regex comes first and the most generic comes last. If you put more than one equivalent regexes into your input then you get a nice warning on loops:

    my @re = ( qw/./, qw/./, qw/.+/, );
    in $stderr: tsort: -: input contains a loop: tsort: 0 tsort: 2 tsort: 1 tsort: -: input contains a loop: tsort: 0 tsort: 2 tsort: -: input contains a loop: tsort: 1 tsort: 2

    I've used topological ordering because the more-specific relation on regexes is not a total ordering (I hope I'm using the correct math terms), however this still guarantees that a more specific item comes earlier therefore you won't get the "false" (more generic) match first.

    As a sidenote: while playing with this, I've realized that your sample regexes are a little "dirty". Let's take \AHi and \b(Hi(ya)?|Hello|Greetings)\b for instance. None of these two is more specific than the other, because none of the sets of every possible match is a subset of the other. (Consider the strings His and Foo Hi.)

    Specifically GNU tsort is probably a better choice than Sort::Topological as it does not die on cyclic graphs.

    Of course Regexp::Compare uses heuristics, I suppose something along like this:

    hypothesis: $re1 <= $re2 is_less_or_equal( $re1, $re2 ) specific -> $re1 $re2 <- generic if exists $str that $str =~ /$re1/ and $str =~ /$re2/ -> no consequence $str =~ /$re1/ and $str !~ /$re2/ -> falsifies hypothesis $str !~ /$re1/ and $str =~ /$re2/ -> corroborates hypothesis $str !~ /$re1/ and $str !~ /$re2/ -> no consequence
    But I cannot tell how well it performs for real regexes.

    I hope this is useful for you.