http://www.perlmonks.org?node_id=284513

bsb has asked for the wisdom of the Perl Monks concerning the following question:

Is there a module which can generate strings that satisfy a certain regular expression?
qr/^ab?(c|d)$/ => qw/ac abc ad abd/;
My only lead is
$ perl -e 'use re "debug"; /^ab?(c|d)$/'
which I got from GraphViz::Regex, which parses the output to graph it.
Anything else I should look at?

Obviously there are issues with infinite languages, what '.' does, unanchored expressions, and Perl's (?..) features. But let's be "Can Do" people :)

Brad

Replies are listed 'Best First'.
Re: Regexp generating strings?
by larsen (Parson) on Aug 18, 2003 at 09:32 UTC
    Another module that could help you get an idea of the set represented by a RE is YAPE::Regex::Explain.
      The YAPE::Regex family looks great for thorough regex parsing. Thanks.
Re: Regexp generating strings?
by gjb (Vicar) on Aug 18, 2003 at 09:45 UTC

    You may want to have a look at String::Random. It doesn't do full regular expressions, but at least it generates random strings satisfying a "pattern" that can be a subset of real Perl regular expressions.

    Hope this helps, -gjb-

Re: Regexp generating strings?
by Abigail-II (Bishop) on Aug 18, 2003 at 09:34 UTC
    Is there a module which can generate strings that satisfy a certain regular expression?

    No.

    That's a very hard question. For many types of grammars, this question is known to be non-solvable. IIRC, this includes regular expressions (the real ones, not the Perl ones). Perl regular expressions are hard to categorize, but they are certainly a superset of classical regular expressions.

    This of course doesn't mean it's not possible for specific regular expressions.

    Abigail

      Warning: theoretical but impractical stuff follows. I think that for plain regular expressions (not the Perl stuff) it is possible to check whether a RE matches any string at all, and in the process of doing that you (can) come across a lot of matching strings:

      • Any regular expression can be written as a deterministic finite automaton (DFA).
      • Every DFA either accepts no word at all or accepts at least one word whose length is less than or equal to the number of states it has. This theorem has some name I forgot...
      • If a DFA accepts a word of a length greater than the number of states, it has a loop.

      If all of the above are true (and I remember them being true), then you "only" have to enumerate all strings over the alphabet from length 0 up to the number of states of the DFA representing your RE. After that, you can go up to 2 * the number of states of the DFA to see whether it has loops.
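      A minimal sketch of that enumeration, as an illustration only. It does not build the DFA: the length bound ($max_len) is a hand-picked stand-in for the state count, the alphabet is supplied by hand, and matching_strings is a hypothetical helper name.

```perl
use strict;
use warnings;

# Return every string over @$alphabet with length 0..$max_len that matches $re.
# Brute force: the number of candidates grows as |alphabet| ** $max_len.
sub matching_strings {
    my ($re, $alphabet, $max_len) = @_;
    my @found;
    my @level = ('');                      # all candidates of the current length
    for my $len (0 .. $max_len) {
        push @found, grep { $_ =~ $re } @level;
        last if $len == $max_len;
        # Extend every candidate by one letter to get the next length.
        my @next;
        for my $prefix (@level) {
            push @next, $prefix . $_ for @$alphabet;
        }
        @level = @next;
    }
    return @found;
}

# Assumption: 4 is simply "long enough" for this pattern, not a computed state count.
my @words = matching_strings(qr/^ab?(c|d)$/, [qw(a b c d)], 4);
print "@words\n";    # prints: ac ad abc abd
```

      Even with four letters and a bound of 4 this already tests 341 candidates, which is the blow-up the rest of this post is about.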

      The number of states in a DFA does not directly relate to the number of characters in your RE. If I remember correctly, it can be up to 2^n, with n the number of characters from your alphabet in your RE (that's a rough number, as I equate the NFA with the RE here).

      This means that, for nontrivial REs, you will have a large space of strings to search. It might be easier to randomly generate strings and check whether they match, maybe guided by some heuristics, like alphabetical chars within the RE.
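      A rough sketch of that generate-and-test idea, assuming the "heuristic" is nothing more than taking the alphabet from the lowercase letters that appear in the pattern source. The helper name random_matches and the parameters (maximum length 4, 5000 tries) are arbitrary choices for illustration.

```perl
use strict;
use warnings;

# Draw random strings over @$alphabet and keep the distinct ones that match $re.
sub random_matches {
    my ($re, $alphabet, $max_len, $tries) = @_;
    my %seen;
    for (1 .. $tries) {
        my $len       = int rand($max_len + 1);
        my $candidate = join '', map { $alphabet->[ rand @$alphabet ] } 1 .. $len;
        $seen{$candidate} = 1 if $candidate =~ $re;
    }
    my @found = sort keys %seen;
    return @found;
}

my $source = '^ab?(c|d)$';
my $re     = qr/$source/;

# Heuristic: only letters that occur literally in the pattern are tried.
my %letters  = map { $_ => 1 } $source =~ /[a-z]/g;
my @alphabet = sort keys %letters;

my @found = random_matches($re, \@alphabet, 4, 5000);
print "@found\n";
```

      Everything it prints is guaranteed to match, but nothing guarantees that every matching string is found; more tries only make a miss less likely.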

      perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
      $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
      ($c = $d->accept())->get_request(); $c->send_response( new #in the
      HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
      For Perl regexes, it seems rosy until you reach look-aheads. /^(?=$rx1)$rx2/ needs to find the intersection between the two languages defined by $rx1 and $rx2, if it exists. Without the ^ anchor, $rx1 is floating in $rx2, and it becomes a substring problem instead of an intersection.
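      A small brute-force check of the intersection reading, as a sketch only. The two sub-patterns are made up for the example, and each carries its own $ anchor so that the look-ahead has to cover the whole string; all_strings is a hypothetical helper.

```perl
use strict;
use warnings;

# All strings over @$alphabet of length 0..$max_len (brute force).
sub all_strings {
    my ($alphabet, $max_len) = @_;
    my @all   = ('');
    my @level = ('');
    for (1 .. $max_len) {
        my @next;
        for my $prefix (@level) {
            push @next, $prefix . $_ for @$alphabet;
        }
        push @all, @next;
        @level = @next;
    }
    return @all;
}

# Hypothetical sub-patterns, both anchored at the end.
my $rx1 = qr/[ab]*b$/;      # ends in "b"
my $rx2 = qr/a[ab]{2}$/;    # "a" followed by exactly two more letters

my @universe  = all_strings([qw(a b)], 4);
my @combined  = grep { /^(?=$rx1)$rx2/ } @universe;       # look-ahead form
my @intersect = grep { /^$rx1/ && /^$rx2/ } @universe;    # explicit intersection

print "combined:  @combined\n";     # prints: combined:  aab abb
print "intersect: @intersect\n";    # prints: intersect: aab abb
```

      Filtering a finite universe like this says nothing about how to generate the intersection directly, which is the hard part.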

      It may be solvable, but not by me.

Re: Regexp generating strings?
by CombatSquirrel (Hermit) on Aug 18, 2003 at 16:17 UTC
    I liked the problem and so I tried to come up with a solution to it. Because I didn't like the idea of having to re-invent the Perl RegEx parser completely, there are a number of limitations to my program:
    Only the following are allowed:
    • literal characters
    • capturing parens
    • OR (|)
    • the following quantifiers in greedy forms:
      • ?
      • {x,y} only with x and y specified
    This means especially that the following are not allowed:
    • character classes, including ISO ones
    • escaped characters
    • star and plus
    • the almighty dot
    • lookahead and lookbehind
    • and many others...
    And the code is, of course, not optimized, and I am not completely sure whether it is bug-free. Any comments are welcome and I would also be highly interested in a Perlgolf version of this one :-). Well, here it is:
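    For illustration only, and explicitly not CombatSquirrel's program: a minimal sketch of a generator under the same restrictions (literals, parens, |, ? and {x,y}). All sub names are hypothetical, and it assumes that ^ and $ appear only at the very ends of the pattern, where they are simply stripped.

```perl
use strict;
use warnings;

# Concatenate every string in @$left with every string in @$right.
sub cross {
    my ($left, $right) = @_;
    my @out;
    for my $l (@$left) {
        push @out, $l . $_ for @$right;
    }
    return \@out;
}

# Remove duplicates while keeping first-seen order.
sub uniq { my %seen; return grep { !$seen{$_}++ } @_ }

# alternation := sequence ('|' sequence)*
sub parse_alt {
    my ($src, $pos) = @_;                 # $pos is a reference to the read position
    my @out = @{ parse_seq($src, $pos) };
    while ($$pos < length($src) && substr($src, $$pos, 1) eq '|') {
        $$pos++;
        push @out, @{ parse_seq($src, $pos) };
    }
    return [ uniq(@out) ];
}

# sequence := (atom quantifier?)*
sub parse_seq {
    my ($src, $pos) = @_;
    my $out = [''];
    while ($$pos < length($src)) {
        my $ch = substr($src, $$pos, 1);
        last if $ch eq '|' || $ch eq ')';
        my $atom;
        if ($ch eq '(') {
            $$pos++;
            $atom = parse_alt($src, $pos);
            die "missing ')'\n" unless substr($src, $$pos, 1) eq ')';
            $$pos++;
        }
        else {
            # Everything outside the supported subset is rejected.
            die "unsupported character '$ch'\n" if $ch =~ /[\\.*+\[\]{}?^\$]/;
            $atom = [$ch];
            $$pos++;
        }
        # Optional quantifier; '?' is treated as {0,1}.
        my ($min, $max) = (1, 1);
        my $rest = substr($src, $$pos);
        if ($rest =~ /\A\?/) {
            ($min, $max) = (0, 1);
            $$pos++;
        }
        elsif ($rest =~ /\A(\{(\d+),(\d+)\})/) {
            ($min, $max) = ($2, $3);
            $$pos += length $1;
        }
        # Collect the atom repeated $min..$max times.
        my @rep;
        my $power = [''];
        for my $k (0 .. $max) {
            push @rep, @$power if $k >= $min;
            $power = cross($power, $atom) if $k < $max;
        }
        $out = cross($out, [ uniq(@rep) ]);
    }
    return $out;
}

# Entry point: returns the list of strings the restricted pattern can match.
sub strings_for {
    my ($pattern) = @_;
    (my $src = $pattern) =~ s/\A\^//;     # assumed: anchors only wrap the whole pattern
    $src =~ s/\$\z//;
    my $pos = 0;
    my $out = parse_alt($src, \$pos);
    die "unexpected ')'\n" if $pos < length($src);
    return @$out;
}

print join(' ', strings_for('^ab?(c|d)$')), "\n";    # prints: ac ad abc abd
```

    Star, plus, the dot, classes and escapes die with "unsupported character", which keeps the generated set finite.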
Re: Regexp generating strings?
by bsb (Priest) on Aug 27, 2003 at 08:53 UTC