http://www.perlmonks.org?node_id=436068

anthonywatson has asked for the wisdom of the Perl Monks concerning the following question:

Say I have a regular expression like:
A[CT]A[CT]
This can match 4 possible strings ACAC, ACAT, ATAC, ATAT, For any given regular expression I would like a script that lists all possible strings it could match. Such a script would have utility in "back translating" amino acids to proteins, or DNA sequences with SNPs to possible exact sequences. I do not see an easy way to accomplish this, and have to think the problem has been solved 100's of times. Thanks in advance for help

Replies are listed 'Best First'.
Re: back translating regular expressions
by Roy Johnson (Monsignor) on Mar 03, 2005 at 04:27 UTC
    Let's just change your notation slightly:
    my $str = 'A{C,T}A{C,T}'; print "$_\n" for glob $str;

    Caution: Contents may have been coded under pressure.
      ++

      glob is great for the simpler regex expansions. I'm not sure how many regex features you need...You'll probably hit the wall somewhere before zero-width (posi|nega)tive look-(ahead|behind)s.

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

Re: back translating regular expressions
by Tanktalus (Canon) on Mar 03, 2005 at 04:15 UTC

    As a general case, or as a specific case of just the brackets? If it's a relatively specific case of just brackets, you should be able to adapt String expansion script - I think that my solution in this thread is particularly well-suited to this special case, but you could say I'm biased. :-) Actually, more to the point is that I have a certain tool in my toolbox, and that influences how I see the problem: as a specific type of nail. Think of a callback like this:

    my @possibilities = expand_string("A[CT]A[CT]", sub { split //, $_; });
    Of course, you'd have to change expand_string to use brackets rather than parenthesis and braces as it does in that node.

    For the absolute general case, I would have to say that * and + would kill you. Or at least kill whatever program tried to calculate it. So you'd have to define your minilanguage as a subset of regular expressions that did not include unlimited-length wildcards. Even using the "." character is painful - again, some definition would be required about the actual range of this - you probably don't want all of unicode.

    But, it would be an interesting problem to solve - to see what type of limitations would need to be placed on the regular expression to allow for this to allow the regular expression to be solved in this way, and to see the actual solution.

Re: back translating regular expressions
by Fletch (Bishop) on Mar 03, 2005 at 02:07 UTC

    ". . . and then I fed it 'a.*b' and my computer just locked up." See also "Halting Problem, The". :)

    Having said that, if you can wait a few more weeks until Higher Order Perl (ISBN 1558607013) comes out I believe one of the stream generator examples is generating possible matches for a regex. That lets you generate matches one at a time in a manner where you can stop at any time.

      What on earth does that have to do with the halting problem? A computer going into an infinite loop is not the halting problem. If you meant that you can't tell whether a regular-expression denotes an infinite language, that's not quite right either. It's infinite if and only if it contains an infinite quantifier (+, *, {m,}), which is an easy property to check.

      blokhead

        /a/
        does not contain a quantifier, but there are an infinite amount of strings it will match. And if you only want to consider anchored strings:
        my $q1; $q1 = qr /a(??{$q1})?/; my $q2 = qr /^$q1$/;
        matches all strings containing nothing but a's. But it doesn't use any of the quantifiers you mentioned.

        Also,

        /^(?!)*$/
        contains a * quantifier, but it doesn't actually match any string. Perhaps you want to discount lookahead. Then I give you:
        /^(|)*$/
        It'll give a warning, but all it matches is the empty string. But it does have a * quantifier.
Re: back translating regular expressions
by manav (Scribe) on Mar 03, 2005 at 05:34 UTC
    Im not sure how reliable this is but you could try Regexp::Genex....

    Manav
Re: back translating regular expressions
by perlfan (Vicar) on Mar 03, 2005 at 18:47 UTC
    If you want to do this from scratch, you'll have to first parse regex using some grammar. You can use Parse::RecDescent, but make sure the grammar is right recursive. Once you get the parse tree, you can construct possible matches. If you want to handle kleene closures (*), then I would only generate the possibilies where there is no S and 1 S - don't go down the S* road unless you had to.

    This is a very interesting problem, and I am currently working on a set of modules that let you play around with finite automatas for grad school. I have an NFA->DFA going, and I am about to do a RE->DFA; one of my additional goals is to iteratively generate a set of strings that can be produced given a DFA or RE.