http://www.perlmonks.org?node_id=809842

This was originally intended to be a CUFP, but I thought that the story behind the problem was at least as interesting as the solution; and, the more I wrote of the story, the more meditative it seemed. (I think that, even if the specific code doesn't catch your attention much, at least one of the linked topics might!) I hope that this was the right place for it to end up.

The question has (implicitly) rung out, on Perlmonks, Proggit, and, one must presume, the haunted nightmares of many: Are Perl regexes Turing complete?

This question caught the eye of yours truly, except that I confused the Turing completeness of regexes with the Turing completeness of substitutions, which led to an apparently unmotivated question and which required ikegami to set me straight.

Now I think that I've got the question: Can Perl regexes be used to recognise an arbitrary recursively enumerable language? The stirring answer: No, sort of *.

Why not? Well, a regex can be simulated by a linear-bounded automaton (that's the hyphenation, and I'll stick by it), and these can recognise only the context-sensitive languages (not all the recursively-enumerable ones, as a Turing machine can). Another way to say this is that regexes can only recognise languages in NSPACE(n). Thought of in this way, it's not so surprising: A regex has finitely many capturing groups, say k of them, each of which can be stored in space equal to the length, n, of the input (since they will, after all, only contain substrings of the input); so we only need space k*n to perform the ‘super-DFA’ part of Perl regex matching **. I suspect this upper bound is far too from sharp, and that Perl-5.8 regexes can't recognise balanced parentheses, which are only context free. What if we fix this by upgrading to Perl-5.10 regexes, which allow recursion? Well, then we can obviously at least recognise balanced parentheses, but I suspect that we're still not Turing-complete—indeed, still only context sensitive. Maybe someone more versed in CS than am I can confirm this.

Well, so how about the question which I understood in the first place: Can Perl substitutions be used to perform any computation that a Turing machine can, viewing the string on which they act as the input tape? Again: No, sort of *. After all, a substitution consists just of a match—which can be completed by an LBA, as above—followed by the assembly of the substitution string—which, again, involves only constant overhead (the literal, or constant, part of the match) over a constant multiple of the input (the part coming from interpolation of capture variables—any other variables are not affected by the match, and so may as well be regarded as constants). Thus, substitutions can still handle only CSLs.

Well, that's a little anti-climactic. So, what is true? Well, recursive Perl substitutions are Turing complete. As I pointed out elsewhere in the linked thread, that's known; but, hey, if I can figure out in a month's work what's obvious to MJD, then I figure that I'm doing all right. :-)

OK, so, that was a post full of negatives followed by a bold assertion; let's substantiate it. The obvious way to show Turing completeness is to simulate a Turing machine, but, man, that's been done to death. I heard the plea of cdiggins on LtU, who wondered why everyone prefers the λ calculus over the SKI combinator-calculus, and determined that I would not be among those hip and trendy kids who think only of the former. I quickly came up with:

perl -pe ' BEGIN { my $term = qr/(S|K|\((?-1)(?-1)\))/ } 1 while s/\(\(\(S($term)\)($term)\)($term)\)/(($1$3)($2$3))/ or s/\(\(K($term)\)($term)\)/$2/; '
which does a great job of reducing (fully parenthesised) terms from the calculus ****, but is a little unsatisfying because it requires recursive regexes, which I've just said that I don't know aren't already Turing complete. All right, so, moving on.

Well, we know that Conway's game of life is Turing complete, so maybe that's a good place to start; but it's played out on a 2-dimensional grid, and translating that 2-dimensional grid into the 1-dimensional ‘tape’ for a regex seemed to be a little hairy. So, a 1-dimensional version—well, that sounds like cellular automata. Everybody's favourite new kind of scientist, who studied these extensively, has a favourite—which we know, thanks to Matthew Cook, is Turing complete. So, let's simulate that! The problem is that there are 8 different rules for cell propagation, and running 8 different substitutions seems sort of profligate; can we get it down to 2, as we did for the SKI combinator-calculus above? If (as I did) we get stuck on that, can we at least show that, if we could do it, we'd have got an optimal result—that is, that a 1-regex substitution can never be Turing complete? I spent a few sleepless nights on this; I was sure that it wasn't possible—after all, one of Turing's criteria for a universal computer is that it should be able to make choices; but our sub-Perl-5.10 regexes can't, so we should have to build in at least one choice “manually”, on the level of multiple regexes. How to prove it, though? I thought that I could come up with some argument from Kolmogorov complexity, but no luck—because, it finally occurred to me, I was guilty of the same blinkered thinking as the designers of the early, “fixed-program” computers, and was trying to shove the entire program into the architecture. Why not instead let the program be recorded in the data?

Out of that idea came the following code for implementing rule 110 (or, rather, rule 124, which propagates to the right instead of the left):

perl -Mre=eval -nE <<CODE chomp; $_ = <<RULE . "---\n$_"; ^111 0^11 ^110 1^10 ^11 1^1 ^01 1^1 ^10 1^0 ^00 0^0 ^1 1 ^0 1 1^1 0 0^0 RULE BEGIN { my $nc_rule = qr/^ *\S+ *\S* *$/m; my $c_rule = qr/^ *(\S+) *(\S*) *$/m; } /\^.*$/ or say( ( split /\n/ )[-1] ) while s/$c_rule(?:\n$nc_rule)*\n---\n.*?\K\1/$2/; CODE
(The here-doc syntax is fake—at least for me, under bash, perl complains No code specified for -e—but I assume that someone out there knows how to fix it.) This accepts input a string of 0s and 1s, which is regarded as infinitely padded with 0s on both sides, and prints out its successive evolutes under rule 124. For example, here's a sample run, showing the evolution of the input 1 (which stands for ...0001000...):
1 11 111 1011 11111 100011 1100111 11101011 101111111 1110000011 10110000111 111110001011 1000110011111 11001110100011 111010111100111 ...
(It goes on forever, so don't wait for the surprise finish. :-) See the list of projects below!)

Emboldened by this success, I thought that perhaps I could tackle the original problem of reduction for the SKI combinator-calculus. To keep things simple (well, for me), I decided to use the Unlambda syntax, whereby ` is used as the notation for application, so that (SK) becomes `SK. I also wanted to include support for variables; again to keep life simple, I stuck with the austere CSer's set x, x', …. The code is as follows:

perl -Mre=eval -nE <<CODE chomp; $_ = <<RULE . "---\n$_"; + ^'>'+ ^ + ^> ^ . ^`>`.. ^ . ^S>S ^ . ^K>K ^ . ^x>x+ ^ - ^'>- ^ - ^> ^ ! ^`>!! ^ ! ^S> ^ ! ^K> ^ ! ^x>- ^ , ; ^'>', '; ^ , ; ^> ^ @ * ^`>`@@ `** ^ @ * ^S>S S ^ @ * ^K>K K ^ @ * ^x>x, x; ^ ^> ```S>``.@`.*^ ``K>.!^ RULE BEGIN { my $vs = qr/[^ >\n]*/; my $ss = qr/ *?/; my $c_rule = qr/(?>($vs)$ss($vs)$ss($vs)$ss>$ss($vs)$ss($vs)$ss($v +s)$)/m; my $nc_rule = qr/(?>(?:$vs$ss){3}>(?:$ss$vs){3}$)/m; } 1 while s/$c_rule(?:\n+$nc_rule)*\n---\n.*?\K\1(.*?)\2(.*?)\3/$4$7$5$8$6/; say ( ( split /\n/ )[-1] ); CODE
This accepts strings according to the grammar
t ::= S t ::= K t ::= xp p ::= 'p p ::= t ::= `tt
where t is the start symbol; p is another non-terminal; and S, K, x, ', and ` are terminals. It then reduces them according to the rules of the SKI calculus. Here is a sample run, showing that it correctly reduces the basic combinators in the BCKW calculus (the comments are just that, comments; they are not actually typed):
`````S`KSKxx'x'' # B = S(KS)K `x`x'x'' # Bxyz = x(yz) `````S``S`K``S`KSKS`KKxx'x'' # C = S(S(K(S(KS)K))S)(KK) = S(S(KB)S)(KK +) ``xx''x' # Cxyz = xzy ````SS`K``SKKxx' # W = SS(K(SKK)) ``xx'x' # Wxy = xyy
This handles only textual, not functional, equivalence; for example, it correctly reduces ```SKKx to x, but it does not recognise that ``SK``SKK can be replaced by ``SKK.

Note, incidentally, that, if we had genuine variable-width look-behind (instead of simulating it with \K), then we could get rid of Perl control structures and just hoist this into a single substitution, at the expense of considerably more wasted space; 1 while s/$orig/$repl/ would become s/(?<=0$orig)/0$repl/g, where 0 could be any symbol that does not occur in the input or any of our productions.

Some projects:

  • Formalise the idea that, even with recursion, regexes aren't Turing complete.
  • Give the cellular automaton a way to halt, probably by allowing the user to enter a “stop pattern” and halting when that stop pattern is an initial segment of the tape.
  • I'm actually using very little of the power of Perl regexes in the recursive substitutions. Do we need anything more than honest-to-goodness regular expressions? We can get rid of \K as described above. I think that we can get rid of capturing groups by ‘sliding’ each proposed match, together with its proposed replacement, along the text until we find an actual match (kind of like the Boyer–Moore algorithm?).
  • A related one: As things stand, it is essential to use non-greedy quantifiers and to rely on the semantics of Perl regex matching in some places, to ensure that the right rule is chosen, and applied in the proper place. (That is, the ordering of replacement rules, and the particular spot at which the replacement is applied, are both significant.) Choose a better set of replacement rules, so that we do not have this dependence. (This is probably mostly a matter of using more symbols, so that we can tell, for example, the first . in the template for ```Sttt apart from the second.)
  • Set up a recursive substitution to recognise functional equivalence of combinators (so that it could identify ``SKK and ``SKS as functionally equivalent).
  • Actually implement a universal Turing machine this way!

* What does ‘sort of’ mean? An obvious loophole is that, any time you can embed arbitrary Perl code, you've got as much computational power as Perl—that is to say, Turing completeness. Thus, the (?{ }) and (??{ }) escapes are right out—which is just as well, since they're considered harmful experimental anyway. Similarly, there'll be no naughty /e switch when we talk about substitutions. To be perfectly precise, I view regexes as formed from atoms by successive iterations of concatenation, alternation, Kleene star (greedy, non-greedy, or possessive), and positive and negative look-around assertions, where an atom is a single character (i.e., EXACT<c> for some c) or a back-reference ***. See “What do you mean, done?” in MJD's TPJ regex article for an explanation of how most Perl regular expression notations are just syntactic sugar around these primitives.
** This argument would go out the window if Perl capturing groups could capture more than once—that is, if 'ab' =~ /(.)*/ could somehow record that it had matched both a and b—which might just explain why this feature for which I've often wished isn't present. :-) Anyway, even in that case, an argument by structural recursion on the regex would show that it must always halt (given the Perl optimisation whereby a 0-width match is not tried more than once), so that such ‘enhanced-capture’ regexes could still recognise only recursive languages.
*** This definition is actually a little worrying, because the meaning of the atom \1 (which, by itself, never matches anything, per a discussion in the CB yesterday with ambrus, bart, ikegami, moritz, and ysth) changes when it enters into a concatenation; but I leave the details to someone else to work out. :-)
**** You might worry that I've left out I, but it's not necessary, since it's functionally equivalent to SKK.

UPDATES: • Sorry, forgot the <readmore>. • Added another project. • Oops, I'd over-escaped the perlre links. I think that they should work now (except for the “sub-Perl-5.10” one anchored at (?(condition)yes-pattern|no-pattern), which I can't figure out how to handle, because of the |), but please leave a comment or PM me if any of the links are broken.
UPDATE: I realised that it might be a bit unclear how actually to use the provided code, so I added some sample runs.
UPDATE: I didn't know about Are Perl patterns universal? when I posted, but it looks like I'm at least 5 years late to the party—even the apparently very specialised Rule-110-for-Turing-completeness party. (Combinators and other esoterica made an appearance, too.) In fact, blokhead already observed what is in some sense the central point, that a regex can be simulated by an LBA.

Replies are listed 'Best First'.
Re: Turing completeness and regular expressions
by blokhead (Monsignor) on Nov 28, 2009 at 17:31 UTC
    Formalise the idea that, even with recursion, regexes aren't Turing complete.
    I think your approach of looking at the space required to simulate recursive regexes is the most straight-forward approach.

    A back-of-the-envelope calculation is as follows: the "state" that describes the regex evaluation consists of a stack. Each level in the stack corresponds to a nonterminal (i.e., regex to match at this recursive level), a position in the original string, and any captures and alternations already made by the regex. You can stop simulating this branch of the regex match every time the stack contains a repeated configuration, since if the regex will eventually match, then it must also match using a stack with no repeats. So you are only bound by the number of distinct stack configurations. Say, (# of nonterminals * length of string * 2^(# alternations in the regex) * (length of string)^(2 * # of captures)). The last term is to store a "start" and "end" pointer for each capture.

    Whatever this comes out to be, it is some fixed function of the input size (I guess in the EXPSPACE family), so there is some fixed space bound. By the Space hierarchy theorem, there are always languages that require more space.

    Do we need anything more than honest-to-goodness regular expressions?
    Probably not. You should be able to evaluate an SK-expression from the bottom up (instead of top-down) if it is fully parenthesized. Something like this:
    s/\(S([^()])([^()])([^()])\)/($1($3($2$3)))/; s/\(K([^()])([^()])\)/$1/; s/\(I([^()])\)/$1/;
    i.e., always evaluate the "innermost" expression, which will be the one without parentheses. I think something like this may work though I've not tested it.
    Set up a recursive substitution to recognise functional equivalence of combinators (so that it could identify ``SKK and ``SKS as functionally equivalent).
    Equivalent to the halting problem in the general case. Let an SK-expression simulate a Turing machine M on input x. Is it equivalent to the SK-expression that always returns true?

    blokhead

      Probably not. You should be able to evaluate an SK-expression from the bottom up (instead of top-down) if it is fully parenthesized. Something like this:
      s/\(S([^()])([^()])([^()])\)/($1($3($2$3)))/; s/\(K([^()])([^()])\)/$1/; s/\(I([^()])\)/$1/;
      i.e., always evaluate the "innermost" expression, which will be the one without parentheses. I think something like this may work though I've not tested it.
      I think that this does not work. For example, it does not reduce ((K((SK)K))K) to (SK)K (UPDATE: or, better, (((K((SK)K))K)x) to x).
      Set up a recursive substitution to recognise functional equivalence of combinators (so that it could identify ``SKK and ``SKS as functionally equivalent).
      Equivalent to the halting problem in the general case. Let an SK-expression simulate a Turing machine M on input x. Is it equivalent to the SK-expression that always returns true?
      Yes, that's what I thought; but Stenlund provides a finite system of equations for deciding—well, something. I thought that it was functional equivalence, but I must have misunderstood. I'll look it up when I've got my copy. (Incidentally, the combinator is KK.)
        You may be right about bottom-up evaluation of SK expressions. Larger terms must be treated as first-class objects to be passed around, etc.

        But we both missed the obvious way to just use "plain" regex substitutions to simulate Turing-completeness: unrestricted (type-0) grammars. They are basically defined as the repeated evaluation of plain regex substitutions, and are Turing-complete. A universal TM converted to a type-0 grammar will only have a finite number of substitution rules, and you can simulate it with a substitution + finite lookup table (or a finite # of separate s/// statements).

        blokhead

      Set up a recursive substitution to recognise functional equivalence of combinators (so that it could identify ``SKK and ``SKS as functionally equivalent).
      Equivalent to the halting problem in the general case. Let an SK-expression simulate a Turing machine M on input x. Is it equivalent to the SK-expression that always returns true?

      Ah, now I see what I had wrong.

      Stenlund mentions in “Combinators, λ-terms, and proof theory” (and I get the impression from the historical section of Introduction to higher-order categorical logic that this dates back to Curry) a finite equational axiomatisation of what I'm calling functional equivalence (and he calls extensional, or η-, equality), namely:

      S
      ```Sxyz = ``xz`yz
      K
      ``Kxy = x
      I
      `Ix = x
      c1
      `S`KI = I
      c2
      ``BS`S`KK = K
      c3
      ```B`B``B`BSSBS = ``P```B`BSBSS
      c4
      ``B``S``BBS`KKK = `BK
      c4
      `S`KI = ``SB`KI
      where he takes I as a primitive combinator, B = ``S`KSK is as above, and ````Pfgxy = ``f`gx`gy. (One usually writes Ψ instead of P, but I don't know how to get entities inside code tags. :-) )

      However: Having a finite axiomatisation of a theory is not the same as being able to decide the truth of an arbitrary proposition in that theory—witness Peano arithmetic—and that, of course, is where my ambition smacks against your refusal to solve the halting problem. :-)