This was originally intended to be a CUFP, but I thought that the story behind the problem was at least as interesting as the solution; and, the more I wrote of the story, the more meditative it seemed. (I think that, even if the specific code doesn't catch your attention much, at least one of the linked topics might!) I hope that this was the right place for it to end up.
The question has (implicitly) rung out, on Perlmonks, Proggit, and, one must presume, the haunted nightmares of many: Are Perl regexes Turing complete?
This question caught the eye of yours truly, except that I confused the Turing completeness of regexes with the Turing completeness of substitutions, which led to an apparently unmotivated question and which required ikegami to set me straight.
Now I think that I've got the question: Can Perl regexes be used to recognise an arbitrary recursively enumerable language? The stirring answer: No, sort of ^{*}.
Why not? Well, a regex can be simulated by a linearbounded automaton (that's the hyphenation, and I'll stick by it), and these can recognise only the contextsensitive languages (not all the recursivelyenumerable ones, as a Turing machine can). Another way to say this is that regexes can only recognise languages in NSPACE(n). Thought of in this way, it's not so surprising: A regex has finitely many capturing groups, say k of them, each of which can be stored in space equal to the length, n, of the input (since they will, after all, only contain substrings of the input); so we only need space k*n to perform the ‘superDFA’ part of Perl regex matching ^{**}. I suspect this upper bound is far too from sharp, and that Perl5.8 regexes can't recognise balanced parentheses, which are only context free. What if we fix this by upgrading to Perl5.10 regexes, which allow recursion? Well, then we can obviously at least recognise balanced parentheses, but I suspect that we're still not Turingcomplete—indeed, still only context sensitive. Maybe someone more versed in CS than am I can confirm this.
Well, so how about the question which I understood in the first place: Can Perl substitutions be used to perform any computation that a Turing machine can, viewing the string on which they act as the input tape? Again: No, sort of ^{*}. After all, a substitution consists just of a match—which can be completed by an LBA, as above—followed by the assembly of the substitution string—which, again, involves only constant overhead (the literal, or constant, part of the match) over a constant multiple of the input (the part coming from interpolation of capture variables—any other variables are not affected by the match, and so may as well be regarded as constants). Thus, substitutions can still handle only CSLs.
Well, that's a little anticlimactic. So, what is true? Well, recursive Perl substitutions are Turing complete. As I pointed out elsewhere in the linked thread, that's known; but, hey, if I can figure out in a month's work what's obvious to MJD, then I figure that I'm doing all right. :)
OK, so, that was a post full of negatives followed by a bold assertion; let's substantiate it. The obvious way to show Turing completeness is to simulate a Turing machine, but, man, that's been done to death. I heard the plea of cdiggins on LtU, who wondered why everyone prefers the λ calculus over the SKI combinatorcalculus, and determined that I would not be among those hip and trendy kids who think only of the former. I quickly came up with:
which does a great job of reducing (fully parenthesised) terms from the calculus ^{****}, but is a little unsatisfying because it requires recursive regexes, which I've just said that I don't know aren't already Turing complete. All right, so, moving on.perl pe ' BEGIN { my $term = qr/(SK\((?1)(?1)\))/ } 1 while s/\(\(\(S($term)\)($term)\)($term)\)/(($1$3)($2$3))/ or s/\(\(K($term)\)($term)\)/$2/; '
Well, we know that Conway's game of life is Turing complete, so maybe that's a good place to start; but it's played out on a 2dimensional grid, and translating that 2dimensional grid into the 1dimensional ‘tape’ for a regex seemed to be a little hairy. So, a 1dimensional version—well, that sounds like cellular automata. Everybody's favourite new kind of scientist, who studied these extensively, has a favourite—which we know, thanks to Matthew Cook, is Turing complete. So, let's simulate that! The problem is that there are 8 different rules for cell propagation, and running 8 different substitutions seems sort of profligate; can we get it down to 2, as we did for the SKI combinatorcalculus above? If (as I did) we get stuck on that, can we at least show that, if we could do it, we'd have got an optimal result—that is, that a 1regex substitution can never be Turing complete? I spent a few sleepless nights on this; I was sure that it wasn't possible—after all, one of Turing's criteria for a universal computer is that it should be able to make choices; but our subPerl5.10 regexes can't, so we should have to build in at least one choice “manually”, on the level of multiple regexes. How to prove it, though? I thought that I could come up with some argument from Kolmogorov complexity, but no luck—because, it finally occurred to me, I was guilty of the same blinkered thinking as the designers of the early, “fixedprogram” computers, and was trying to shove the entire program into the architecture. Why not instead let the program be recorded in the data?
Out of that idea came the following code for implementing rule 110 (or, rather, rule 124, which propagates to the right instead of the left):
(The heredoc syntax is fake—at least for me, under bash, perl complains No code specified for e—but I assume that someone out there knows how to fix it.) This accepts input a string of 0s and 1s, which is regarded as infinitely padded with 0s on both sides, and prints out its successive evolutes under rule 124. For example, here's a sample run, showing the evolution of the input 1 (which stands for ...0001000...):perl Mre=eval nE <<CODE chomp; $_ = <<RULE . "\n$_"; ^111 0^11 ^110 1^10 ^11 1^1 ^01 1^1 ^10 1^0 ^00 0^0 ^1 1 ^0 1 1^1 0 0^0 RULE BEGIN { my $nc_rule = qr/^ *\S+ *\S* *$/m; my $c_rule = qr/^ *(\S+) *(\S*) *$/m; } /\^.*$/ or say( ( split /\n/ )[1] ) while s/$c_rule(?:\n$nc_rule)*\n\n.*?\K\1/$2/; CODE
(It goes on forever, so don't wait for the surprise finish. :) See the list of projects below!)1 11 111 1011 11111 100011 1100111 11101011 101111111 1110000011 10110000111 111110001011 1000110011111 11001110100011 111010111100111 ...
Emboldened by this success, I thought that perhaps I could tackle the original problem of reduction for the SKI combinatorcalculus. To keep things simple (well, for me), I decided to use the Unlambda syntax, whereby ` is used as the notation for application, so that (SK) becomes `SK. I also wanted to include support for variables; again to keep life simple, I stuck with the austere CSer's set x, x', …. The code is as follows:
This accepts strings according to the grammarperl Mre=eval nE <<CODE chomp; $_ = <<RULE . "\n$_"; + ^'>'+ ^ + ^> ^ . ^`>`.. ^ . ^S>S ^ . ^K>K ^ . ^x>x+ ^  ^'> ^  ^> ^ ! ^`>!! ^ ! ^S> ^ ! ^K> ^ ! ^x> ^ , ; ^'>', '; ^ , ; ^> ^ @ * ^`>`@@ `** ^ @ * ^S>S S ^ @ * ^K>K K ^ @ * ^x>x, x; ^ ^> ```S>``.@`.*^ ``K>.!^ RULE BEGIN { my $vs = qr/[^ >\n]*/; my $ss = qr/ *?/; my $c_rule = qr/(?>($vs)$ss($vs)$ss($vs)$ss>$ss($vs)$ss($vs)$ss($v +s)$)/m; my $nc_rule = qr/(?>(?:$vs$ss){3}>(?:$ss$vs){3}$)/m; } 1 while s/$c_rule(?:\n+$nc_rule)*\n\n.*?\K\1(.*?)\2(.*?)\3/$4$7$5$8$6/; say ( ( split /\n/ )[1] ); CODE
where t is the start symbol; p is another nonterminal; and S, K, x, ', and ` are terminals. It then reduces them according to the rules of the SKI calculus. Here is a sample run, showing that it correctly reduces the basic combinators in the BCKW calculus (the comments are just that, comments; they are not actually typed):t ::= S t ::= K t ::= xp p ::= 'p p ::= t ::= `tt
This handles only textual, not functional, equivalence; for example, it correctly reduces ```SKKx to x, but it does not recognise that ``SK``SKK can be replaced by ``SKK.`````S`KSKxx'x'' # B = S(KS)K `x`x'x'' # Bxyz = x(yz) `````S``S`K``S`KSKS`KKxx'x'' # C = S(S(K(S(KS)K))S)(KK) = S(S(KB)S)(KK +) ``xx''x' # Cxyz = xzy ````SS`K``SKKxx' # W = SS(K(SKK)) ``xx'x' # Wxy = xyy
Note, incidentally, that, if we had genuine variablewidth lookbehind (instead of simulating it with \K), then we could get rid of Perl control structures and just hoist this into a single substitution, at the expense of considerably more wasted space; 1 while s/$orig/$repl/ would become s/(?<=0$orig)/0$repl/g, where 0 could be any symbol that does not occur in the input or any of our productions.
Some projects:
 Formalise the idea that, even with recursion, regexes aren't Turing complete.
 Give the cellular automaton a way to halt, probably by allowing the user to enter a “stop pattern” and halting when that stop pattern is an initial segment of the tape.
 I'm actually using very little of the power of Perl regexes in the recursive substitutions. Do we need anything more than honesttogoodness regular expressions? We can get rid of \K as described above. I think that we can get rid of capturing groups by ‘sliding’ each proposed match, together with its proposed replacement, along the text until we find an actual match (kind of like the Boyer–Moore algorithm?).
 A related one: As things stand, it is essential to use nongreedy quantifiers and to rely on the semantics of Perl regex matching in some places, to ensure that the right rule is chosen, and applied in the proper place. (That is, the ordering of replacement rules, and the particular spot at which the replacement is applied, are both significant.) Choose a better set of replacement rules, so that we do not have this dependence. (This is probably mostly a matter of using more symbols, so that we can tell, for example, the first . in the template for ```Sttt apart from the second.)
 Set up a recursive substitution to recognise functional equivalence of combinators (so that it could identify ``SKK and ``SKS as functionally equivalent).
 Actually implement a universal Turing machine this way!
^{*} What does ‘sort of’ mean? An obvious loophole is that, any time you can embed arbitrary Perl code, you've got as much computational power as Perl—that is to say, Turing completeness. Thus, the (?{ }) and (??{ }) escapes are right out—which is just as well, since they're considered harmful experimental anyway. Similarly, there'll be no naughty /e switch when we talk about substitutions. To be perfectly precise, I view regexes as formed from atoms by successive iterations of concatenation, alternation, Kleene star (greedy, nongreedy, or possessive), and positive and negative lookaround assertions, where an atom is a single character (i.e., EXACT<c> for some c) or a backreference ^{***}. See “What do you mean, done?” in MJD's TPJ regex article for an explanation of how most Perl regular expression notations are just syntactic sugar around these primitives.
^{**} This argument would go out the window if Perl capturing groups could capture more than once—that is, if 'ab' =~ /(.)*/ could somehow record that it had matched both a and b—which might just explain why this feature for which I've often wished isn't present. :) Anyway, even in that case, an argument by structural recursion on the regex would show that it must always halt (given the Perl optimisation whereby a 0width match is not tried more than once), so that such ‘enhancedcapture’ regexes could still recognise only recursive languages.
^{***} This definition is actually a little worrying, because the meaning of the atom \1 (which, by itself, never matches anything, per a discussion in the CB yesterday with ambrus, bart, ikegami, moritz, and ysth) changes when it enters into a concatenation; but I leave the details to someone else to work out. :)
^{****} You might worry that I've left out I, but it's not necessary, since it's functionally equivalent to SKK.
UPDATES: • Sorry, forgot the <readmore>. • Added another project. • Oops, I'd overescaped the perlre links. I think that they should work now (except for the “subPerl5.10” one anchored at (?(condition)yespatternnopattern), which I can't figure out how to handle, because of the ), but please leave a comment or PM me if any of the links are broken.
UPDATE: I realised that it might be a bit unclear how actually to use the provided code, so I added some sample runs.
UPDATE: I didn't know about Are Perl patterns universal? when I posted, but it looks like I'm at least 5 years late to the party—even the apparently very specialised Rule110forTuringcompleteness party. (Combinators and other esoterica made an appearance, too.) In fact, blokhead already observed what is in some sense the central point, that a regex can be simulated by an LBA.


Replies are listed 'Best First'.  

Re: Turing completeness and regular expressions
by blokhead (Monsignor) on Nov 28, 2009 at 17:31 UTC  
by JadeNB (Chaplain) on Nov 28, 2009 at 17:41 UTC  
by blokhead (Monsignor) on Nov 28, 2009 at 17:52 UTC  
by JadeNB (Chaplain) on Nov 28, 2009 at 18:00 UTC  
by blokhead (Monsignor) on Nov 28, 2009 at 18:20 UTC  
 
by JadeNB (Chaplain) on Dec 03, 2009 at 16:13 UTC 