perlmeditation
JadeNB
<p><small>This was originally intended to be a [Cool Uses For Perl|CUFP], but I thought that the story behind the problem was at least as interesting as the solution; and, the more I wrote of the story, the more [Meditations|meditative] it seemed. (I think that, even if the specific code doesn't catch your attention much, at least one of the linked topics might!) I hope that this was the right place for it to end up.</small></p>
<p>The question has (implicitly) rung out, on [id://802914|Perlmonks], [http://www.reddit.com/r/programming/comments/a4kze/im_not_sure_if_this_guys_thinks_parsing_html_with/c0ftjif|Proggit], and, one must presume, the haunted nightmares of many: Are Perl regexes Turing complete?</p>
<p>This question caught the eye of yours truly, except that I confused the Turing completeness of regexes with the Turing completeness of substitutions, which led to [id://806115|an apparently unmotivated question] and which required [id://806263|ikegami] to set me straight.</p>
<p>Now I think that I've got the question: Can Perl regexes be used to recognise an arbitrary [http://qwiki.stanford.edu/wiki/Complexity_Zoo:R#re|recursively enumerable language]? The stirring answer: <b>No</b>, <small>sort of</small> <sup>*</sup>.</p>
<readmore><p>Why not? Well, a regex can be simulated by a [wp://linear bounded automaton|linear-bounded automaton] (that's the hyphenation, and I'll stick by it), and these can recognise only the [http://qwiki.stanford.edu/wiki/Complexity_Zoo:C#csl|context-sensitive languages] (not all the [http://qwiki.stanford.edu/wiki/Complexity_Zoo:R#re|recursively-enumerable] ones, as a Turing machine can). Another way to say this is that regexes can only recognise languages in [http://qwiki.stanford.edu/wiki/Complexity_Zoo:N#nspace|NSPACE(n)]. Thought of in this way, it's not so surprising: A regex has finitely many capturing groups, say <c>k</c> of them, each of which can be stored in space equal to the length, <c>n</c>, of the input (since they will, after all, only contain substrings of the input); so we only need space <c>k*n</c> to perform the ‘super-[wp://DFA]’ part of Perl regex matching <sup>**</sup>. I suspect this upper bound is far <strike>too</strike> from sharp, and that Perl-5.8 regexes can't recognise balanced parentheses, which are only [http://qwiki.stanford.edu/wiki/Complexity_Zoo:C#cfl|context free]. What if we fix this by upgrading to Perl-5.10 regexes, which allow [doc://perlre#(?PARNO)-(?-PARNO)-(?+PARNO)-(?R)-(?0)
|recursion]? Well, then we can obviously at least recognise balanced parentheses, but I suspect that we're still not Turing-complete—indeed, still only context sensitive. Maybe someone more versed in CS than am I can confirm this.</p>
<p>Well, so how about the question which I understood in the first place: Can Perl substitutions be used to perform any computation that a Turing machine can, viewing the string on which they act as the input tape? Again: <b>No</b>, <small>sort of</small> <sup>*</sup>. After all, a substitution consists just of a match—which can be completed by an LBA, as above—followed by the assembly of the substitution string—which, again, involves only constant overhead (the literal, or constant, part of the match) over a constant multiple of the input (the part coming from interpolation of capture variables—any other variables are not affected by the match, and so may as well be regarded as constants). Thus, substitutions can still handle only CSLs.</p>
<p>Well, that's a little anti-climactic. So, what <em>is</em> true? Well, <em>recursive</em> Perl substitutions <em>are</em> Turing complete. As I pointed out [id://802614|elsewhere in the linked thread], that's known; but, hey, if I can figure out in a month's work what's obvious to [Dominus|MJD], then I figure that I'm doing all right. :-)</p>
<p>OK, so, that was a post full of negatives followed by a bold assertion; let's substantiate it. The obvious way to show Turing completeness is to simulate a Turing machine, but, man, that's been done to death. I heard the plea of [http://lambda-the-ultimate.org/node/1864|cdiggins] on LtU, who wondered why everyone prefers the [wp://Lambda calculus|λ calculus] over the [wp://SKI combinator calculus|SKI combinator-calculus], and determined that I would not be among those hip and trendy kids who think only of the former. I quickly came up with:
<code>
perl -pe '
BEGIN { my $term = qr/(S|K|\((?-1)(?-1)\))/ }
1 while
s/\(\(\(S($term)\)($term)\)($term)\)/(($1$3)($2$3))/
or s/\(\(K($term)\)($term)\)/$2/;
'
</code>
which does a great job of reducing (fully parenthesised) terms from the calculus <sup>****</sup>, but is a little unsatisfying because it requires recursive regexes, which I've just said that I don't know aren't already Turing complete. All right, so, moving on.</p>
<p>Well, we know that [wp://Conway's game of life] is Turing complete, so maybe that's a good place to start; but it's played out on a 2-dimensional grid, and translating that 2-dimensional grid into the 1-dimensional ‘tape’ for a regex seemed to be a little hairy. So, a 1-dimensional version—well, that sounds like [wp://cellular automata]. Everybody's favourite [wp://A New Kind of Science|new kind of scientist], who studied these extensively, has a [wp://Rule 110|favourite]—which we know, thanks to Matthew Cook, is [http://en.wikipedia.org/wiki/Rule_110#The_proof_of_universality|Turing complete]. So, let's simulate that! The problem is that there are 8 different rules for cell propagation, and running 8 different substitutions seems sort of profligate; can we get it down to 2, as we did for the SKI combinator-calculus above? If (as I did) we get stuck on that, can we at least show that, if we could do it, we'd have got an optimal result—that is, that a 1-regex substitution can never be Turing complete? I spent a few sleepless nights on this; I was sure that it wasn't possible—after all, one of Turing's criteria for a universal computer is that it should be able to make choices; but our [doc://perlre#(?(condition)yes-pattern%7cno-pattern)
|sub-Perl-5.10] regexes can't, so we should have to build in at least one choice “manually”, on the level of multiple regexes. How to prove it, though? I thought that I could come up with some argument from [wp://Kolmogorov complexity], but no luck—because, it finally occurred to me, I was guilty of the same blinkered thinking as the designers of the [http://en.wikipedia.org/wiki/Von_Neumann_architecture#Description|early, “fixed-program” computers], and was trying to shove the entire program into the architecture. Why not instead let the program be recorded in the data?</p>
<p>Out of that idea came the following code for implementing rule 110 (or, rather, rule 124, which propagates to the right instead of the left):
<code>
perl -Mre=eval -nE <<CODE
chomp;
$_ = <<RULE . "---\n$_";
^111 0^11
^110 1^10
^11 1^1
^01 1^1
^10 1^0
^00 0^0
^1 1
^0
1 1^1
0 0^0
RULE
BEGIN {
my $nc_rule = qr/^ *\S+ *\S* *$/m;
my $c_rule = qr/^ *(\S+) *(\S*) *$/m;
}
/\^.*$/ or say( ( split /\n/ )[-1] ) while
s/$c_rule(?:\n$nc_rule)*\n---\n.*?\K\1/$2/;
CODE
</code>
(The here-doc syntax is fake—at least for me, under <c>bash</c>, <c>perl</c> complains <c>No code specified for -e</c>—but I assume that someone out there knows how to fix it.) This accepts input a string of <c>0</c>s and <c>1</c>s, which is regarded as infinitely padded with <c>0</c>s on both sides, and prints out its successive evolutes under rule 124. For example, here's a sample run, showing the evolution of the input <c>1</c> (which stands for <c>...0001000...</c>):
<code>
1
11
111
1011
11111
100011
1100111
11101011
101111111
1110000011
10110000111
111110001011
1000110011111
11001110100011
111010111100111
...
</code>
(It goes on forever, so don't wait for the surprise finish. :-) See the list of projects below!)
</p>
<spoiler><p>The basic idea is to grab a pair <c>orig repl</c> from the list at the beginning, search for <c>orig</c> in the input, and replace it by <c>repl</c>. It's quite important for this particular set-up that the semantics of Perl regex matching are such that we grab the <em>first</em> pair possible.</p>
<p>The effect in our case is to move a cursor, <c>^</c>, rightwards along the string of <c>0</c>s and <c>1</c>s. (The last 2 rules tell us to create a cursor at the far left if there isn't one already.) At every point, the character immediately to the right of the cursor tells us what <em>used</em> to be to the left of the cursor. (The input is changing as the cursor moves along it, so we need a way to record it), and the next character down tells us which character we are currently considering. The replacement rule takes care of replacing the character under consideration, moving the cursor, and updating its history. Thus, each substitution moves the cursor by one character. When the cursor reaches the right edge of the string, we extend it (if it ended with a <c>1</c>) or truncate it (if it ended with a <c>0</c>, which we are now regarding as part of the ‘padding’), then remove the cursor. The pre-<c>while</c> statement prints the mutated form of the input only when there is no cursor, i.e., when we have completed a pass over the input (which happens after many substitutions).</p>
<p>We require Perl 5.10 (via the <c>-E</c> switch) only for [doc://say|<c>say</c>], which is obviously inessential, and [doc://perlre#(?<=pattern)-\K|<c>\K</c>]; and <c>\K</c>, in turn, is just for efficiency, to avoid wrapping a big capturing group around everything preceding and just copying it over <i>literatim</i>.</spoiler>
<p>Emboldened by this success, I thought that perhaps I could tackle the original problem of reduction for the SKI combinator-calculus. To keep things simple (well, for me), I decided to use the [http://www.madore.org/~david/programs/unlambda|Unlambda] syntax, whereby <c>`</c> is used as the notation for application, so that <c>(SK)</c> becomes <c>`SK</c>. I also wanted to include support for variables; again to keep life simple, I stuck with the austere CSer's set <c>x</c>, <c>x'</c>, …. The code is as follows:
<code>
perl -Mre=eval -nE <<CODE
chomp;
$_ = <<RULE . "---\n$_";
+ ^'>'+ ^
+ ^> ^
. ^`>`.. ^
. ^S>S ^
. ^K>K ^
. ^x>x+ ^
- ^'>- ^
- ^> ^
! ^`>!! ^
! ^S> ^
! ^K> ^
! ^x>- ^
, ; ^'>', '; ^
, ; ^> ^
@ * ^`>`@@ `** ^
@ * ^S>S S ^
@ * ^K>K K ^
@ * ^x>x, x; ^
^>
```S>``.@`.*^
``K>.!^
RULE
BEGIN {
my $vs = qr/[^ >\n]*/;
my $ss = qr/ *?/;
my $c_rule = qr/(?>($vs)$ss($vs)$ss($vs)$ss>$ss($vs)$ss($vs)$ss($vs)$)/m;
my $nc_rule = qr/(?>(?:$vs$ss){3}>(?:$ss$vs){3}$)/m;
}
1 while
s/$c_rule(?:\n+$nc_rule)*\n---\n.*?\K\1(.*?)\2(.*?)\3/$4$7$5$8$6/;
say ( ( split /\n/ )[-1] );
CODE
</code>
This accepts strings according to the grammar
<code>
t ::= S
t ::= K
t ::= xp
p ::= 'p
p ::=
t ::= `tt
</code>
where <c>t</c> is the start symbol; <c>p</c> is another non-terminal; and <c>S</c>, <c>K</c>, <c>x</c>, <c>'</c>, and <c>`</c> are terminals. It then reduces them according to the rules of the SKI calculus. Here is a sample run, showing that it correctly reduces the basic combinators in the [wp://B,C,K,W system|<c>BCKW</c> calculus] (the comments are just that, comments; they are not actually typed):
<code>
`````S`KSKxx'x'' # B = S(KS)K
`x`x'x'' # Bxyz = x(yz)
`````S``S`K``S`KSKS`KKxx'x'' # C = S(S(K(S(KS)K))S)(KK) = S(S(KB)S)(KK)
``xx''x' # Cxyz = xzy
````SS`K``SKKxx' # W = SS(K(SKK))
``xx'x' # Wxy = xyy
</code>
This handles only textual, not functional, equivalence; for example, it correctly reduces <c>```SKKx</c> to <c>x</c>, but it does not recognise that <c>``SK``SKK</c> can be replaced by <c>``SKK</c>.</p>
<spoiler><p>The idea here is the same as before, except that we are now looking for rules of the form <c>orig1 orig2 orig3 > repl1 repl2 repl3</c>, where some of the <c>orig</c>s and some of the <c>repl</c>s may be empty. We look for [wp://Reduction strategy|redexes] (<em>not</em> regexes!) <c>```Sttt</c> or <c>``Ktt</c>, where <c>t</c> stands for arbitrary terms (that's what the last 2 rules are doing); when we find one, we mark it in the text with a <c>^</c>, and precede it with a template for its reduction. Namely,
<ul>
<li><c>```Sxyz</c> reduces to <c>``xz`yz</c>; we indicate this with the patern <c>``.@`.*</c>, with the understanding that we are to fill in the leftmost <c>.</c> with the first term that we encounter. When we fill in both <c>.</c>s this way, we look for another term, and copy it into the <c>@</c> and <c>*</c> slots simultaneously.</li>
<li><c>``Kxy</c> reduces to <strike><c>y</c></strike> <c>x</c>; we indicate this with the pattern <strike><c>!.</c></strike> <c>.!</c>, with the understanding that we are to <strike>delete the next term we read, then fill in the <c>.</c> with the term after that</strike> fill in the <c>.</c> with the next term we read, then delete the term after that.</li>
</ul>
Once we have found and marked a redex, we need to read in terms. Notice that we perform no error checking; we assume that what we've been given is syntactically legal. Based on this assumption, we read off terms recursively by saying that <c>S</c> and <c>K</c> are terms, <c>x</c> followed by any string of <c>'</c>s is a term, and, if we read <c>`</c>, then we need to look for 2 more terms. As we read the terms, we copy them into the template for the redex, as indicated above. Finally, once the template is filled (as indicated by their being no more rules to apply), we apply the 3rd-to-last rule, which just deletes the redex marker <c>^</c>. We then loop until we're done.</p>
<p>I find it fun to watch this in action, rather than just seeing the end result. If you like that, too, then you can replace <c>1 while ...</c> by <c>say ( ( split /\n/ )[-1} ) while ...</c>.</p></spoiler>
<p>Note, incidentally, that, if we had genuine variable-width look-behind (instead of simulating it with <c>\K</c>), then we could get rid of Perl control structures and just hoist this into a single substitution, at the expense of considerably more wasted space; <c>1 while s/$orig/$repl/</c> would become <c>s/(?<=0$orig)/0$repl/g</c>, where <c>0</c> could be any symbol that does not occur in the input or any of our productions.</p>
<p>Some projects:
<ul>
<li>Formalise the idea that, even with recursion, regexes aren't Turing complete.</li>
<li>Give the cellular automaton a way to halt, probably by allowing the user to enter a “stop pattern” and halting when that stop pattern is an initial segment of the tape.</li>
<li>I'm actually using very little of the power of Perl regexes in the recursive substitutions. Do we need anything more than honest-to-goodness [wp://regular expressions]? We can get rid of <c>\K</c> as described above. I think that we can get rid of capturing groups by ‘sliding’ each proposed match, together with its proposed replacement, along the text until we find an actual match (kind of like the [wp://Boyer-Moore algorithm|Boyer–Moore algorithm]?).</li>
<li>A related one: As things stand, it is essential to use non-greedy quantifiers and to rely on the semantics of Perl regex matching in some places, to ensure that the right rule is chosen, and applied in the proper place. (That is, the ordering of replacement rules, and the particular spot at which the replacement is applied, are both significant.) Choose a better set of replacement rules, so that we do not have this dependence. (This is probably mostly a matter of using more symbols, so that we can tell, for example, the first <c>.</c> in the template for <c>```Sttt</c> apart from the second.)</li>
<li>Set up a recursive substitution to recognise functional equivalence of combinators (so that it could identify <c>``SKK</c> and <c>``SKS</c> as functionally equivalent).</li>
<li>Actually implement a universal Turing machine this way!</li>
</ul></p>
<p><sup>*</sup> What does ‘sort of’ mean? An obvious loophole is that, any time you can embed arbitrary Perl code, you've got as much computational power as Perl—that is to say, Turing completeness. Thus, the <c>(?{ })</c> and <c>(??{ })</c> escapes are right out—which is just as well, since they're [doc://perlre#(?{-code-})|considered <strike>harmful</strike> experimental] anyway. Similarly, there'll be no naughty <c>/e</c> switch when we talk about substitutions. To be perfectly precise, I view regexes as formed from atoms by successive iterations of concatenation, alternation, Kleene star ([doc://perlre#Quantifiers|greedy, non-greedy], or [doc://perlre#(?>pattern)|possessive]), and [doc://perlre#Look-Around-Assertions|positive and negative look-around assertions], where an atom is a single character (i.e., <code>EXACT<c></code> for some <c>c</c>) or a back-reference <sup>***</sup>. See “What do you mean, done?” in MJD's [http://perl.plover.com/Regex/article.html|TPJ regex article] for an explanation of how most Perl regular expression notations are just syntactic sugar around these primitives.
<br /><sup>**</sup> This argument would go out the window if Perl capturing groups could capture more than once—that is, if <c>'ab' =~ /(.)*/</c> could somehow record that it had matched both <c>a</c> and <c>b</c>—which might just explain why this feature for which I've often wished isn't present. :-) Anyway, even in that case, an argument by [wp://structural recursion] on the regex would show that it must always halt (given the Perl [doc://perlre#Repeated-Patterns-Matching-a-Zero-length-Substring|optimisation] whereby a <c>0</c>-width match is not tried more than once), so that such ‘enhanced-capture’ regexes could still recognise only [http://qwiki.stanford.edu/wiki/Complexity_Zoo:R#r|recursive] languages.
<br /><sup>***</sup> This definition is actually a little worrying, because the meaning of the atom <c>\1</c> (which, by itself, never matches anything, per a discussion in the CB yesterday with [ambrus], [bart], [ikegami], [moritz], and [ysth]) changes when it enters into a concatenation; but I leave the details to someone else to work out. :-)
<br /><sup>****</sup> You might worry that I've left out <c>I</c>, but it's not necessary, since it's functionally equivalent to <c>SKK</c>.</p></readmore>
<p>UPDATES: • Sorry, forgot the <c><readmore></c>. • Added another project. • Oops, I'd over-escaped the <c>perlre</c> links. I think that they should work now (except for the “sub-Perl-5.10” one anchored at <c>(?(condition)yes-pattern|no-pattern)</c>, which I can't figure out how to handle, because of the <c>|</c>), but please leave a comment or PM me if any of the links are broken.
<br />
UPDATE: I realised that it might be a bit unclear how actually to <em>use</em> the provided code, so I added some sample runs.
<br />
UPDATE: I didn't know about [id://406253] when I posted, but it looks like I'm at least 5 years late to the party—even the apparently very specialised Rule-110-for-Turing-completeness party. ([id://406400|Combinators and other esoterica] made an appearance, too.) In fact, [id://406290|blokhead] already observed what is in some sense the central point, that a regex can be simulated by an LBA.</p>