
[ UPDATE 27 Aug 2009: Readers interested in the topic of this node will want to look first (or instead) at the series of three articles I wrote for The Perl Review, now available online. They lay this proof out more carefully and with thorough explanations, in three different versions. ]

[ At this point this post should be considered mainly of historical interest. One especial defect is that it frames the issue in terms of "static parsing", implying that there are no similar issues with "dynamic" parsing. ]

In the man page for PPI, Adam Kennedy conjectures that perl is unparseable, and suggests how to prove it. Below I carry out a rigorous version of the proof, which should put the matter beyond doubt.

I've become interested in the question because I've just released an alpha version of a general parser (Parse::Marpa) on CPAN, which I think will allow static parsing of large portions of Perl 5, and I wanted to know what is achievable. Parse::Marpa accepts any BNF that's free of infinite loops. The BNF can be recursive, have empty productions or even be ambiguous. If Marpa works for parsing Perl 5, it will do it with a lot less cruft than ad hoc solutions. Parse::Marpa is based on new research into combining LR(0) precomputation with Earley's algorithm and so far speed seems good -- quite acceptable for utility purposes.

For those not familiar with the history of this discussion, the term "parse" here is being used in its strict sense to mean static parsing -- taking a piece of code and determining its structure without executing it. In that strict sense the Perl program does not parse Perl. The Perl program executes Perl code, but does not determine its structure. Adam Kennedy gives a good account of the discussion. Randal Schwartz played a key role in it, and one of his perlmonks nodes is pivotal.

Static parsing of Perl 5 is of a lot more than academic interest, as Adam Kennedy shows. It is needed for automated documentation tools, analyzers like Perl::Critic, presentation tools, automatic transformation of Perl code, etc.

The proof which follows meets the current level of rigor in Theory of Computation, but is written using Perl and Perl notation. That would make the following unacceptable to a math journal, but they wouldn't take it anyway, because the theorem is a very straightforward consequence of Rice's Theorem.

Theorem: Parsing Perl 5 is Undecidable

We first establish Adam Kennedy's conjecture as a lemma. The proof will follow immediately from that and the Halting Theorem.

Kennedy's Lemma: If you can parse Perl, you can solve the Halting Problem.

To prove Kennedy's Lemma, we assume that we can parse Perl. In particular this means we can take the following devilish snippet of code, concocted by Randal Schwartz, and determine the correct parse for it:

whatever / 25 ; # / ; die "this dies!";

Schwartz's Snippet can parse two different ways: if whatever is nullary (that is, takes no arguments), the first statement is a division in void context, and the rest of the line is a comment. If whatever takes an argument, Schwartz's Snippet parses as a call to the whatever function with the result of a match operator, then a call to the die() function.
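Both parses can be observed directly. The following is a minimal sketch, not part of the original argument: the package names and the body of each whatever() are made up for illustration, and the snippet is compiled with a string eval so that each case gets its own declaration. In the first case the snippet's value is captured in a variable to make the division visible.

```perl
use strict;
use warnings;

# Parse 1: whatever() is nullary, so "/ 25" is a division and everything
# after "#" is a comment.
my $as_division = eval q{
    package Demo::Nullary;
    sub whatever() { 100 }
    my $r = whatever / 25 ; # / ; die "this dies!";
    $r;
};

# Parse 2: whatever() takes one argument, so "/ 25 ; # /" is a match
# operator whose result is that argument, and die() is live code.
my $as_match = eval q{
    package Demo::Unary;
    sub whatever($) { 1 }
    local $_ = '';    # give the match something defined to match against
    whatever / 25 ; # / ; die "this dies!";
    'not reached';
};
my $error = $@;

print "nullary: $as_division\n";
print "unary:   $error";
```

The same line of source text yields a number in one case and an exception in the other, depending only on the prototype in effect when the line is compiled.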

This means that, in order to statically parse Perl, it must be possible to determine from a string of Perl 5 code whether it establishes a nullary prototype for the whatever subroutine. Since we've assumed we can parse Perl, we can assume that a subroutine to do this exists. Call it is_whatever_nullary(): it takes a Perl 5 code string as its only argument, and returns true if and only if that code string establishes a nullary prototype for the whatever subroutine.
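To make "establishes a nullary prototype" concrete, here is a sketch of a check that answers the question by running the code. It is emphatically not is_whatever_nullary(): the helper name establishes_nullary_by_running() and the throwaway Sandbox packages are assumptions made for this illustration, and because the helper executes its argument it never returns if that code never finishes.

```perl
use strict;
use warnings;

my $counter = 0;

# Hypothetical helper: compiles and RUNS the code string in a fresh
# package, then inspects the prototype of whatever() in that package.
sub establishes_nullary_by_running {
    my ($code) = @_;
    my $pkg = 'Sandbox' . ++$counter;    # fresh package for each call
    eval "package $pkg; $code";          # errors in $code are ignored
    my $proto = prototype("${pkg}::whatever");
    # An empty-string prototype means "takes no arguments"; undef means
    # the sub is missing or was declared without a prototype.
    return (defined $proto && $proto eq '') ? 1 : 0;
}

print establishes_nullary_by_running('sub whatever() {}'),  "\n";
print establishes_nullary_by_running('sub whatever($) {}'), "\n";
print establishes_nullary_by_running('sub whatever {}'),    "\n";
```

A true static is_whatever_nullary() would have to give the same answers without executing anything, which is exactly what the rest of the proof shows to be impossible in general.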

To drag the Halting Theorem into this, we'll need to simulate a Turing machine or its equivalent. It's very evident that Perl 5 is Turing-complete. No referee at a math journal would require something that obvious and that tedious to be proved. The term used in these cases is "left as an exercise to the reader". But in this case, there is an Acme::Turing, so the exercise apparently has already been done.

We wrap the Turing machine simulator of our choice in a routine that takes two strings as its arguments, and treats the first string as the representation of a Turing machine, and the second as its input. Call this run_turing_machine.
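One possible shape for such a routine is sketched below. The string encoding is invented for this example and is not the interface of Acme::Turing: rules are separated by semicolons, each rule reads "state symbol new_symbol move new_state", the first rule names the start state, "_" is the blank symbol, and the machine stops in the state "halt" or when no rule applies.

```perl
use strict;
use warnings;

# Sketch of run_turing_machine() under the made-up encoding described above.
# Returns the final tape contents with surrounding blanks trimmed.
sub run_turing_machine {
    my ($machine, $input) = @_;

    my (%rule, $state);
    for my $line (grep { /\S/ } split /;/, $machine) {
        my ($from, $read, $write, $move, $to) = split ' ', $line;
        $rule{$from}{$read} = [ $write, $move, $to ];
        $state //= $from;                # first rule names the start state
    }

    my @tape = length $input ? split(//, $input) : ('_');
    my $pos  = 0;

    while ($state ne 'halt') {
        my $action = $rule{$state}{ $tape[$pos] } or last;   # no rule: stop
        my ($write, $move, $to) = @$action;
        $tape[$pos] = $write;
        $pos++ if $move eq 'R';
        $pos-- if $move eq 'L';
        if ($pos < 0) { unshift @tape, '_'; $pos = 0 }   # grow tape leftwards
        push @tape, '_' if $pos > $#tape;                # grow tape rightwards
        $state = $to;
    }

    my $out = join '', @tape;
    $out =~ s/^_+|_+$//g;
    return $out;
}

# Unary increment: skip over the 1s, then write one more 1 and halt.
print run_turing_machine('q0 1 1 R q0; q0 _ 1 N halt', '111'), "\n";
```

A machine such as 'q0 _ _ R q0' run on empty input never returns from this routine, which is the behaviour the proof depends on.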

Now we write a routine, call it halts(), which takes the description of a Turing machine and its input. We have it create (but not run) a Perl 5 code string to run the Turing machine simulator on the machine description and input from our two arguments, and then establish a nullary prototype for whatever. We next ask is_whatever_nullary() whether the nullary prototype for whatever was established. Our halts() routine might look like this:

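sub halts {
    my $machine = shift;
    my $input = shift;
    # The prototype is declared via a string eval so that it is established
    # only if the simulation returns; a plain named sub would be declared
    # when the BEGIN block is compiled, before the simulation runs.
    my $code_string_to_analyze = qq{
        BEGIN {
            run_turing_machine("\Q$machine\E", "\Q$input\E");
            eval 'sub whatever() {}';
        }
    };
    return is_whatever_nullary($code_string_to_analyze);
}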
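The \Q...\E pairs in halts() matter because $machine and $input are pasted into generated source code as double-quoted string literals. A small sketch of that quoting step, using a made-up machine description that contains characters which would otherwise break or alter the generated string:

```perl
use strict;
use warnings;

# A description containing a quote, a sigil and a backslash.
my $machine = 'q0 " $x \ halt';

# Build a tiny code string the same way halts() does, then evaluate it
# to recover the value the generated code would see.
my $code       = qq{ my \$m = "\Q$machine\E"; \$m };
my $round_trip = eval $code;

print "generated code: $code\n";
print "round trip ok\n" if $round_trip eq $machine;
```

Without \Q and \E, the embedded double quote would end the literal early and the generated code would not mean what halts() intends.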

$code_string_to_analyze is passed as an argument to is_whatever_nullary(), which claims to be able to figure out, somehow, whether the nullary whatever prototype is established. is_whatever_nullary() does not necessarily run $code_string_to_analyze. In fact, if the Turing machine simulation does not halt, is_whatever_nullary() cannot simply run $code_string_to_analyze and still live up to the assumption that it always tells us whether the prototype is established. It must somehow figure out for itself when $machine does not halt with $input. The statement that establishes the nullary prototype comes immediately after the simulation, so if $machine halts with $input, the prototype is established and is_whatever_nullary() will return true. If $machine does not halt with $input, the statement establishing the nullary whatever prototype will never be reached, and is_whatever_nullary() must return false.
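The mechanism this argument rests on, that code executed in a BEGIN block can change how the source text after it parses, can be seen in a small experiment. This is a sketch only: a short counting loop stands in for a Turing machine that halts, the package name is made up, and the declaration is done with a string eval so that it takes effect when the BEGIN block runs rather than when the block is compiled.

```perl
use strict;
use warnings;

my $result = eval q{
    package Demo::Begin;
    BEGIN {
        my $n = 0;
        $n++ while $n < 1000;             # stand-in for a machine that halts
        eval 'sub whatever() { 100 }';    # reached only because the loop ended
    }
    # By the time this line is compiled, whatever() is nullary, so it
    # parses as a division followed by a comment.
    my $r = whatever / 25 ; # / ; die "this dies!";
    $r;
};

print "parsed as division, result $result\n";
```

If the loop were replaced by a computation that never finishes, compilation would never get past the BEGIN block, and no amount of running the code would reveal how the final line parses.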

So, given the assumption that we can parse Perl, halts() returns true if and only if the Turing machine $machine halts with input $input. In other words, halts() solves the Halting Problem. Kennedy's Lemma was that, if you can parse Perl, you can solve the Halting Problem. So this proves Kennedy's Lemma.

It's well known that the Halting Problem cannot be solved. Kennedy's Lemma establishes that if we can parse Perl 5, we can solve the Halting Problem. Therefore we cannot parse Perl 5.

QED

[ UPDATE: Presentation improved based on feedback from tye. ]

In reply to Perl Cannot Be Parsed: A Formal Proof by Jeffrey Kegler
