I have found documentation on eliminating left-recursion (such as Eliminating Left Recursion in Parse::RecDescent) to be unsatisfactory. Left recursion is usually eliminated at the expense of associativity. This tutorial seeks to address this issue.
The document provides two implementations for every topic covered. The first shows how the topic applies when evaluating the text at parse time. The second shows how the topic applies when building a parse tree. It is probably best to ignore the latter (parse tree creation) until the former (parse-time eval) is understood.
Feedback and criticisms are welcome.
The Perl binary operators + and - have the same precedence, but that doesn't mean they can be evaluated in any order. For example, consider 4 - 5 + 6.
If executed from left-to-right, 4 - 5 + 6 = (4 - 5) + 6 = 5
If executed from right-to-left, 4 - 5 + 6 = 4 - (5 + 6) = -7
Similarly,
If executed from left-to-right, 4 ** 3 ** 2 = (4 ** 3) ** 2 = 4096
If executed from right-to-left, 4 ** 3 ** 2 = 4 ** (3 ** 2) = 262144
Operators which are evaluated from left-to-right are left-associative.
Operators which are evaluated from right-to-left are right-associative.
In Perl, binary operators + and - are left-associative, and binary operator ** is right-associative. (Refer to Operator Precedence and Associativity in perlop for the associativity of other operators.)
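To see the difference concretely, the short script below lets Perl's own parser evaluate both expressions, with and without explicit parentheses. It is only an illustration; the variable names are made up for this example.

```perl
use strict;
use warnings;

# Without parentheses, Perl applies the associativity described above.
my $sum_default = 4 - 5 + 6;       # grouped as (4 - 5) + 6
my $sum_right   = 4 - (5 + 6);     # right-to-left grouping forced by hand

my $pow_default = 4 ** 3 ** 2;     # grouped as 4 ** (3 ** 2)
my $pow_left    = (4 ** 3) ** 2;   # left-to-right grouping forced by hand

print "4 - 5 + 6     = $sum_default\n";   # 5
print "4 - (5 + 6)   = $sum_right\n";     # -7
print "4 ** 3 ** 2   = $pow_default\n";   # 262144
print "(4 ** 3) ** 2 = $pow_left\n";      # 4096
```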
Grammars do not specify associativity. A grammar simply defines whether a given string is valid in the language represented by the grammar, and associativity is not needed for that purpose.
However, we're rarely interested in just checking validity. Parsers that return a parse tree representing the text being parsed, and those that evaluate the text being parsed, are much more useful. Because Parse::RecDescent processes rules from left to right, grammars can be written in a form that lends itself well to these tasks.
Left-associative:
sum : sum /[+-]/ NUM | NUM
Right-associative:
pow : NUM '**' pow | NUM
The following subsections will enrich these grammars with code to build a parse tree and to evaluate the expression at parse-time. As you will see, no changes will be needed to the grammar.
Evaluating at parse time:
Left-associative:
sum : sum '+' NUM { $item[1] + $item[3] }
    | sum '-' NUM { $item[1] - $item[3] }
    | NUM         { $item[1] }
Right-associative:
pow : NUM '**' pow { $item[1] ** $item[3] }
    | NUM          { $item[1] }
Building a parse tree:
Left-associative:
sum : sum /[+-]/ NUM { [ @item[2,1,3] ] }
    | NUM            { [ $item[1] ] }
Right-associative:
pow : NUM '**' pow { [ @item[2,1,3] ] }
    | NUM          { [ $item[1] ] }
There is a catch. The theory is solid, but parsers have limitations.
Productions of the form a : a b are called left-recursive. An entire class of parser generators cannot process left-recursive grammars, and Parse::RecDescent belongs to that class. Unfortunately, the left-associative rules presented so far are left-recursive. The remainder of this section will show methods of removing left-recursion from grammars for Parse::RecDescent.
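To illustrate why left-recursion is a problem for this class of parsers, here is a sketch of what a naive hand-written top-down parser for sum : sum /[+-]/ NUM | NUM would do. This is not Parse::RecDescent's code. The function name parse_sum and the $MAX_DEPTH guard are hypothetical, and the guard exists only so that the demonstration terminates.

```perl
use strict;
use warnings;

# Hypothetical limit so the demonstration stops; a real parser has no such guard.
my $MAX_DEPTH = 50;

# Naive translation of:  sum : sum /[+-]/ NUM | NUM
sub parse_sum {
    my ($text, $pos, $depth) = @_;
    die "gave up after $depth nested calls at position $pos\n"
        if $depth > $MAX_DEPTH;

    # First production: its first item is "sum" itself, so we recurse
    # before consuming a single character. $pos never advances, and the
    # second production (plain NUM) is never reached.
    return parse_sum($text, $pos, $depth + 1);
}

my $ok    = eval { parse_sum('4-5+6', 0, 0); 1 };
my $error = $@;
print "Parser failed: $error" unless $ok;
```

The position stays at 0 on every call, which is exactly the symptom of left-recursion: the parser keeps asking the same question without making progress.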
It's easy to parse 4 - 5 + 6 into the list '4', '-', '5', '+', '6'. The following snippet does so:
sum  : NUM sum_
       { [ $item[1], @{$item[2]} ] }

sum_ : /[+-]/ NUM sum_
       { [ $item[1], $item[2], @{$item[3]} ] }
     | { [] }
If we are evaluating at parse-time, we have little choice but to process the sum as a list rather than a binary operator. When building a parse tree, we have two options. We could leave it as is, or we could convert the list into a tree.
The following subsections show how to evaluate the list and how to treeify it.
Evaluating at parse time:
{
   # Evaluate a flat list such as (4, '-', 5, '+', 6) from left to right.
   sub eval_sum {
      my $acc = shift(@_);
      while (@_) {
         my $op = shift(@_);
         if    ($op eq '+') { $acc += shift(@_); }
         elsif ($op eq '-') { $acc -= shift(@_); }
      }
      return $acc;
   }
}

sum  : NUM sum_
       { eval_sum($item[1], @{$item[2]}) }

sum_ : /[+-]/ NUM sum_
       { [ $item[1], $item[2], @{$item[3]} ] }
     | { [] }

Building a parse tree:
{
   # Fold a flat list into nested [ op, left, right ] nodes, left to right.
   sub treeify {
      my $t = shift(@_);
      $t = [ shift(@_), $t, shift(@_) ] while @_;
      return $t;
   }
}

sum  : NUM sum_
       { treeify($item[1], @{$item[2]}) }

sum_ : /[+-]/ NUM sum_
       { [ $item[1], $item[2], @{$item[3]} ] }
     | { [] }
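The two helpers can be tried outside of a grammar. The sketch below repeats them so that it runs on its own and feeds them the flat list that the sum/sum_ rules build for 4 - 5 + 6. The show function is a hypothetical helper added here only to print the tree in a readable form.

```perl
use strict;
use warnings;

# Same helper as in the grammar: evaluate the flat list left to right.
sub eval_sum {
    my $acc = shift(@_);
    while (@_) {
        my $op = shift(@_);
        if    ($op eq '+') { $acc += shift(@_); }
        elsif ($op eq '-') { $acc -= shift(@_); }
    }
    return $acc;
}

# Same helper as in the grammar: fold the flat list into [ op, left, right ] nodes.
sub treeify {
    my $t = shift(@_);
    $t = [ shift(@_), $t, shift(@_) ] while @_;
    return $t;
}

# Hypothetical pretty-printer: renders a tree with explicit parentheses.
sub show {
    my ($node) = @_;
    return $node unless ref $node;
    my ($op, $lhs, $rhs) = @$node;
    return '(' . show($lhs) . " $op " . show($rhs) . ')';
}

my @list = ('4', '-', '5', '+', '6');
my $tree = treeify(@list);         # [ '+', [ '-', '4', '5' ], '6' ]

print eval_sum(@list), "\n";       # 5
print show($tree), "\n";           # ((4 - 5) + 6)
```

The innermost node holds the left-most operator, which is what left-associativity requires.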
This method is the same as Method 1 (building a flat list and then processing it), but takes advantage of a Parse::RecDescent feature to improve readability. Parse::RecDescent has a pair of directives to help build lists. <leftop> is designed to build left-associative lists, and <rightop> is designed to build right-associative lists.
Evaluating at parse time:
{
   sub eval_sum {
      my $acc = shift(@_);
      while (@_) {
         my $op = shift(@_);
         if    ($op eq '+') { $acc += shift(@_); }
         elsif ($op eq '-') { $acc -= shift(@_); }
      }
      return $acc;
   }
}

sum : <leftop: NUM /[+-]/ NUM>
      { eval_sum(@{$item[1]}) }

Building a parse tree:
{
   sub treeify {
      my $t = shift(@_);
      $t = [ shift(@_), $t, shift(@_) ] while @_;
      return $t;
   }
}

sum : <leftop: NUM /[+-]/ NUM>
      { treeify(@{$item[1]}) }
Normally, information passes from subrule to superrule. For example, in the following code, rule2 receives the result of rule3. In turn, rule1 receives the result of rule2.
rule1: token rule2
rule2: token rule3
rule3: token
The deeper a rule is, the sooner its action is executed. In a list, that means the last (right-most) element encountered will be executed first. With left-associative lists, the opposite is needed: information needs to flow from the superrule to the subrule. Fortunately, Parse::RecDescent provides a means of passing information to subrules: subrule argument lists.
Think of each rule as a function, and of each reference to that rule as a function call. (In fact, this is how the compiled grammars are implemented.) Just like functions can have arguments, so can subrules.
Evaluating at parse time:
sum  : NUM sum_[ $item[1] ]
sum_ : '+' NUM sum_[ $arg[0] + $item[2] ]
     | '-' NUM sum_[ $arg[0] - $item[2] ]
     | { $arg[0] }

Building a parse tree:
sum  : NUM sum_[ $item[1] ]
sum_ : '+' NUM sum_[ [ $item[1], $arg[0], $item[2] ] ]
     | '-' NUM sum_[ [ $item[1], $arg[0], $item[2] ] ]
     | { $arg[0] }
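Since each rule can be thought of as a function, the parse-time eval version can be mimicked with two plain Perl functions. This is only an analogy, not what Parse::RecDescent generates. The names parse_sum and parse_sum_ are hypothetical, and the input is assumed to be already split into tokens.

```perl
use strict;
use warnings;

# Hand-written analogue of:
#   sum  : NUM sum_[ $item[1] ]
#   sum_ : '+' NUM sum_[ $arg[0] + $item[2] ]
#        | '-' NUM sum_[ $arg[0] - $item[2] ]
#        | { $arg[0] }
# Tokens are consumed from the front of the array referenced by $tokens.

sub parse_sum {
    my ($tokens) = @_;
    my $num = shift @$tokens;             # NUM
    return parse_sum_($tokens, $num);     # sum_[ $item[1] ]
}

sub parse_sum_ {
    my ($tokens, $acc) = @_;              # $acc plays the role of $arg[0]
    return $acc unless @$tokens;          # empty production: { $arg[0] }

    my $op  = shift @$tokens;             # '+' or '-'
    my $num = shift @$tokens;             # NUM
    my $new = $op eq '+' ? $acc + $num : $acc - $num;
    return parse_sum_($tokens, $new);     # pass the running total downwards
}

print parse_sum([ 4, '-', 5, '+', 6 ]), "\n";   # 5
```

The left-most operation is applied before the recursive call is made, so the value handed to the deepest call is already the final result. That value is simply returned back up unchanged.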
Earlier, we ended up with the following rules for right-recursive binary operators:
pow : NUM '**' pow | NUM
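Unlike left-recursion, right-recursion poses no problem for Parse::RecDescent. However, Parse::RecDescent handles rules whose productions share an identical prefix very inefficiently: when the first production fails after matching NUM, the second production starts over and matches the same NUM again.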
Just like in algebra, we can factor out the common prefix into another rule.
pow  : NUM pow_
pow_ : '**' pow
     |
The complicated part is how to evaluate the expression or build the parse tree when one of the operands is matched by one rule, and the other is matched by a different rule. It turns out that doing this is very similar to eliminating left-recursion.
Just like when eliminating left-recursion, we can build a flat list of the whole chain of powers, and work with that. The difference is that the list will be processed from right to left.
Evaluating at parse time:
{
   # Evaluate a flat list such as (4, '**', 3, '**', 2) from right to left.
   sub eval_pow {
      my $acc = pop(@_);
      while (@_) {
         my $op = pop(@_);
         $acc = pop(@_) ** $acc;
      }
      return $acc;
   }
}

pow  : NUM pow_
       { eval_pow($item[1], @{$item[2]}) }

pow_ : '**' NUM pow_
       { [ $item[1], $item[2], @{$item[3]} ] }
     | { [] }

Building a parse tree:
{
   # Fold a flat list into nested [ op, left, right ] nodes, right to left.
   sub treeify_r {
      my $t = pop;
      $t = [ pop, pop, $t ] while @_;
      return $t;
   }
}

pow  : NUM pow_
       { treeify_r($item[1], @{$item[2]}) }

pow_ : '**' NUM pow_
       { [ $item[1], $item[2], @{$item[3]} ] }
     | { [] }
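As with the left-associative helpers, these two can be exercised outside of a grammar. The sketch below repeats them (with pop(@_) written out explicitly) and applies them to the flat list for 4 ** 3 ** 2. The show function is a hypothetical helper used only to print the tree.

```perl
use strict;
use warnings;

# Same logic as in the grammar: evaluate the flat list right to left.
sub eval_pow {
    my $acc = pop(@_);
    while (@_) {
        my $op = pop(@_);          # always '**' here, so it is not inspected
        $acc = pop(@_) ** $acc;
    }
    return $acc;
}

# Same logic as in the grammar: fold the list from the right.
sub treeify_r {
    my $t = pop(@_);
    $t = [ pop(@_), pop(@_), $t ] while @_;
    return $t;
}

# Hypothetical pretty-printer: renders a tree with explicit parentheses.
sub show {
    my ($node) = @_;
    return $node unless ref $node;
    my ($op, $lhs, $rhs) = @$node;
    return '(' . show($lhs) . " $op " . show($rhs) . ')';
}

my @list = ('4', '**', '3', '**', '2');
my $tree = treeify_r(@list);       # [ '**', '4', [ '**', '3', '2' ] ]

print eval_pow(@list), "\n";       # 262144
print show($tree), "\n";           # (4 ** (3 ** 2))
```

Here the innermost node holds the right-most operator, the mirror image of the left-associative case.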
Just like Parse::RecDescent has a directive for creating a flat list for a left-associative operator (<leftop>), it has one to create a flat list for a right-associative operator (<rightop>).
Evaluating at parse time:
{
   sub eval_pow {
      my $acc = pop(@_);
      while (@_) {
         my $op = pop(@_);
         $acc = pop(@_) ** $acc;
      }
      return $acc;
   }
}

pow : <rightop: NUM /(\*\*)/ NUM>
      { eval_pow(@{$item[1]}) }

Building a parse tree:
{
   sub treeify_r {
      my $t = pop;
      $t = [ pop, pop, $t ] while @_;
      return $t;
   }
}

pow : <rightop: NUM /(\*\*)/ NUM>
      { treeify_r(@{$item[1]}) }
Let's look at the algebra again. We can change
pow : NUM '**' pow { $item[1] ** $item[3] }
    | NUM          { $item[1] }
into
pow  : NUM pow_
pow_ : '**' pow { <<pow's $item[1]>> ** $item[2] }
     |          { <<pow's $item[1]>> }
The problem is that we have to pass $item[1] from pow to pow_. We've already seen that we can pass data from one rule to another using subrule arguments. When eliminating left-recursion, we used the subrule argument to carry the result accumulated so far down the chain of rules. When improving right-recursion, we simply pass the left operand from the main rule to the helper rule.
Evaluating at parse time:
pow  : NUM pow_[ $item[1] ]
pow_ : '**' pow { $arg[0] ** $item[2] }
     | { $arg[0] }

Building a parse tree:
pow  : NUM pow_[ $item[1] ]
pow_ : '**' pow { [ $item[1], $arg[0], $item[2] ] }
     | { $arg[0] }
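The same function analogy used for sums applies here. In the sketch below, parse_pow and parse_pow_ are hypothetical names and the input is assumed to be pre-tokenized. It mirrors the parse-time eval rules and shows that the argument is only handed over once, not accumulated.

```perl
use strict;
use warnings;

# Hand-written analogue of:
#   pow  : NUM pow_[ $item[1] ]
#   pow_ : '**' pow { $arg[0] ** $item[2] }
#        | { $arg[0] }
# Tokens are consumed from the front of the array referenced by $tokens.

sub parse_pow {
    my ($tokens) = @_;
    my $num = shift @$tokens;              # NUM
    return parse_pow_($tokens, $num);      # pow_[ $item[1] ]
}

sub parse_pow_ {
    my ($tokens, $base) = @_;              # $base plays the role of $arg[0]
    return $base unless @$tokens;          # empty production: { $arg[0] }

    shift @$tokens;                        # '**'
    my $exponent = parse_pow($tokens);     # everything to the right is evaluated first
    return $base ** $exponent;
}

print parse_pow([ 4, '**', 3, '**', 2 ]), "\n";   # 262144
```

Because the recursive call returns before the exponentiation happens, the right-most operator is applied first, which gives right-associativity.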
The following subsections contain complete, working code to parse expressions formed of the +, - and ** binary operators using the Subrule Argument methods. Parentheses are also supported to produce more meaningful results.
In order to support parentheses and to give the operators their proper precedence, the rules used in the upcoming code are slightly different from those seen earlier. Where NUM used to be in the productions, you will now find term (in sum/sum_) and sum (in pow/pow_).
The code in both subsections produces the same output: the following, minus the two explanatory "Demonstrates ..." lines.

Demonstrates left-associativity
4-5+6 = 5 got 5
(4-5)+6 = 5 got 5
4-(5+6) = -7 got -7

Demonstrates right-associativity
4**3**2 = 262144 got 262144
(4**3)**2 = 4096 got 4096
4**(3**2) = 262144 got 262144
Evaluating at parse time:
use strict;
use warnings;

use Parse::RecDescent ();

my $grammar = <<'__END_OF_GRAMMAR__';

   {
      use strict;
      use warnings;
   }

   parse : expr /^\Z/ { $item[1] }

   # Just an alias
   expr  : pow

   # vvv lowest precedence

   # pow  : sum '**' pow
   #      | sum
   pow   : sum pow_[ $item[1] ]
   pow_  : '**' pow { $arg[0] ** $item[2] }
         | { $arg[0] }

   # sum  : sum /[+-]/ term
   #      | term
   sum   : term sum_[ $item[1] ]
   sum_  : '+' term sum_[ $arg[0] + $item[2] ]
         | '-' term sum_[ $arg[0] - $item[2] ]
         | { $arg[0] }

   # ^^^ highest precedence

   term  : '(' expr ')' { $item[2] }
         | /\d+/

__END_OF_GRAMMAR__

my $parser = Parse::RecDescent->new($grammar)
   or die("Bad grammar\n");

foreach my $expr (
   '4-5+6',      # Demonstrates left-associativity
   '(4-5)+6',
   '4-(5+6)',

   '4**3**2',    # Demonstrates right-associativity
   '(4**3)**2',
   '4**(3**2)',
) {
   my $expected = eval $expr;
   my $got = $parser->parse($expr);
   print("$expr = $expected got $got\n");
}
Building a parse tree:
use strict;
use warnings;

use Parse::RecDescent ();

my $grammar = <<'__END_OF_GRAMMAR__';

   {
      use strict;
      use warnings;
   }

   parse : expr /^\Z/ { $item[1] }

   # Just an alias
   expr  : pow

   # vvv lowest precedence

   # pow  : sum '**' pow
   #      | sum
   pow   : sum pow_[ $item[1] ]
   pow_  : '**' pow { [ $item[1], $arg[0], $item[2] ] }
         | { $arg[0] }

   # sum  : sum /[+-]/ term
   #      | term
   sum   : term sum_[ $item[1] ]
   sum_  : /[+-]/ term sum_[ [ $item[1], $arg[0], $item[2] ] ]
         | { $arg[0] }

   # ^^^ highest precedence

   term  : '(' expr ')' { $item[2] }
         | /\d+/ { [ @item ] }

__END_OF_GRAMMAR__

my $parser = Parse::RecDescent->new($grammar)
   or die("Bad grammar\n");

# Each node is [ type, operands... ]; dispatch on the node type.
my %eval = (
   term => sub { $_[1] },
   '+'  => sub { eval_node($_[1]) +  eval_node($_[2]) },
   '-'  => sub { eval_node($_[1]) -  eval_node($_[2]) },
   '**' => sub { eval_node($_[1]) ** eval_node($_[2]) },
);

sub eval_node {
   my ($node) = @_;
   $eval{$node->[0]}->(@$node);
}

foreach my $expr (
   '4-5+6',      # Demonstrates left-associativity
   '(4-5)+6',
   '4-(5+6)',

   '4**3**2',    # Demonstrates right-associativity
   '(4**3)**2',
   '4**(3**2)',
) {
   my $expected = eval $expr;
   my $tree = $parser->parse($expr);
   my $got = eval_node($tree);
   print("$expr = $expected got $got\n");
}
Update Aug 13, 2006: The examples have been simplified. A right-associative operator is used for the right-associative examples. Parse-time eval was placed before parse tree building. Added section on simplifying right-recursion. Small additions were made here and there to improve clarity. It still needs to link to a tutorial on precedence.
Update Jun 13, 2014: Fixed spelling and grammar mistakes identified by hexcoder.
Update Oct 3, 2016: Fixed indexing problem raised by an anonymous monk.