Parse::RecDescent

Item Description: Lex/Yacc-like grammar parser for perl

Review Synopsis:

Item Description: Provides a yacc/lex-like functionality for parsing input data into Perl.
Review Synopsis: Replicates most of the features of yacc/lex minus a few shortcomings with addition benefits of using perl to make parsing complex input rather simple.
Module Author: Damian Conway
Module Homepage: http://www.cs.monash.edu.au/~damian

Description

Those that have dealt with free-form input files, which include nearly all programming languages in addition to other third party programs, probably have had to deal with parsing such input using tools called lex (LEXical Analyzer Generator) and yacc (Yet Another Compiler Compiler) or their GNU cousins, flex and bison. Using a combination of these tools, one could derive a grammar specification and use that to place data into data structures. These tools are common around C and C++ folk.

However, with perl and it's strong regex engine, it's possible to replicate most input functions with advanced regexs, and thus, there usually isn't much need to worry about grammar specifications. However, there can be times where you just cannot do what you need to do with regexs without additional logic, but would be well-suited to using a grammar to interpret data. For example, given

names = John Joe "Mary Sue" James
[download]

you cannot simply get every name after the '=' by splitting on spaces using a regex. But with a grammar parser, this problem is simple.

Introducing Parse::RecDescent, a grammar parser for Perl. It mainly works by calling two functions: a new routine with the desired grammar in order to prepare the parser, and a startrule which takes in the input to be processed and returns whatever results are needed. There are other functions available to swap rules in or out of the grammar set if needed, but most of the functionality is built into the initial description of the grammar rules set.

For example, consider simple math problems. The grammar will look something like this:

startrule:             calculation
statement:             grouped_op | number
grouped_op:            '(' calculation ')'
calculation:           add_calculation | subtract_calculation
add_calculation:       statement '+' statement
subtract_calculation:  statement '-' statement
number:                /\d+/
[download]

Basically, the text on the left are rules, and the statements on the right are what the parser expects to see when it is expecting those rules. Note that these can be recursive and the like; thus if I'm looking for a 'statement', and see a 'group_op', I could see yet another 'statement' inside that 'group_op'. Parse::RecDescent uses a special rule called "startrule" as the default starting point, though you can tell it to use a different rule if you desire. At the very basic level, Parse::RecDescent can determine if the input matches that grammar, and returns true or false:

LISTING 1

#!/usr/bin/perl -wT

use strict;
use Parse::RecDescent;

my $parser = new Parse::RecDescent( q{
startrule:             calculation eofile
statement:             grouped_op | number
grouped_op:            '(' calculation ')'
calculation:           add_calculation | subtract_calculation
add_calculation:       statement '+' statement
subtract_calculation:  statement '-' statement
number:                /\d+/
eofile:                /^\Z/
} );

my @tests = (
             "3",                    # BAD
             "4 + ",                 # BAD
             "4 + 8",                # GOOD
           "4 + 8 +",              # BAD
             "6 + (3 - 2)",          # GOOD
             "6 + ()3 - 2",          # BAD
             "(6 + (3 - 2)) + 1",    # GOOD
             "(6 + (3 - 2) + 1)",    # BAD
             "(6 + 3"                # BAD
             );

foreach my $test ( @tests ) {
     print $test . ': ' .
         ( defined($parser->startrule( $test ) )?"good":"bad") . "\n";
}
[download]

Obviously, my math here is not perfect, but for demonstration purposes, this is sufficient (One can play with grammar files indefinitely to continue to improve performances). One thing of note in the grammar specs is the use of 'eofile'; because of how Parse::RecDescent treats data (namely, trying to go left to right without any backtracking), it will make the most non-greedy match (think '.*?') it can, and once matched, it will be happy. So, as in case 4, "4 + 8 + ", without 'eofile' in the grammar rules, the parser will recognized "4 + 8" and be happy with that, the trailing plus sign ignored. The 'eofile' basically forces the match to work on the entire text string and not just the first parts it can match.

Now, this might seem like overkill, as pattern matching is easily done by regexes. However, behind most of Parse::RecDescent is a powerful way to manipulate values that have been recognized, either for in-line processing or later evaluation. Without specifying any of these extra actions, the default behavior of Parse::RecDescent is to return the value of the last item matched in the rule. However, if you simply want a tree-like structure of how the input is defined, one can include a pragma "<autotree>" at the top of the grammar string that, when parsing, will return a hash with all the data in it. Alternatively, you could perform in-line calculations which is well suited to our sample program above...

LISTING 2

#!/usr/bin/perl -wT

use strict;
use Parse::RecDescent;

my $parser = new Parse::RecDescent( q{
startrule:             calculation eofile  { $return = $item[1]; }
statement:             grouped_op | number
grouped_op:            '(' calculation ')' { $return = $item[2]; }
calculation:           add_calculation | subtract_calculation
add_calculation:       statement '+' statement { $return = $item[1]+$i
+tem[3]; }
subtract_calculation:  statement '-' statement { $return = $item[1]-$i
+tem[3]; }
number:                /\d+/
eofile:                /^\Z/
} );

my @tests = (
             "4 + 8",
             "6 + (3 - 2)",
             "(6 + (3 - 2)) + 1",
             );

foreach my $test ( @tests ) {
     print $test . ': ' . $parser->startrule( $test ) . "\n";
}

# Output is:
#4 + 8: 12
#6 + (3 - 2): 7
#(6 + (3 - 2)) + 1: 8
[download]

Demonstrated here are two of the special variables that are used in defining these actions. @item is a list of the values of what actually matched the rule, starting with 1 and going up, and including each 'constant' as well (eg, in add_calculation, $item[2] would be '+' regardless of other arguments.) $return is the value that will be passed up to previous rules as the value for that rule. By default, as mentioned above, if there is no special action rule, the value of the last item matched will be passed along. As you can see here, I've created a very simple calculator without doing any of the logic within perl itself; you could easily extend this to do much more advanced calculations.

There are many more features that will benefit those working with grammars. There are a number additional features of specifying the grammar itself, including matching on repeated items ala regex, look-aheads, passing matched values to further rules, as well as start-up and default actions. Within actions, there are a number of other special variables that are useful for parsing statements, including current line and column numbers for error reporting.

An addition feature of Parse::RecDescent are directives that are included in the grammar portion of the system. One has already been mentioned, <autotree>, which automatically builds a hash tree of the parsed data, but there are a number of others. There are ones that can be used to parse strings or perl code, ones to provide scoring to determine which rules of multiple ones should be used, error reporting directives, conditional directives, and others to help improve the speed of parsing by bailing out of guaranteed failed matches.

A final feature of Parse::RecDescent is the ability to set debugging flags. These are named as $::RD_<feature>, and can be used to influence the behavior of the engine or to output what rules and steps it is currently processing.

Most of these features are layed out well in the included documentation, although it is expected that you have had previously knowledge of using grammars prior to using the module. As the module is written in pure Perl, it should be usable on all platforms that can run Perl.

Benefits

The grammar specification has tried to stick as close as possible to lex/yacc's grammar definitions as to make the use of existing grammars easy within Parse::RecDescent. In addition, several additional features have been added to make grammar work in a more perl-like way, making it very easy to add such features to an existing perl framework.

Pitfalls/Caveats

One thing that has been mentioned is that unlike perl's regex engine, which is 'greedy' and tries to consume as much as the string as possible when matching, Parse::RecDescent is the opposite and tends to be less greedy. This may cause problems such as letting trailing characters go through or missing parts of matches. Grammars that worker perfectly in lex/yacc may not work straight into Parse::RecDescent.

Conclusions

If you have any data input that is free form, then you should be using a grammar system to read that data into your program to avoid pitfalls with balancing quotes, whitespace, and other features. Parse::RecDescent provides an excellent equivalent for lex/yacc within perl, and assuming you have previous knowledge of using grammars, you'll be up and running this in no time.

Comment on Parse::RecDescent Select or Download Code

Replies are listed 'Best First'.
Re: Parse::RecDescent by clemburg (Curate) on Apr 23, 2001 at 13:36 UTC
It should be emphasized that Parse::RecDescent is strictly more powerful than the lex/yacc approach. E.g., it is relatively simple to write a parser that reconfigures its parsing rules on the fly, based on its input. All in all, if you are familiar with Perl and parsing, Parse::RecDescent is the way to go, especially for self-modifying systems. Christian Lemburg Brainbench MVP for Perl http://www.brainbench.com	[reply]
Re: Parse::RecDescent by princepawn (Parson) on Apr 23, 2001 at 21:35 UTC
Parse::RecDescent has its place, but even its author opted for YAPP when doing the parsing for Lingua::Romana::Perligata for reasons discussed in last year's Perl 5 Conference. Top-down parsing is good for changing the overall structure of parse phrasing. Bottom up makes hardcore tokenizing easier, and that's why Damian opted for a module other than even his own in this case.	[reply]
Re: Re: Parse::RecDescent by Masem (Monsignor) on Apr 24, 2001 at 02:07 UTC
There are still places where you'd want to use Parse::RecDescent over YAPP, based on what I've looked into the latter, a couple which concern me for a project I'm working on; first, you cannot match code blocks if the outside language is not pure perl, unless you insert the entire perl grammar tree within your recipe. This is not a fun prospect. Secondly, YAPP requires you to work with outside tools to generate .pm files that basically become your parser; this doesn't allow for dynamically changing rules or easily creating new parser sets on the fly. Mind you, this latter condition is probably a rarity in terms of programmers needs; most programs that use such parsers will have a fixed grammer where the improved reliability of YAPP will play out much better. (Maybe I'll play with that as well and put up a comparible review). Dr. Michael K. Neylon - mneylon-pm@masemware.com \|\| "You've left the lens cap of your mind on again, Pinky" - The Brain	[reply]
Re: Parse::RecDescent by webfiend (Vicar) on Mar 19, 2009 at 16:55 UTC
This seems like a good place to mention that you'll need to place `use` statements in a startup block to take advantage of any Perl 5.10-isms. Code slightly modified from an example in Advanced Perl 2nd Edition: use Modern::Perl; use Parse::RecDescent; my $grammar = q{ { use Modern::Perl } GameTree : "(" Sequence GameTree(s?) ")" Sequence : Node(s) Node : ":" Property(s) { say "I saw a node!" } Property : PropIdent PropValue(s) PropIdent : /[A-Z]+/ PropValue : { extract_bracketed($text, '[ ]') } }; my $sgf_parser = Parse::RecDescent->new($grammar); undef $/; my $sgf = <DATA>; say $sgf_parser->GameTree($sgf); __DATA__ (:GM[1]FF[4]AP[CGoban:2]ST[2]RU[Japanese] PW[Honinbo Shuwa]PB[Yasuda Shusaku] WR[7d]BR[5d] :B[qd]:W[dc]:B[pq]:W[oc]:B[cp]:W[qo] :B[pe]C[This is the famous Shusaku opening".]) [download] I could be wrong, of course. This is how I got the thing to run with `say` in the action, though.	[reply] [d/l] [select]
Re^2: Parse::RecDescent by ikegami (Patriarch) on Mar 19, 2009 at 17:20 UTC
No, you're right. What's in the top-level curlies ends up being top-level code in the package. My parsers always look like: `use strict; use warnings; my $grammar = <<'__EOI__'; { use strict; use warnings; } ... __EOI__` [download] Note the use of heredocs. Much better than `q{}` since you don't have to escape some backslashes and some curlies. For example, to have a production match a single backslash, Single quote heredoc: `rule: /\\/` Single quote literal: `rule: /\\\/` or `rule: /\\\\/`	[reply] [d/l] [select]


laziness, impatience, and hubris
	PerlMonks