Syntactic Confectionery Delight PerlMonks

### Simple Parser Combinator Implementation

by withering (Monk)
 on Sep 07, 2013 at 04:51 UTC Need Help??

Hello fellow monks,

While there is already Parser::Combinators on CPAN, I write a simple parser combinator demo whose idea is mainly from Hutton92 (Graham Hutton, Higher-Order Functions for Parsing, Journal of Functional Programming 2(3):323–343, July 1992). I try to make it look like boost.spirit - especially the sequence combinator ('then'), the alternative combinator ('alt') and the semantic action combinator ('using').

Here is the code of combinators:

```{
package wraith;

{
package inner_lazy;

sub TIESCALAR {
my (\$class, \$val) = @_;
bless \$_[1], \$class
}

sub FETCH {
my (\$self) = @_;
\$self->()
}
}

'>>' => "then_impl",
'|' => "alt_impl",
'**' => "using_impl";

sub deref {
my @args = @_;
for my \$elt (@args) {
if (ref(\$elt) eq "wraith_rule") {
\$elt = \$\$elt;
}
}
@args
}

our \$concat =
sub {
my @list_of_lists = @_;
my @list;

for my \$elt (@list_of_lists) {
push @list, \$_ for @\$elt;
}
\@list
};

our \$succeed =
bless
sub {
my \$v = \$_[0];
bless
sub {
my \$u = (ref(\$v) eq "ARRAY") ? \$v : [ \$v ];
[ [ \$u, \$_[0] ] ]
}
};

our \$fail =
bless
sub {
[]
};

our \$satisfy =
bless
sub {
my (\$p, \$m) = @_;
\$m = sub { \$_[0] =~ /(.)(.*)/s } if not \$m;
bless
sub {
if (my (\$x, \$xs) = \$m->(\$_[0])) {
if (\$p->(\$x)) {
return \$succeed->(\$x)->(\$xs);
} else {
return \$fail->(\$xs);
}
} else {
return \$fail->( [] );
}
}
};

our \$literal =
bless
sub {
my \$y = \$_[0];
\$satisfy->(
sub {
\$y eq \$_[0]
}
)
};

our \$literals =
bless
sub {
my \$y = \$_[0];
\$satisfy->(
sub {
index(\$y, \$_[0]) != -1
}
)
};

our \$token =
bless
sub {
my (\$tok, \$skip) = @_;
\$skip = '\s*' if not \$skip;
\$satisfy->(
sub { 1 },
sub {
\$_[0] =~ /^\$skip(\$tok)(.*)/s
}
)
};

sub alt_impl {
my (\$p1_, \$p2_, \$discard) = @_;
bless
sub {
my (\$p1, \$p2) = deref(\$p1_, \$p2_);
my \$inp = \$_[0];
\$concat->(\$p1->(\$inp), \$p2->(\$inp))
}
}
our \$alt = bless \&alt_impl;

sub then_impl {
my \$arglist = \@_;
bless
sub {
my (\$p1) = deref(\$arglist->[0]);
my \$inp = \$_[0];
my \$reslist1 = \$p1->(\$inp);
my \$finlist = [];
for my \$respair (@\$reslist1) {
my (\$p2) = deref(\$arglist->[1]);
my \$reslist2 = \$p2->(\$respair->[1]);
for my \$finpair (@\$reslist2) {
push @\$finlist, [ \$concat->(\$respair->[0], \$finpai
+r->[0]), \$finpair->[1] ];
}
}
\$finlist
}
}
our \$then = bless \&then_impl;

sub using_impl {
my (\$p_, \$f, \$discard) = @_;
bless
sub {
my (\$p) = deref(\$p_);
my \$inp = \$_[0];
my \$reslist = \$p->(\$inp);
my \$finlist = [];
for my \$respair (@\$reslist) {
push @\$finlist, [ \$f->(\$respair->[0]), \$respair->[1] ]
+;
}
\$finlist
}
}
our \$using = bless \&using_impl;

sub many_impl {
my \$p = \$_[0];
my \$f;
tie \$f, "inner_lazy", sub { many_impl(\$p) };
\$alt->(\$then->(\$p, \$f), \$succeed->( [] ))
}
our \$many = bless \&many_impl;
}

{
package wraith_rule;

our @ISA = qw ( wraith );

sub makerule {
bless \$_[0]
}

sub makerules {
my (\$class, @args) = @_;
for my \$elt (@args) {
\$elt = makerule(\$elt);
}
@args
}
}

wraith_rule seems ugly. But I have to make it a subclass of the combinator class wraith for convenience of writing rules further. The idea of Hutton92 is not monadic nor memorized. Perhaps I will try those techniques later.

Below is a demo for using the combinators. It demonstrates a simply modified untyped lambda calculus language, which is later translated into perl snippets. It's good to see the great similarity between Perl and lambda calculus - except the eval order :D

```# test case: a lambda-calculus-to-perl translator

my %expr_root = ( "kind" => "list", "defn" => [], "term" => [] );
my \$rootref = \%expr_root;

my (\$formlist, \$form, \$term, \$varlist, \$appterm, \$aterm);
wraith_rule->makerules(\\$formlist, \\$form, \\$term, \\$varlist, \\$appter
+m, \\$aterm);

\$formlist = \$wraith::many->(\\$form);
\$form = ( (\\$term >> \$wraith::token->(';')) **
sub {
[ { "kind" => "term", "body" => \$_[0]->[0] } ]
}
) |
( (\$wraith::token->('[A-Za-z_]+') >> \$wraith::token->('=') >>
+\\$term >> \$wraith::token->(';')) **
sub {
[ { "kind" => "defn", "name" => \$_[0]->[0], "body" =>
+\$_[0]->[2] } ]
}
);
\$term = ( (\\$appterm) ** sub { [ { "kind" => "appl", "body" => \$_[0]->
+[0] } ] } ) |
( (\$wraith::token->('\\\\') >> \\$varlist >> \$wraith::token->('
+\.') >> \\$term) **
sub {
[ { "kind" => "abst", "para" => \$_[0]->[1], "body" =>
+\$_[0]->[3] } ]
}
);
\$varlist = (\$wraith::many->(\$wraith::token->('[A-Za-z_]+'))) ** sub {
+[ \$_[0] ] };
\$appterm = (\$wraith::many->(\\$aterm)) ** sub { [ \$_[0] ] };
\$aterm = ( (\$wraith::token->('\(') >> \\$term >> \$wraith::token->('\)')
+) **
sub { [ { "kind" => "applterm", "body" => \$_[0]->[1] } ] }
) |
( (\$wraith::token->('[A-Za-z_]+')) ** sub { [ { "kind" => "ap
+plvar", "val" => \$_[0]->[0] } ] } );

sub emitabst;
sub emitappl;
sub emitapplterm;
sub emitapplvar;
sub emitterm;
sub emitdefn;

my %emitmethods = (
"term" => \&emitterm,
"defn" => \&emitdefn,
"appl" => \&emitappl,
"abst" => \&emitabst,
"applterm" => \&emitapplterm,
"applvar" => \&emitapplvar
);

sub emitabst {
my \$abstref = \$_[0];
my \$params = \$abstref->{"para"};
my \$nparams = @\$params;
my \$c_param = shift @\$params;
my \$codefrag = undef;
if (\$nparams) {
\$codefrag .= "sub { my \\$\$c_param = \\$_[0]; ";
}
if (@\$params) {
\$codefrag .= emitabst(\$abstref);
} else {
\$codefrag .= \$emitmethods{\$abstref->{"body"}->{"kind"}}->(\$abs
+tref->{"body"});
}
\$codefrag.' }'
}

sub emitappl {
my \$applref = \$_[0];
my \$oplist = \$applref->{"body"};
my \$codefrag = undef;
while (@\$oplist) {
my \$opitr = shift @\$oplist;
\$codefrag .= '( ';
}
\$codefrag .= \$emitmethods{\$opitr->{"kind"}}->(\$opitr);
\$codefrag .= ' )';
}
if (@\$oplist) {
\$codefrag .= '->';
}
}
\$codefrag
}

sub emitapplterm {
my \$atermref = \$_[0];
\$emitmethods{\$atermref->{"body"}->{"kind"}}->(\$atermref->{"body"})
}

sub emitapplvar {
my \$varref = \$_[0];
'\$'. \$varref->{"val"}
}

sub emitterm {
my \$termref = \$_[0];
\$emitmethods{\$termref->{"body"}->{"kind"}}->(\$termref->{"body"})
}

sub emitdefn {
my \$defnref = \$_[0];
'my \$' . \$defnref->{"name"} .' = '. \$emitmethods{\$defnref->{"body"
+}->{"kind"}}->(\$defnref->{"body"}) .';'
}

my \$res = \$formlist->('true = \x y.x; x x y; Y = \f.(\x y.f (x x)) (\x
+ y. f (x x));');
for my \$itr (@{\$res->[0]->[0]}) {
if (\$itr->{"kind"} eq "term") {
push \$expr_root{"term"}, \$itr;
} else {
push \$expr_root{"defn"}, \$itr;
}
}

my (\$defnlist, \$termlist) = (\$rootref->{"defn"}, \$rootref->{"term"});

print "# defnlist: \n";
for my \$defnitr (@\$defnlist) {
print emitdefn(\$defnitr); print "\n";
}

print "# termlist: \n";
for my \$termitr (@\$termlist) {
print emitterm(\$termitr); print "\n";
}

The complete code is available at https://github.com/Akvelog/a-little-five/blob/master/_5wraith.pl

Any advise is welcome. Feel free to correct any of my mistakes (especially usage of English for I am not a native speaker).

-------- update --------

For people not familiar with parser combinators:

Parser combinators are functions which take a string and return a intermediate analysis result -- AST, tuple, calculation result, .etc. Their parsing ability is slightly better than LL(1) parsers -- they are able to process products with common prefix but still die on left-recursive grammars.

The usage of combinators is quite simple: just rewrite the EBNF, replacing each symbol with a combinator (basic combinator or a compsite combinator generated by applying operators 'then', 'alt', 'many', .etc). The combinator corresponding to the start symbol of the grammar is the expected parser. Apply a string to the parser and there will be a list of analysis results:

( (result_1, input_left_1), (result_2, input_left_2), ... (result_n, input_left_n) )

where input_left_i is the unprocessed input string. Obiviously, the results with empty unprocessed input strings are correct. If the language is not ambiguous, there will be only one correct result.

Two of the well-known parser combinator implementations are boost.spirit (c++) and parsec (haskell). Readers may found some interesting info or tutorials in their documentation.

People not familiar with lambda calculus may refer to http://perl.plover.com/lambda/ . This article demonstrates how Perl itself contains the (untyped) lambda calculus with code demo, essays and slides. Readers may found it interesting.

-------- update Sep 17 2013 --------

I made it a CPAN module Wraith. All suggestions and criticisms are welcome.

Replies are listed 'Best First'.
Re: Simple Parser Combinator Implementation
by roboticus (Chancellor) on Sep 07, 2013 at 22:10 UTC

That looks like it might be very interesting--I like playing with parsing, so I typically read all nodes that talk about it. Unfortunately, I'm not conversant with functional programming, nor familiar with the Hutton paper. So I can't really dig into it unless I'm willing to do a bit of digging.

If you want people to take a closer look at it you might need to add a couple of links to your post to the relevent background information. A sentence or two describing why and how we would use such a thing would be even better.

Having said that, I find the first block of code to be mostly clear and easy to read. If I were going to try to work on it, it doesn't look like it would be too difficult to maintain. There are a few odd variable names, but the way you're using other variable names, I'm guessing that they're relatively obvious contractions to someone familiar with the problem domain.

The only criticisms I can offer at this time are:

• In some functions the variable names are far too short to be meaningful. Giving more meaningful names to the variables here and there would make the code even easier to read.
• I may be missing something, but it appears that you're trying to bless some code references, but not providing the class to bless the code reference into. I don't do much object-oriented coding in perl, so it could easily be that I'm simply missing something.

That's the best I can do. Without enough background, I'm afraid the second code block is rather difficult for me to read. I can decipher bits of the syntax, but I can't see why or how you're tying it all together.

I hope you find this somewhat helpful.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Thanks a lot for your criticisms!

Though the paper can be found through Google Scholar, I will add some simple introduction for the combinators later -- try to make them short :D

The odd or short variable names, as you said, are just mathematic-style names, whose meaning is clear to people familiar with the parser combinator theory. I will rename them in the later version of my local clone.

As for the blessings, please refer to perldoc bless. There is the second form 'bless REF', where CLASSNAME is omitted and the current package is used.

The former test case I used is a expression calculator:

```my (\$expn, \$term, \$factor, \$num);
wraith_rule->makerules(\\$expn, \\$term, \\$factor, \\$num);

\$expn = ( (\\$term >> \$wraith::token->('\+') >> \\$expn) ** sub { [ \$_[0
+]->[0] + \$_[0]->[2] ] } ) |
( (\\$term >> \$wraith::token->('-') >> \\$expn) ** sub { [ \$_[0]
+->[0] - \$_[0]->[2] ] } ) |
( \\$term );

\$term = ( (\\$factor >> \$wraith::token->('\*') >> \\$term) ** sub { [ \$_
+[0]->[0] * \$_[0]->[2] ] } ) |
( (\\$factor >> \$wraith::token->('\/') >> \\$term) **
sub { \$_[0]->[2] ? [ \$_[0]->[0] / \$_[0]->[2] ] : [] } ) |
( \\$factor );

\$factor = ( (\\$num) ** sub { my \$args = \$_[0]; my \$val = undef; for my
+ \$elt (@\$args) { \$val .= \$elt; } [ \$val ] } ) |
( ( \$wraith::token->('\(') >> \\$expn >> \$wraith::token->('\)
+') ) ** sub { my \$args = \$_[0]; [ \$args->[1] ] } );

\$num = \$wraith::token->('[1-9][0-9]*');

print \$expn->('2 + (4 - 1) * 3 + 4 -2')->[0]->[0]->[0], "\n";

The corresponding BNFs are

E -> T + E | T - E | T,

T -> F * T | F / T | F,

F -> num | ( E ),

where num is a terminal symbol (a natural number).

The overloaded operator >> means sequence, i.e, A >> B means the concatenation of A and B. Operator | has the same meaning with BNF operator | (alternative). Perl operator ** is overloaded for semantic action, e.g, ALPHA ** sub { ... } means that when product ALPHA is correctly matched, the second operand of ** is executed, with its only argument being a reference to a list of return values of each term (terminal or nonterminal symbol) in product ALPHA. Just like what we use in YAPP except there is only one argument to the semantic action sub.

Re: Simple Parser Combinator Implementation
by raiph (Chaplain) on Sep 12, 2013 at 15:23 UTC
I would love to see a dialog comparing withering's work with Perl 6's comparable features. If you know both, please summarize how they are similar and how they are different. If you know just parser combinators, or just Perl 6, please ask some questions to get things rolling. Here's hoping someone jumps in and we have an informative exchange. :)

That's a good point. I searched Perl 6 PEG and found it really powerful. Glad that PEG is now a core part of the language - it's rather exciting to write embedded DSLs with PEG, and if the implementation uses memoization, we could expect a better performance than native parser combinators ( and no worse than memoized combinators ) and a little more convenience than Frost's memoized combinators, which need some wrappers for handle real-world tokens.

I'm not sure whether PEG in Perl 6 can handle ambiguous grammars or not. Parser combinators were designed to parse natural language sentences at first but were also found useful for parsing programming languages later. It seems not so necessary for PEG to choose the first match if the implementation can endure the memory cost when generating the intermediate representation.

I suspect no one's going to bite here at PM. I encourage you to visit with Perl 6 folk on the #perl6 IRC channel and ask about Perl 6's PEG, its current and anticipated future performance, Graham Hutton's work, boost.spirit, your code, and development of related modules on CPAN for Perl 5. Larry Wall is often on the channel with the nick TimToady and I'm fairly confident he'll be delighted to discuss such matters. And you might bump into Patrick Michaud who has written numerous grammar engines over the years to match Perl 6's evolving design.

(I read and summarize the daily logs so I'll see your discussion there.)

Create A New User
Node Status?
node history
Node Type: perlmeditation [id://1052799]
Approved by Corion
Front-paged by ambrus
help
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (4)
As of 2018-01-24 06:44 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
How did you see in the new year?

Results (256 votes). Check out past polls.

Notices?