Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Re: Reversible parsing (with Parse::RecDescent?)

by ikegami (Pope)
on Mar 14, 2008 at 22:18 UTC ( #674305=note: print w/ replies, xml ) Need Help??


in reply to Reversible parsing (with Parse::RecDescent?)

I haven't thought of this topic much before you brought it up and I haven't researched it, so I'm making it up as I go along. However, I do have experience with parsing and with P::RD specifically, so hopefully you'll find it useful.


  • Is it necessarily possible to reverse the data through the grammar to get the command line?
  • or is this only possible for special grammars?
  • Is there something I need to in my grammar to make this possible?

Yes, it's possible to deparse a parse tree, if the parse tree contains sufficient information. That means that your parser must return sufficient information to regenerate an acceptably similar source. (e.g. If your language accepts single-quoted and double-quoted string literals, you may not care which one is used in the deparsed text as long as the resulting text is equivalent to the original. You probably don't care about whitespace and comments either.) With P::RD, that is controlled by the grammar.

  • Is there a module that already reverses the parsing of Parse::RecDescent parsers?
  • Is there another module that makes this bi-directionality easier (I'd prefer not to learn another grammar specification syntax, but could)?
  • Is there a module that already does all this soup-to-nuts?

I'm afraid the grammar couldn't be the same as a the one used by P::RD. Actions ({ ... }) and many directives (<...>) cannot be automatically reversed. Error checks would be different. etc.

Now, something like BNF could be automatically reversed, so you could write a parser and deparser that accepted pure BNF. The problems with this approach is that the product returned by the parser would be of limited use. It would require further processing. This is what the P::RD lumps in with the grammar, and that's what's not automatically reversible.


Next, let's address some issues with the grammar.

Bugs:

  • Use of captures is undocumented and unnecessary. Use "{ $item[1] }" instead of captures and "{ $1 }".
  • "/ident/" incorrectly matches the start of "identx".
  • You don't check if the end of input was reached, so trailing errors in your input can be ignored. In fact, it happens in one of your test cases: ".dfo" is silently ignored in "hap -b sknxharvest01 -enc testfile.dfo". Note the "/\Z/" I added.
  • A common mistake is to assume the strict and warnings settings from the caller are used for the parser. You need to include these in the code block in the grammar if you want them.

Sketchy Code:

  • No reason to compile the grammar every time.
  • "rule : subrule { $item[1] }" is the same as "rule : subrule".
  • Each rule invoked adds a fair bit of time to the overhead when using P::RD, so I inlined the trivial ones.
  • Something that's optional should be made optional by a rule, not by matching a zero-length string. (Re: optvalue)

Stylistic:

  • I renamed "startrule" to "parse" since I find "$parser->parse($text)" appealing.
  • Shortened "command", "pacakge" and "option" to their well known abbreviations: "cmd", "pkg" and "opt".
  • I seperated token rules from other rules and uppercased them. P::RD doesn't distinguish between tokens and higher-level rules, but there tends to be subtle differences between them. For example, I tend to avoid including $item[0] in the returned value for tokens, while I tend to include it for other rules.
  • I align : and |. I find this helps readability a lot.
# build_parser.pl use strict; use warnings; use Parse::RecDescent qw( ); my $grammar = <<'__END_OF_GRAMMAR__'; { # Lexical pragmas and lexcial variables # here are in the same scope as the sub # created for each grammar rule. use strict; use warnings; my %KNOWN_COMMANDS = map { $_ => 1 } qw( hap hccmrg hup hspp hpp hpg hdp hdlp hcp ); } parse : cmdline /\Z/ { $item[1] } cmdline : pkgcmd opt(s?) { [ @item[1,2] ] } pkgcmd : IDENT pkgcmd_[ $item[1] ] { $item[1] } pkgcmd_ : { $KNOWN_COMMANDS{$arg[0]} } | <error:Unknown command> opt : FLAG opt_ { [ @item[1,2] ] } opt_ : IDENT | { 1 } FLAG : /-\w+/ IDENT : /\w+/ __END_OF_GRAMMAR__ Parse::RecDescent->Precompile($grammar, 'Parser') or die("Bad Grammar\n");

In this post, I show how to deparse the tree created by the above grammar. For more complex trees, it can be very useful to have each non-token rule returned a bless object. Then some method in the class would be responsible for deparsing itself instead of relying on Deparse to do it. I'll get to that in a separate post.

A quick note before contining: You'll notice the parse tree doesn't differentiate between "-opt" and "-opt 1", so you'll get "-opt 1" when you deparse it. This may be acceptable, where it would be a case of being "acceptably similar" as mentioned above. If it's not acceptable, then it's a case of the parse tree not holding enough information as is.

Basically, I created a function for each rule. The function resemble the associated rule, but deparses it instead. Note that your grammar is very trivial (has no choices in it really), which is why you didn't need to use $item[0] (or some other constants) anywhere, and you don't have to make any choices when deparsing.

# Deparse.pm use strict; use warnings; package Deparser; my $skip = ' '; sub add_skip { if (length($_[0])) { return $skip . $_[0]; } else { return ''; } } sub loop(&@) { my $cb = shift(@_); return join $skip, map { $cb->() } @_; } sub deparse { &cmdline } sub cmdline { my ($node) = @_; my ($pkgcmd, $opts) = @$node; my $text = pkgcmd($pkgcmd); $text .= add_skip(loop { opt($_) } @$opts); return $text; } sub pkgcmd { my ($node) = @_; return IDENT($node); } sub opt { my ($node) = @_; my ($flag, $val) = @$node; my $text = FLAG($flag); $text .= add_skip(IDENT($val)); return $text; } sub FLAG { return $_[0] }; sub IDENT { return $_[0] }; 1;

As you can see, it's not too hard. Just make sure to reflect changes to your parser with changes to your deparser.

Now let's use the code.

# test.pl use strict; use warnings; use Data::Dumper qw( Dumper ); use Parser qw( ); use Deparser qw( ); { my $parser = Parser->new(); while (<DATA>) { chomp; print("Unparsed: $_\n"); my $tree = $parser->parse($_) or do { print("Bad data: $_\n"); next; }; my $dumper = Data::Dumper->new([ $tree ]); $dumper->Useqq(1); $dumper->Terse(1); $dumper->Indent(0); print("Parsed: ", $dumper->Dump(), "\n"); my $deparsed = Deparser::deparse($tree); print("Deparsed: $deparsed\n"); } continue { print("\n"); } } __DATA__ hap -b hap -b sknxharvest01 hap -b sknxharvest01 -enc testfile.dfo hap -b sknxharvest01 -usr cgowing -pass chaspass hap -badflag hap -prompt hap -b sknxharvest01 -prompt hap -prompt -b sknxharvest01

Results:

>perl build_parser.pl >perl test.pl Unparsed: hap -b Parsed: ["hap",[["-b",1]]] Deparsed: hap -b 1 Unparsed: hap -b sknxharvest01 Parsed: ["hap",[["-b","sknxharvest01"]]] Deparsed: hap -b sknxharvest01 Unparsed: hap -b sknxharvest01 -enc testfile.dfo Bad data: hap -b sknxharvest01 -enc testfile.dfo Unparsed: hap -b sknxharvest01 -usr cgowing -pass chaspass Parsed: ["hap",[["-b","sknxharvest01"],["-usr","cgowing"],["-pass"," +chaspass"]]] Deparsed: hap -b sknxharvest01 -usr cgowing -pass chaspass Unparsed: hap -badflag Parsed: ["hap",[["-badflag",1]]] Deparsed: hap -badflag 1 Unparsed: hap -prompt Parsed: ["hap",[["-prompt",1]]] Deparsed: hap -prompt 1 Unparsed: hap -b sknxharvest01 -prompt Parsed: ["hap",[["-b","sknxharvest01"],["-prompt",1]]] Deparsed: hap -b sknxharvest01 -prompt 1 Unparsed: hap -prompt -b sknxharvest01 Parsed: ["hap",[["-prompt",1],["-b","sknxharvest01"]]] Deparsed: hap -prompt 1 -b sknxharvest01

Time for a break! Blessed version later.


Comment on Re: Reversible parsing (with Parse::RecDescent?)
Select or Download Code
Re^2: Reversible parsing (with Parse::RecDescent?)
by goibhniu (Hermit) on Mar 17, 2008 at 01:38 UTC

    First of all, thanks to Joost and Blockhead for your input above. It's a valuable, helpful discussion, but I think the bulk of my response is to ikegami, here.

    ikegami, your work is VERY much appreciated. I've always admired the elegance of your code, and this is no exception. I have learned alot about writing P::RD grammars by just skimming yours, above, and know that with study I can learn more.

    One thing I keep thinking both in my own development process on this project and in reading your (and Joost's and blockhead's) responses is that I'm definitely in a simplistic, proof-of-concept phase on this. You've pointed out that my grammar is currently trivial, and I think that is partly why. My concern is in the general case where I may have a much expanded and much less trivial grammar to maintain. At this point your tricks and style suggestions will go a long way toward helping with that; thank you. When I realized the error of my assumptions with P::RD::Deparse, I thought I would end up writing a function or functions to deparse it manually, and your examples present a good organization for doing so.

    The question that still nags me is whether its necessary. I think your point about "-opt" and "-opt 1" prooves that a grammar is not necessarily reversible (and, I think, provides a problematic example for the joost/blockhead thread, above). Two different strings parse to the same parse tree. I could try to fix these on a case-by-case basis, but wonder if there's a general strategy for writing a grammar to avoid them (especially since I'm free to change the data structure emitted by the grammar if I please), or a general property of a grammar that could detect them (possibly for use in a mythic P::RD::ReverseParse for detecting error conditions in the input grammar). <update> What I'd like to avoid is making changes in two places; on the grammar side and on the deparse function side. </update>

    In the general case, you propose that P::RD Actions and Directives along with error checking make it not possible. What would be an example case to illustrate the limitations of a BNF parser (and is there one already written on CPAN? the closest hit I found was Parse::Marpa)?

    <update index='2' date='17-Mar-2008'> I stumbled across Tree::Parser, which seems to take coderefs to a parse subroutine separate from a deparse subroutine. This is consistent with ikegami's solution and is another example that contradicts my laziness in wanting to maintain the parse / deparse rules in one place. Perhaps I should give up and just write the deparse versions. </update>


    #my sig used to say 'I humbly seek wisdom. '. Now it says:
    use strict;
    use warnings;
    I humbly seek wisdom.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://674305]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others meditating upon the Monastery: (11)
As of 2014-07-30 18:20 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (239 votes), past polls