Parse::RecDescent Grammar Fun

ichimunki has asked for the wisdom of the Perl Monks concerning the following question:

Below is my code. I am happy with my grammar except that if a token is surrounded by punctuation then the token gets slurped into it. I want the third line of $text to result in ... <punct: ..><link: spaced><punct: ..> ..., not in ... <punct: ..[[><word: spaced><punct: ]]> ... (as happens currently). Anybody know how I can fix it? Also, if it looks like my grammar is screwy, or there are other things I can do to improve this, let me know-- I am just getting my arms around the basics of this very cool module, so pointers will be appreciated. Thanks!

#!/usr/bin/perl -w
use strict;
use Parse::RecDescent;

my $grammar = join( '', <DATA> );

my $parser = Parse::RecDescent->new( $grammar ) or
    die "Error: Bad grammar\n";

#while(<STDIN>){ $text .= $_; }
my $text =<< "SUB_STDIN";
A nicely [[spaced]] link.
A poorly[[spaced]]link.
Another poorly..[[spaced]]..link.
SUB_STDIN

my $results = $parser->startrule( $text ) or
    die "Error: Bad text\n";

__DATA__

startrule: <skip:''> bit(s) 
bit: eol | word | space | token | punct

eol: /\n[ \t]*/ {print "<newline>\n" }
space: /[ \t]+/ {print "< >" }
word: /[\w\']+/ {print "<word: $item[1]>" }
punct: /[^\w\s]+/ {print "<punct: $item[1]>" }

token: link
link: /\[\[(.+?)\]\]/ {print "<link: $item[1]>" }

Re: Parse::RecDescent Grammar Fun by hsmyers (Canon) on Jul 24, 2002 at 20:53 UTC
Mostly your immediate problem is that you told the grammar that [ and ] are 'punct', so it happily eats the delimiter for 'link'! Try: `punct: /[^\w\s\[\]]+/ {print "<punct: $item[1]>" }` Note the inclusion of [ and ] (escaped) in the definition of what 'punct' is 'not'. --hsm "Never try to teach a pig to sing...it wastes your time and it annoys the pig."
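To see the slurping in isolation, here is a minimal sketch in plain Perl, without Parse::RecDescent. It applies the original punct pattern and the bracket-excluding one from the reply above to the tail of the third sample line, anchored at the point where punct gets its turn. The variable names are made up for illustration.

```perl
use strict;
use warnings;

# Tail of the third sample line, starting where punct is tried.
my $rest = "..[[spaced]]..link.";

# Original punct pattern: any run of non-word, non-space characters.
my ($greedy) = $rest =~ /\A([^\w\s]+)/;

# Pattern from the reply: square brackets are excluded from the run.
my ($no_brackets) = $rest =~ /\A([^\w\s\[\]]+)/;

print "greedy:      $greedy\n";       # ..[[
print "no brackets: $no_brackets\n";  # ..
```

The first pattern swallows the opening `[[`, so the link rule never sees it; the second stops at the first bracket.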
Re: Re: Parse::RecDescent Grammar Fun by ichimunki (Priest) on Jul 24, 2002 at 21:13 UTC
Ah, but they are punct if they appear as [ or ] (singles). Also, they qualify as punct if they appear as pairs out of order: ]] stuff [[. They would also be punct if you just had [[ and no matching ]]. They would also be punct if they appeared as [[]]-- you can't have a link to absolutely nothing. :) So far I'm solving the problem by simply forcing punct to be a single character instead: `/[^\w\s]/`. That way the punct rule doesn't eat up my string before it can run the token test on relevant pieces, since token is checked before punct. What I was kind of hoping for was a way to use a negative lookahead of some sort... or a way to build an extra rule layer that would somehow account for this.
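The ordering argument can be sketched without the module. The following is only an approximation in core Perl: a hypothetical `tokenize` helper that tries the patterns at each position in the same order as the `bit` rule (eol, word, space, link, punct), with punct limited to one character. It is not how Parse::RecDescent works internally, and unlike the grammar's `$item[1]` it reports the captured link content rather than the whole match.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only. Patterns are tried in the
# grammar's order at the current position; \G plus /gc keeps the position
# in place when a pattern fails to match.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
    while ( pos($text) < length $text ) {
        if    ( $text =~ /\G\n[ \t]*/gc )       { push @tokens, "<newline>" }
        elsif ( $text =~ /\G([\w']+)/gc )       { push @tokens, "<word: $1>" }
        elsif ( $text =~ /\G[ \t]+/gc )         { push @tokens, "< >" }
        elsif ( $text =~ /\G\[\[(.+?)\]\]/gc )  { push @tokens, "<link: $1>" }
        elsif ( $text =~ /\G([^\w\s])/gc )      { push @tokens, "<punct: $1>" }
        else                                    { last }  # nothing matched; stop
    }
    return @tokens;
}

print join( '', tokenize("Another poorly..[[spaced]]..link.\n") );
print join( '', tokenize("[[]]") ), "\n";
```

Because punct can only take one character per step, the link pattern gets a chance at every `[`, and the degenerate cases mentioned above (`]] stuff [[`, an unmatched `[[`, or `[[]]`) fall through to single-character punct tokens.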
Re: Re: Re: Parse::RecDescent Grammar Fun by hsmyers (Canon) on Jul 24, 2002 at 22:51 UTC
This may not be all that robust, but it is better than square one...

#!/usr/bin/perl -w
use strict;
use Parse::RecDescent;

my $grammar = join( '', <DATA> );

my $parser = Parse::RecDescent->new( $grammar ) or
    die "Error: Bad grammar\n";

my $text =<< "SUB_STDIN";
A nicely [[spaced]] link.
A poorly[[spaced]]link.
Another poorly..[[spaced]]..link.
A harder problem [spaced] link.
Yet harder still..]]spaced[[..link.
The real problem is [[spaced]] followed by ]] or [[ link.
SUB_STDIN

my $results = $parser->startrule( $text ) or
    die "Error: Bad text\n";

__DATA__

startrule: <skip:''> bit(s)
bit: eol | word | space | token | punct

eol: /\n[ \t]*/ {print "<newline>\n" }
space: /[ \t]+/ {print "< >" }
word: /[\w\']+/ {print "<word: $item[1]>" }
punct: /[^\w\s\[\]]+/ {print "<punct: $item[1]>" }
     | /(?<!\[)\[(?!\[)/ {print "<punct: $item[1]>" }
     | /(?<!\])\](?!\])/ {print "<punct: $item[1]>" }
     | /(?<!\[\[)\]\]/ {print "<punct: $item[1]>" }
     | /(?<!\]\])\[\[/ {print "<punct: $item[1]>" }

token: link
link: /\[\[(.+?)\]\]/ {print "<link: $item[1]>" }

Notice: new test cases for bracket as 'punct'.

--hsm "Never try to teach a pig to sing...it wastes your time and it annoys the pig."
Re: Re: Re: Re: Parse::RecDescent Grammar Fun by ichimunki (Priest) on Jul 25, 2002 at 00:52 UTC
Re: Parse::RecDescent Grammar Fun by Abigail-II (Bishop) on Jul 25, 2002 at 11:25 UTC
Your problem is the greediness of punct. If it's ok that `!@#$` will be parsed as four tokens, you should be able to get away with: `punct: /[^\w\s]/ {print "<punct: $item[1]>" }` although I have not tested it. Otherwise, you may want to try something like (also untested): `punct: /(?:[^\w\s[]+|\[[^\w\s[]*|\[\[(?!.+?]]))+/ {print "<punct: $item[1]>"}` Note that your link rule consumes `"[[]]]"` completely, as a leading `[[`, a `]` as the content, and a trailing `]]`. Similarly, `"[[]] foo [[]]"` is consumed by the link rule completely, with `"]] foo [["` as the part inside the 'link'. Abigail
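The two observations about the link rule can be checked with the bare regex, outside the grammar. This is a small sketch; `link_match` is a hypothetical helper that returns the whole match and the captured content when the pattern matches at the start of the input.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the link pattern at the start of the input
# and return (whole match, captured content), or an empty list on failure.
sub link_match {
    my ($input) = @_;
    return $input =~ /\A(\[\[(.+?)\]\])/ ? ( $1, $2 ) : ();
}

for my $input ( '[[]]]', '[[]] foo [[]]' ) {
    my ( $whole, $content ) = link_match($input);
    print "input: '$input'  matched: '$whole'  content: '$content'\n";
}
```

Since `.+?` must take at least one character, the first `]` after `[[` is treated as content, which is why both inputs are consumed in full.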
Re: Re: Parse::RecDescent Grammar Fun by ichimunki (Priest) on Jul 25, 2002 at 14:21 UTC
Yes, I think letting punct be single character tokens is the way to go here. That regex is probably intuitive to some, but it looks like a maintenance nightmare. Thanks for the info on the link rule. I had completely forgotten to stop to think about what all that would match.

