PerlMonks  

Parse::RecDescent Grammar Fun

by ichimunki (Priest)
on Jul 24, 2002 at 17:57 UTC (#184966, perlquestion)
ichimunki has asked for the wisdom of the Perl Monks concerning the following question:

Below is my code. I am happy with my grammar except that if a token is surrounded by punctuation then the token gets slurped into it. I want the third line of $text to result in ... <punct: ..><link: spaced><punct: ..> ..., not in ... <punct: ..[[><word: spaced><punct: ]]> ... (as happens currently). Anybody know how I can fix it? Also, if it looks like my grammar is screwy, or there are other things I can do to improve this, let me know-- I am just getting my arms around the basics of this very cool module, so pointers will be appreciated. Thanks!
#!/usr/bin/perl -w
use strict;
use Parse::RecDescent;

my $grammar = join( '', <DATA> );
my $parser = Parse::RecDescent->new( $grammar )
    or die "Error: Bad grammar\n";

#while(<STDIN>){ $text .= $_; }
my $text =<< "SUB_STDIN";
A nicely [[spaced]] link.
A poorly[[spaced]]link.
Another poorly..[[spaced]]..link.
SUB_STDIN

my $results = $parser->startrule( $text )
    or die "Error: Bad text\n";

__DATA__
startrule: <skip:''> bit(s)
bit: eol | word | space | token | punct
eol: /\n[ \t]*/ {print "<newline>\n" }
space: /[ \t]+/ {print "< >" }
word: /[\w\']+/ {print "<word: $item[1]>" }
punct: /[^\w\s]+/ {print "<punct: $item[1]>" }
token: link
link: /\[\[(.+?)\]\]/ {print "<link: $item[1]>" }

Replies are listed 'Best First'.
Re: Parse::RecDescent Grammar Fun
by hsmyers (Canon) on Jul 24, 2002 at 20:53 UTC
    Mostly your immediate problem is that you told the grammar that [ and ] are 'punct' so it happily eats the delimiter for 'link'! Try:
    punct: /[^\w\s\[\]]+/ {print "<punct: $item[1]>" }
    Note the inclusion of [ and ] (escaped) in the definition of what 'punct' is 'not'.
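To see the difference outside the grammar, here is a minimal sketch in plain Perl (no Parse::RecDescent involved). The helper name `leading_punct` and the sample string are made up for illustration; the two patterns are the original punct class and the one suggested above.

```perl
use strict;
use warnings;

# Return the run of characters a punct-style pattern grabs at the
# start of $text (empty string if it does not match there).
sub leading_punct {
    my ($pattern, $text) = @_;
    return $text =~ /\A($pattern)/ ? $1 : '';
}

my $greedy     = qr/[^\w\s]+/;        # original punct: brackets included
my $no_bracket = qr/[^\w\s\[\]]+/;    # suggested punct: brackets excluded

my $sample = '..[[spaced]]..link.';

printf "original:  '%s'\n", leading_punct($greedy,     $sample);
printf "suggested: '%s'\n", leading_punct($no_bracket, $sample);
```

The original class swallows `..[[`, so the link rule never sees its opening delimiter; the bracket-free class stops after `..`.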

    --hsm

    "Never try to teach a pig to sing...it wastes your time and it annoys the pig."
      Ah, but they are punct if they appear as [ or ] (singles). Also, they qualify as punct if they appear as pairs out of order: ]] stuff [[. They would also be punct if you just had [[ and no matching ]]. They would also be punct if they appeared as [[]]-- you can't have a link to absolutely nothing. :)

      So far I'm solving the problem by simply forcing punct to be a single character instead: /[^\w\s]/. That way the punct rule doesn't eat up my string before it can run the token test on relevant pieces, since token is checked before punct. What I was kind of hoping for was a way to use a negative lookahead of some sort... or a way to build an extra rule layer that would somehow account for this.
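A rough sketch of why the single-character approach works, written as a plain Perl `\G` loop rather than a Parse::RecDescent grammar. It only imitates the "first alternative that matches wins" ordering of the `bit` rule; the `@rules` table and the `tokenize` sub are illustrative names, not part of the module.

```perl
use strict;
use warnings;

# Ordered alternatives, mirroring "bit: eol | word | space | token | punct".
my @rules = (
    [ eol   => qr/\n[ \t]*/        ],
    [ word  => qr/[\w']+/          ],
    [ space => qr/[ \t]+/          ],
    [ link  => qr/\[\[(?:.+?)\]\]/ ],
    [ punct => qr/[^\w\s]/         ],   # one character only
);

# Split $text into [name, matched_text] pairs; the first rule that
# matches at the current position wins.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
    TOKEN: while (pos($text) < length $text) {
        for my $rule (@rules) {
            my ($name, $re) = @$rule;
            if ($text =~ /\G($re)/gc) {
                push @tokens, [ $name, $1 ];
                next TOKEN;
            }
        }
        die sprintf "No rule matches at offset %d\n", pos($text);
    }
    return @tokens;
}

print join('', map { "<$_->[0]: $_->[1]>" } tokenize('poorly..[[spaced]]..link.')), "\n";
```

Because punct consumes only one character per step, the link pattern gets a fresh chance at every position, including the one where `[[` starts. Unmatched or out-of-order brackets still fall through to punct one at a time.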

        This may not be all that robust, but it is better than square one...
        #!/usr/bin/perl -w
        use strict;
        use Parse::RecDescent;

        my $grammar = join( '', <DATA> );
        my $parser = Parse::RecDescent->new( $grammar )
            or die "Error: Bad grammar\n";

        my $text =<< "SUB_STDIN";
        A nicely [[spaced]] link.
        A poorly[[spaced]]link.
        Another poorly..[[spaced]]..link.
        A harder problem [spaced] link.
        Yet harder still..]]spaced[[..link.
        The real problem is [[spaced]] followed by ]] or [[ link.
        SUB_STDIN

        my $results = $parser->startrule( $text )
            or die "Error: Bad text\n";

        __DATA__
        startrule: <skip:''> bit(s)
        bit: eol | word | space | token | punct
        eol: /\n[ \t]*/ {print "<newline>\n" }
        space: /[ \t]+/ {print "< >" }
        word: /[\w\']+/ {print "<word: $item[1]>" }
        punct: /[^\w\s\[\]]+/ {print "<punct: $item[1]>" }
             | /(?<!\[)\[(?!\[)/ {print "<punct: $item[1]>" }
             | /(?<!\])\](?!\])/ {print "<punct: $item[1]>" }
             | /(?<!\[\[)\]\]/ {print "<punct: $item[1]>" }
             | /(?<!\]\])\[\[/ {print "<punct: $item[1]>" }
        token: link
        link: /\[\[(.+?)\]\]/ {print "<link: $item[1]>" }
        Notice: new test cases for bracket as 'punct'.

        --hsm
Re: Parse::RecDescent Grammar Fun
by Abigail-II (Bishop) on Jul 25, 2002 at 11:25 UTC
    Your problem is the greediness of punct. If it's ok that !@#$ will be parsed as four tokens, you should be able to get away with:
    punct: /[^\w\s]/ {print "<punct: $item[1]>" }
    although I have not tested it.

    Otherwise, you may want to try something like (also untested):

    punct: /(?:[^\w\s[]+|\[[^\w\s[]*|\[\[(?!.+?]]))+/ {print "<punct: $item[1]>"}

    Note that your link rule consumes "[[]]]" completely, as a leading [[, a ] as the content, and a trailing ]]. Similarly, "[[]] foo [[]]" is consumed by the link rule completely, with "]] foo [[" as the part inside the 'link'.
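The two cases above are easy to check in plain Perl. In the sketch below, `$loose` is the link pattern from the question. `$strict` is only an assumed alternative (not something proposed in this thread) that forbids brackets inside the link text, which also rules out legitimate link text containing a single bracket. The helper name `link_body` is made up for illustration.

```perl
use strict;
use warnings;

my $loose  = qr/\[\[(.+?)\]\]/;          # link pattern from the question
my $strict = qr/\[\[([^\[\]]+)\]\]/;     # hypothetical: no brackets inside

# Return the captured link body, or undef when the pattern does not match.
sub link_body {
    my ($re, $text) = @_;
    return $text =~ $re ? $1 : undef;
}

for my $sample ('[[]]]', '[[]] foo [[]]', '[[spaced]]') {
    for my $pair ([ loose => $loose ], [ strict => $strict ]) {
        my ($name, $re) = @$pair;
        my $body = link_body($re, $sample);
        printf "%-6s %-15s => %s\n", $name, "'$sample'",
            defined $body ? "'$body'" : '(no match)';
    }
}
```

With the loose pattern the first two samples yield `]` and `]] foo [[` as link bodies, as described above; the stricter class rejects both and still accepts `[[spaced]]`.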

    Abigail

      Yes, I think letting punct be single-character tokens is the way to go here. That regex is probably intuitive to some, but it looks like a maintenance nightmare.

      Thanks for the info on the link rule. I had completely forgotten to stop to think about what all that would match.
