PerlMonks  

Parse::RecDescent Grammar Fun

by ichimunki (Priest)
on Jul 24, 2002 at 17:57 UTC ( [id://184966]=perlquestion )

ichimunki has asked for the wisdom of the Perl Monks concerning the following question:

Below is my code. I am happy with my grammar except that if a token is surrounded by punctuation then the token gets slurped into it. I want the third line of $text to result in ... <punct: ..><link: spaced><punct: ..> ..., not in ... <punct: ..[[><word: spaced><punct: ]]> ... (as happens currently). Anybody know how I can fix it? Also, if it looks like my grammar is screwy, or there are other things I can do to improve this, let me know-- I am just getting my arms around the basics of this very cool module, so pointers will be appreciated. Thanks!
#!/usr/bin/perl -w
use strict;
use Parse::RecDescent;

my $grammar = join( '', <DATA> );
my $parser = Parse::RecDescent->new( $grammar )
    or die "Error: Bad grammar\n";

#while(<STDIN>){ $text .= $_; }
my $text =<< "SUB_STDIN";
A nicely [[spaced]] link.
A poorly[[spaced]]link.
Another poorly..[[spaced]]..link.
SUB_STDIN

my $results = $parser->startrule( $text )
    or die "Error: Bad text\n";

__DATA__
startrule: <skip:''> bit(s)
bit: eol | word | space | token | punct
eol: /\n[ \t]*/ {print "<newline>\n" }
space: /[ \t]+/ {print "< >" }
word: /[\w\']+/ {print "<word: $item[1]>" }
punct: /[^\w\s]+/ {print "<punct: $item[1]>" }
token: link
link: /\[\[(.+?)\]\]/ {print "<link: $item[1]>" }

Replies are listed 'Best First'.
Re: Parse::RecDescent Grammar Fun
by hsmyers (Canon) on Jul 24, 2002 at 20:53 UTC
    Mostly your immediate problem is that you told the grammar that [ and ] are 'punct' so it happily eats the delimiter for 'link'! Try:
    punct: /[^\w\s\[\]]+/ {print "<punct: $item[1]>" }
    Note the inclusion of [ and ] (escaped) in the definition of what 'punct' is 'not'.

    --hsm

    "Never try to teach a pig to sing...it wastes your time and it annoys the pig."
      Ah, but they are punct if they appear as [ or ] (singles). Also, they qualify as punct if they appear as pairs out of order: ]] stuff [[. They would also be punct if you just had [[ and no matching ]]. They would also be punct if they appeared as [[]]-- you can't have a link to absolutely nothing. :)

      So far I'm solving the problem by simply forcing punct to be a single character instead: /[^\w\s]/. That way the punct rule doesn't eat up my string before it can run the token test on relevant pieces, since token is checked before punct. What I was kind of hoping for was a way to use a negative lookahead of some sort... or a way to build an extra rule layer that would somehow account for this.
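To see why the single-character workaround helps, here is a minimal sketch that mimics the rule order of the grammar (eol, word, space, link, punct) with a hand-rolled `\G` tokenizer in plain Perl. It is not Parse::RecDescent itself, and the `tokenize` helper is hypothetical, written only for this illustration. Unlike the grammar's action, it prints the captured link content (`$1`).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: at each position, try the alternatives in the same
# order as the grammar's "bit" rule, using \G and /gc in place of the parser.
sub tokenize {
    my ( $text, $punct_re ) = @_;
    my $out = '';
    pos($text) = 0;
    while ( pos($text) < length $text ) {
        if    ( $text =~ /\G\n[ \t]*/gc )      { $out .= "<newline>\n" }
        elsif ( $text =~ /\G([\w']+)/gc )      { $out .= "<word: $1>" }
        elsif ( $text =~ /\G[ \t]+/gc )        { $out .= "< >" }
        elsif ( $text =~ /\G\[\[(.+?)\]\]/gc ) { $out .= "<link: $1>" }
        elsif ( $text =~ /\G($punct_re)/gc )   { $out .= "<punct: $1>" }
        else  { die "No rule matches at offset " . pos($text) . "\n" }
    }
    return $out;
}

my $line = "Another poorly..[[spaced]]..link.";

# Original punct: one or more punctuation characters, so ".." drags "[[" along.
my $greedy = tokenize( $line, qr/[^\w\s]+/ );

# Workaround: one character per punct, so the link test gets a turn at "[[".
my $single = tokenize( $line, qr/[^\w\s]/ );

print "greedy: $greedy\n";
print "single: $single\n";
```

With the greedy pattern the link alternative never gets a chance at `[[`, because `..[[` has already been consumed as one punct token. With one character per punct, each `.` is consumed separately and the link test runs at the `[[` position.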

        This may not be all that robust, but it is better than square one...
        #!/usr/bin/perl -w
        use strict;
        use Parse::RecDescent;

        my $grammar = join( '', <DATA> );
        my $parser = Parse::RecDescent->new( $grammar )
            or die "Error: Bad grammar\n";

        my $text =<< "SUB_STDIN";
        A nicely [[spaced]] link.
        A poorly[[spaced]]link.
        Another poorly..[[spaced]]..link.
        A harder problem [spaced] link.
        Yet harder still..]]spaced[[..link.
        The real problem is [[spaced]] followed by ]] or [[ link.
        SUB_STDIN

        my $results = $parser->startrule( $text )
            or die "Error: Bad text\n";

        __DATA__
        startrule: <skip:''> bit(s)
        bit: eol | word | space | token | punct
        eol: /\n[ \t]*/ {print "<newline>\n" }
        space: /[ \t]+/ {print "< >" }
        word: /[\w\']+/ {print "<word: $item[1]>" }
        punct: /[^\w\s\[\]]+/ {print "<punct: $item[1]>" }
             | /(?<!\[)\[(?!\[)/ {print "<punct: $item[1]>" }
             | /(?<!\])\](?!\])/ {print "<punct: $item[1]>" }
             | /(?<!\[\[)\]\]/ {print "<punct: $item[1]>" }
             | /(?<!\]\])\[\[/ {print "<punct: $item[1]>" }
        token: link
        link: /\[\[(.+?)\]\]/ {print "<link: $item[1]>" }
        Notice: new test cases for bracket as 'punct'.

        --hsm

Re: Parse::RecDescent Grammar Fun
by Abigail-II (Bishop) on Jul 25, 2002 at 11:25 UTC
    Your problem is the greediness of punct. If it's OK that !@#$ will be parsed as four tokens, you should be able to get away with:
    punct: /[^\w\s]/ {print "<punct: $item[1]>" }
    although I have not tested it.

    Otherwise, you may want to try something like (also untested):

    punct: /(?:[^\w\s[]+|\[[^\w\s[]*|\[\[(?!.+?]]))+/ {print "<punct: $item[1]>"}

    Note that your link rule consumes "[[]]]" completely, as a leading [[, a ] as the content, and a trailing ]]. Similarly, "[[]] foo [[]]" is consumed by the link rule completely, with "]] foo [[" as the part inside the 'link'.
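The over-matching described above can be checked with the link pattern alone, outside the parser. The sketch below does that, and compares it with a stricter pattern that forbids brackets inside the link content. The stricter pattern and the `link_content` helper are assumptions made for this illustration, not something proposed in the thread, and the stricter pattern is only suitable if link names never contain `[` or `]`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The link pattern exactly as it appears in the grammar.
my $link_re = qr/\[\[(.+?)\]\]/;

# Assumption: a stricter variant whose content may not contain brackets.
my $strict_re = qr/\[\[([^\[\]]+)\]\]/;

# Hypothetical helper: return the captured content if the pattern matches
# at the very start of the text, undef otherwise.
sub link_content {
    my ( $re, $text ) = @_;
    return $text =~ /\A$re/ ? $1 : undef;
}

for my $text ( '[[]]]', '[[]] foo [[]]', '[[spaced]]' ) {
    for my $pair ( [ original => $link_re ], [ strict => $strict_re ] ) {
        my ( $name, $re ) = @$pair;
        my $content = link_content( $re, $text );
        printf "%-8s %-15s => %s\n", $name, $text,
            defined $content ? "content '$content'" : 'no match';
    }
}
```

The original pattern accepts the first two inputs because `.+?` must match at least one character and will happily take a `]` to do so. The stricter variant rejects both and still accepts `[[spaced]]`.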

    Abigail

      Yes, I think letting punct be single character tokens is the way to go here. That regex is probably intuitive to some, but it looks like a maintenance nightmare.

      Thanks for the info on the link rule. I had completely forgotten to stop to think about what all that would match.
