PerlMonks  

Re: Re: Parse::RecDescent Grammar Fun

by ichimunki (Priest)
on Jul 24, 2002 at 21:13 UTC (#185032)


in reply to Re: Parse::RecDescent Grammar Fun
in thread Parse::RecDescent Grammar Fun

Ah, but they are punct if they appear as singles: [ or ]. They also qualify as punct if they appear as pairs out of order: ]] stuff [[. The same goes for a [[ with no matching ]], and for an empty [[]], since you can't have a link to absolutely nothing. :)

So far I'm solving the problem by simply forcing punct to be a single character instead: /[^\w\s]/. That way the punct rule doesn't eat up my string before it can run the token test on relevant pieces, since token is checked before punct. What I was kind of hoping for was a way to use a negative lookahead of some sort... or a way to build an extra rule layer that would somehow account for this.
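A minimal sketch of the negative-lookahead idea, written with plain Perl regexes rather than Parse::RecDescent so it runs without the module. The `tokenize` helper and its token names are hypothetical, made up for illustration. The punct pattern consumes a run of punctuation but refuses to step onto any position where a complete `[[...]]` link begins, so the link test still gets its chance. The same pattern could presumably serve as the regex in the punct rule, but that is untested here.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into [type, text] pairs.
# Branch order mirrors the grammar: link (token) is tried before punct.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
    while ( pos($text) < length $text ) {
        if ( $text =~ /\G(\[\[.+?\]\])/gc ) {
            push @tokens, [ link => $1 ];
        }
        elsif ( $text =~ /\G([\w']+)/gc ) {
            push @tokens, [ word => $1 ];
        }
        elsif ( $text =~ /\G(\s+)/gc ) {
            push @tokens, [ space => $1 ];
        }
        # Multi-character punct: each character is accepted only if a
        # well-formed [[...]] link does NOT start at that position.
        elsif ( $text =~ /\G((?:(?!\[\[.+?\]\])[^\w\s])+)/gc ) {
            push @tokens, [ punct => $1 ];
        }
        else {
            die "No rule matched at offset " . pos($text) . "\n";
        }
    }
    return @tokens;
}

for my $line ( 'Another poorly..[[spaced]]..link.',
               'Yet harder still..]]spaced[[..link.' ) {
    print join( '', map { "<$_->[0]: $_->[1]>" } tokenize($line) ), "\n";
}
```

With this pattern, stray or unmatched brackets fall through to punct, while punctuation that sits directly against a real link stops short of it.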


Re: Re: Re: Parse::RecDescent Grammar Fun
by hsmyers (Canon) on Jul 24, 2002 at 22:51 UTC
    This may not be all that robust, but it is better than square one...
    #!/usr/bin/perl -w
    use strict;
    use Parse::RecDescent;

    my $grammar = join( '', <DATA> );
    my $parser = Parse::RecDescent->new( $grammar ) or die "Error: Bad grammar\n";
    my $text = <<"SUB_STDIN";
    A nicely [[spaced]] link.
    A poorly[[spaced]]link.
    Another poorly..[[spaced]]..link.
    A harder problem [spaced] link.
    Yet harder still..]]spaced[[..link.
    The real problem is [[spaced]] followed by ]] or [[ link.
    SUB_STDIN
    my $results = $parser->startrule( $text ) or die "Error: Bad text\n";
    __DATA__
    startrule: <skip:''> bit(s)
    bit: eol | word | space | token | punct
    eol: /\n[ \t]*/ {print "<newline>\n" }
    space: /[ \t]+/ {print "< >" }
    word: /[\w\']+/ {print "<word: $item[1]>" }
    punct: /[^\w\s\[\]]+/ {print "<punct: $item[1]>" }
         | /(?<!\[)\[(?!\[)/ {print "<punct: $item[1]>" }
         | /(?<!\])\](?!\])/ {print "<punct: $item[1]>" }
         | /(?<!\[\[)\]\]/ {print "<punct: $item[1]>" }
         | /(?<!\]\])\[\[/ {print "<punct: $item[1]>" }
    token: link
    link: /\[\[(.+?)\]\]/ {print "<link: $item[1]>" }
    Note the new test cases for brackets as 'punct'.

    --hsm

    "Never try to teach a pig to sing...it wastes your time and it annoys the pig."
      Hahaha! We're going to beat this grammar into submission yet. :)

      Unfortunately we can't brute-force it: think of the labor and testing involved in adding new tags. I think the best I can do here is to collect punct as single-character chunks (storing them in a temp var), then, when I get to a token, insert that temp var back into the tree. I'd post code, but rather than printing discrete sensible morphemes, I really just need the morphemes (productions in P::RD-speak) concatenated into a string. For that purpose, whether it emits punct one character at a time or in chunks won't matter.
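      A rough sketch of that buffering idea in plain Perl, independent of Parse::RecDescent. The `merge_punct` helper and the [type, text] piece layout are assumptions made for illustration, not anything the module provides. Single-character punct pieces are held in a temp var and flushed as one chunk as soon as any other piece arrives.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a flat list of [type, text] pieces in which
# punct arrives one character at a time, and returns the list with each
# run of consecutive punct pieces merged into a single piece.
sub merge_punct {
    my @pieces  = @_;
    my @out;
    my $pending = '';    # the "temp var" holding buffered punctuation
    for my $piece (@pieces) {
        my ( $type, $text ) = @$piece;
        if ( $type eq 'punct' ) {
            $pending .= $text;
            next;
        }
        # A non-punct piece arrived: flush the buffer first.
        if ( length $pending ) {
            push @out, [ punct => $pending ];
            $pending = '';
        }
        push @out, [ $type, $text ];
    }
    push @out, [ punct => $pending ] if length $pending;
    return @out;
}

my @pieces = (
    [ word  => 'poorly' ],
    [ punct => '.' ],
    [ punct => '.' ],
    [ link  => '[[spaced]]' ],
    [ punct => '.' ],
);
my @merged = merge_punct(@pieces);

print join( '', map { "<$_->[0]: $_->[1]>" } @merged ), "\n";

# Concatenating the text gives the same string either way.
print join( '', map { $_->[1] } @pieces ), "\n";
print join( '', map { $_->[1] } @merged ), "\n";
```

      The last two lines print the same string, which is why the chunking of punct makes no difference once everything is concatenated.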

      Either way, this Parse::RecDescent module is the best thing since HTML::TokeParser[::Simple], imho.
