Parse::RecDescent

ixo111 has asked for the wisdom of the Perl Monks concerning the following question:

hi folks,

I'm trying to write a Parse::RecDescent chunk that will split up a line of input into <tags> and non-tags such that

"<this>this is tagged</this> <bye>bye!</bye>";

becomes a list suchlike -

[
   '<this>',
   'this is tagged',
   '</this>',
   ' ',
   '<bye>',
   'bye!',
   '</bye>'
]

the below works, but only if TOKEN is specified as 'TAG | LITERAL' .. i'm sure there are a billion things i don't understand about Parse::RecDescent

use Parse::RecDescent;

$Parser = Parse::RecDescent->new(
   q(
      TAG       : <skip:''> /\\<.*?\\>/
      LITERAL   : <skip:''> /[^<]*/
      TOKEN     : TAG | LITERAL

      startrule:
         TOKEN(s) {
            @main::PARSED = @{$item[1]};
         }
   )
);

can anyone shed any light on why TOKEN would not evaluate in an OR | context properly if the arguments are reversed? (any other comments on what i may be doing improperly are more than welcome)

thanks!

Edit kudra, 2002-10-27 Replaced code tags around entire message with markup


Replies are listed 'Best First'.
Re: Parse::RecDescent by graff (Chancellor) on Oct 27, 2002 at 04:22 UTC
Having looked at the RecDescent man page just now, I'd say that the "or" conjunction causes the first element (before the "|") to be tried first, and the latter to be tried only if the first does not match. In your case, the LITERAL condition matches not only the content between tags, but also everything except the "<" of the tags themselves. When you test for LITERAL first, TAG never has an opportunity to match.

Being less familiar with RecDescent, I thought it might be easy to make a suitable regex for use with `split()` to get the same result, but it seems that this alone would not be enough: you'd have to munge the data a little, before or after the split, to get your array. For example, assuming the whole string with tags and content is in $_ (and hoping there are no "spurious" angle brackets), here's a method that inserts split-able strings around tags so that split produces the intended array:

   s/(.)</$1=!=</gs;     # use some distinct pattern to mark open brackets
   s/>(?!=!=)/>=!=/g;    # and close brackets that aren't adjacent to "<"
   @tokens = split( /=!=/ );

If the string begins with "<", the first substitution won't match that, so we won't get an empty (fictitious) first element in @tokens. Likewise, the second substitution makes sure that no empty tokens are generated where the input had "...><..." (nothing between adjacent tags). The marker it adds after a string-final ">" is harmless, because split discards trailing empty fields.
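For anyone who wants to try the marker-and-split idea as a standalone script, here is a minimal sketch applied to the sample line from the question. The variable names `$line` and `$marked` are made up for illustration, and it assumes the marker string "=!=" never appears in the input itself.

```perl
use strict;
use warnings;

# Sample input from the original question.
my $line = '<this>this is tagged</this> <bye>bye!</bye>';

# Assumption: the marker "=!=" never occurs in the input.
(my $marked = $line) =~ s/(.)</$1=!=</gs;   # mark every "<" that has a character before it
$marked =~ s/>(?!=!=)/>=!=/g;               # mark every ">" not already followed by a marker
my @tokens = split /=!=/, $marked;          # split drops trailing empty fields

print "[$_]\n" for @tokens;
```

Run against that input, this should print the seven elements listed in the question, including the single space between `</this>` and `<bye>`.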
Re: Re: Parse::RecDescent by ixo111 (Acolyte) on Oct 30, 2002 at 10:36 UTC
heheh, true :) .. i had a working implementation using regexp, but i've long wanted to at least have some experience with Parse::RecDescent, so i determined that I'd use it in this case, easy or no ;) .. thanks for the help!
Re: Parse::RecDescent by gjb (Vicar) on Oct 27, 2002 at 00:24 UTC
`LITERAL` can match the empty string due to the `*`; I have a bad feeling about that :-) Hope this helps, -gjb-
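One way to see the concern, sketched in plain Perl rather than inside Parse::RecDescent: at a position where the next character is "<", the `*` form of the pattern still succeeds while consuming nothing, whereas a `+` form fails there. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my $input = '<this>this is tagged</this>';

# LITERAL as posted: zero or more non-"<" characters, anchored at the start.
my ($star) = $input =~ /^([^<]*)/;   # succeeds, capturing the empty string

# Same pattern with "+": at least one non-"<" character is required.
my ($plus) = $input =~ /^([^<]+)/;   # fails, so $plus stays undefined

printf "star: matched=%s length=%d\n",
    (defined $star ? 'yes' : 'no'), length($star // '');
printf "plus: matched=%s\n", (defined $plus ? 'yes' : 'no');
```

Since Parse::RecDescent tries the productions of a rule in order and takes the first one that succeeds, a LITERAL that can succeed on nothing would plausibly keep TAG from ever being tried when LITERAL is listed first.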
Re: Re: Parse::RecDescent by ixo111 (Acolyte) on Oct 30, 2002 at 10:38 UTC
I believe in this case that is proper, as tags may be next to one another, separated by zero or more of anything that isn't a < .. or am i missing some subtlety there? thanks for the reply!
Re: Re: Re: Parse::RecDescent by gjb (Vicar) on Oct 30, 2002 at 19:32 UTC
The grammar already specifies that tags can be next to one another. I'm not entirely sure that Parse::RecDescent works the way I describe below, but most parser generators such as YACC and ANTLR do.

First of all, the input is split into a stream of tokens. Tokens are specified by regular expressions and are supposed to be separated by some separator (mostly whitespace). Let's assume the input to be parsed looks like:

   <tag1>text 1</tag1><tag2>text 2</tag2>

Now we would like to get the following tokens (quoted and separated by commas for legibility):

   '<tag1>', 'text 1', '</tag1>', '<tag2>', 'text 2', '</tag2>'

So as you already indicated, there are two types of tokens, tags and text, and they can be defined as you did.

   TAG:  /<(?:\/?)\w+>/
   TEXT: /[^<>]+/

Note the `+` though. Each of these token definitions will capture what we want them to capture, tags and text respectively. So by applying these definitions, we can split the input into the desired stream of tokens. This is phase one, the lexical analysis. If you're using YACC, you'll do this by using LEX.

Now for phase two: we specify how the tokens can appear in the input stream so that we consider the input "valid", i.e. conforming to the grammar. This is quite simple in this case:

   INPUT: ( TAG | TEXT )*

Now we're no longer trying to match characters, but rather tokens, so the input to this phase looks like:

   TAG TEXT TAG TAG TEXT TAG

This is a very simple grammar, so it isn't obvious here, but the result is in fact a (parse) tree that looks like:

                 INPUT
      /     /    |    |    \     \
   TAG   TEXT   TAG  TAG  TEXT   TAG

(sorry for the rather lousy graphics ;-)

If we just want to verify that the input satisfies the grammar, we're done. In general though, we want to do something with the parsed input, so we have to attach actions to the grammar nodes. Something like "add the content of the TEXT to some list" or whatever. This is the semantics of the grammar.

Now for "empty" tokens: it should now seem strange to have an empty token, since each token is some meaningful entity in the input.

I hope this clarifies matters a bit; if not, don't hesitate to ask, -gjb-
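The two phases described above can be sketched in plain Perl, without any parser generator. This is only an illustration under stated assumptions: the `tokenize` helper and the `[TYPE, value]` pair layout are hypothetical, and the two patterns are the TAG and TEXT definitions from the post.

```perl
use strict;
use warnings;

# Phase one (lexical analysis): turn the input into [TYPE, value] pairs.
# tokenize() is a hypothetical helper, not part of any module.
sub tokenize {
    my ($input) = @_;
    my @tokens;
    pos($input) = 0;
    while (pos($input) < length $input) {
        if    ($input =~ /\G(<\/?\w+>)/gc) { push @tokens, [TAG  => $1] }
        elsif ($input =~ /\G([^<>]+)/gc)   { push @tokens, [TEXT => $1] }
        else  { die "Unexpected character at offset " . pos($input) . "\n" }
    }
    return @tokens;
}

my @tokens = tokenize('<tag1>text 1</tag1><tag2>text 2</tag2>');

# Phase two: the parser sees token types, not characters.
my $stream = join ' ', map { $_->[0] } @tokens;
# INPUT: ( TAG | TEXT )* accepts any sequence of these two token types.
my $valid  = $stream =~ /^(?:(?:TAG|TEXT)(?: |$))*$/ ? 1 : 0;

print "$stream\n";
print "valid: $valid\n";
```

Because both patterns must consume at least one character, the loop always advances and no empty token can be produced.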
Re: Parse::RecDescent by princepawn (Parson) on Oct 27, 2002 at 13:34 UTC
Give these a shot: Text::Balanced or Regexp::Common::balanced

