The grammar already specifies that tags can be next to one another.

I'm not entirely sure that Parse::RecDescent works the way I describe below, but most parser generators, such as YACC and ANTLR, do.

First of all, the input is split into a stream of tokens. Tokens are specified by regular expressions and are supposed to be separated by some separator (usually whitespace). Let's assume the input to be parsed looks like:

<tag1>text 1</tag1><tag2>text 2</tag2>
Now we would like to get the following tokens (quoted and separated by commas for legibility):
'<tag1>', 'text 1', '</tag1>', '<tag2>', 'text 2', '</tag2>'
So as you already indicated, there are two types of tokens, tags and text, and they can be defined as you did.
TAG: /<(?:\/?)\w+>/
TEXT: /[^<>]+/
Note the + in TEXT, though: it requires at least one character. Each of these token definitions will capture what we want it to capture, tags and text respectively. So by applying these definitions, we can split the input into the desired stream of tokens. This is phase one, the lexical analysis. If you're using YACC, you'll do this by using LEX.
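To make phase one concrete, here is a minimal sketch of such a lexer. It is written in Python purely for illustration and is not how Parse::RecDescent does it; the names TOKEN_RE and tokenize are made up for this example, and only the two regular expressions come from the definitions above.

```python
import re

# Token definitions from the post: a tag is "<name>" or "</name>",
# text is one or more characters that are neither "<" nor ">".
# TOKEN_RE and tokenize are hypothetical names for this sketch.
TOKEN_RE = re.compile(r"(?P<TAG></?\w+>)|(?P<TEXT>[^<>]+)")

def tokenize(source):
    """Split source into (kind, lexeme) pairs; raise on input no token matches."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        # lastgroup is the name of the alternative that matched: TAG or TEXT
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

tokens = tokenize("<tag1>text 1</tag1><tag2>text 2</tag2>")
print([lexeme for _, lexeme in tokens])
# ['<tag1>', 'text 1', '</tag1>', '<tag2>', 'text 2', '</tag2>']
```

Each token carries its kind along with the matched characters, which is all the next phase needs.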

Now for phase two: here we specify how the tokens can appear in the input stream so that we consider the input "valid", i.e. conforming to the grammar. This is quite simple in this case:

INPUT: ( TAG | TEXT )*
Now we're no longer trying to match characters, but rather tokens, so the input to this phase looks like:
TAG TEXT TAG TAG TEXT TAG
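As a rough sketch of what matching on tokens rather than characters means, the rule INPUT: ( TAG | TEXT )* can be checked against a list of token kinds. This is Python for illustration only; matches_input is a made-up name, and COMMENT is a hypothetical token kind used just to show a rejection.

```python
def matches_input(kinds):
    """Return True if the token kinds match INPUT: ( TAG | TEXT )*."""
    # The rule accepts any sequence, possibly empty, of TAG and TEXT tokens.
    return all(kind in ("TAG", "TEXT") for kind in kinds)

print(matches_input(["TAG", "TEXT", "TAG", "TAG", "TEXT", "TAG"]))  # True
print(matches_input([]))                  # True: * allows zero tokens
print(matches_input(["TAG", "COMMENT"]))  # False: COMMENT is not in the grammar
```

Note that the characters inside each token no longer matter at this stage, only the token kind does.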
This is a very simple grammar, so it isn't obvious here, but the result is in fact a (parse) tree that looks like:
             INPUT
    /   /   |   |   \   \
  TAG TEXT TAG TAG TEXT TAG
(Sorry for the rather lousy graphics ;-)

If we just want to verify that the input satisfies the grammar, we're done. In general though, we want to do something with the parsed input, so we have to attach actions to the grammar nodes. Something like "add the content of the TEXT to some list" or whatever. This is the semantics of the grammar.
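The "add the content of the TEXT to some list" action could look roughly like this. Again a Python sketch under assumptions: parse_with_actions is a hypothetical name, the token list is written out by hand, and a real parser generator would let you attach the action to the grammar rule instead of hard-coding it in a loop.

```python
def parse_with_actions(tokens):
    """Match INPUT: ( TAG | TEXT )* and run an action for every TEXT token."""
    texts = []     # filled in by the action attached to TEXT
    children = []  # children of the INPUT node, in input order
    for kind, lexeme in tokens:
        if kind == "TEXT":
            texts.append(lexeme)  # the action: remember the text content
        elif kind != "TAG":
            raise ValueError(f"unexpected token kind: {kind}")
        children.append(kind)
    return ("INPUT", children), texts

# Token stream for <tag1>text 1</tag1><tag2>text 2</tag2>, written out by hand.
tokens = [
    ("TAG", "<tag1>"), ("TEXT", "text 1"), ("TAG", "</tag1>"),
    ("TAG", "<tag2>"), ("TEXT", "text 2"), ("TAG", "</tag2>"),
]
tree, texts = parse_with_actions(tokens)
print(tree)   # ('INPUT', ['TAG', 'TEXT', 'TAG', 'TAG', 'TEXT', 'TAG'])
print(texts)  # ['text 1', 'text 2']
```

The returned tuple is the flat parse tree drawn above, and the list is the result of the actions.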

As for "empty" tokens: it should now seem strange to have an empty token, since each token is some meaningful entity in the input.
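A small Python sketch of why the + matters here. It assumes the alternative would be a * quantifier in the TEXT definition; the names text_plus and text_star are made up for the comparison.

```python
import re

text_plus = re.compile(r"[^<>]+")  # TEXT as defined above: at least one character
text_star = re.compile(r"[^<>]*")  # assumed alternative: zero or more characters

source = "<tag1>text 1</tag1>"

# At the start of the input the next character is "<", so there is no text here.
print(text_plus.match(source))                # None: no TEXT token at this position
print(repr(text_star.match(source).group()))  # '': an "empty" TEXT token
```

With the * version a TEXT token can match anywhere without consuming any input, which is exactly the kind of empty token that carries no meaning.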

I hope this clarifies matters a bit; if not, don't hesitate to ask. -gjb-


In reply to Re: Re: Re: Parse::RecDescent by gjb
in thread Parse::RecDescent by ixo111
