http://www.perlmonks.org?node_id=987925


in reply to Re^3: Block-structured language parsing using a Perl module?
in thread Block-structured language parsing using a Perl module?

you have to know enough theory to know both what type of parser you can use on a grammar and if your grammar is even parsable.

Hm. That would be a justification for it, but so far, none of the modules discussed even begins to allow you to answer those types of questions.

I can't speak about the performance of Regexp::Grammars, but if I were doing something like this, I'd start there for ease of use.

Hm. I'm going through the docs for Regexp::Grammars now, and trying to do so with an open mind, but honestly, what I'm reading is making my skin crawl.

The questions I am asking myself at this point are:

I'd use Marpa for speed and completeness.

The trouble with Marpa is that it only does half the job. You have to tokenise the source text yourself, and then feed it to the parser in labeled chunks.

By the time you've written the code to tokenise the input, and then recognise the tokens so you can label them for the "parser", one wonders what parsing there is left for the "parser" to do.

It's like buying a dog and barking yourself.
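
To make that concrete, here is a minimal sketch of the division of labour with Marpa::R2's rule interface. The toy grammar, token names, and action package are my own invention, not anything from the thread:

    use strict;
    use warnings;
    use Marpa::R2;

    # Toy grammar: Expression ::= Number Op Number
    my $grammar = Marpa::R2::Grammar->new({
        start   => 'Expression',
        actions => 'My_Actions',
        rules   => [
            { lhs => 'Expression', rhs => [qw(Number Op Number)], action => 'do_expr' },
        ],
    });
    $grammar->precompute();

    my $recce = Marpa::R2::Recognizer->new({ grammar => $grammar });

    # The part you write yourself: chop the input into tokens AND decide
    # what each one is, before the "parser" ever sees it.
    for my $chunk (split ' ', '2 + 3') {
        my $type = $chunk =~ /^\d+$/ ? 'Number' : 'Op';
        $recce->read($type, $chunk);
    }

    my $value_ref = $recce->value();
    print ${$value_ref}, "\n";    # prints 5

    package My_Actions;
    sub do_expr { my (undef, $left, $op, $right) = @_; return $left + $right }

Everything interesting happens in that loop; by the time read() is called, each token has already been found and labeled.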


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Re^5: Block-structured language parsing using a Perl module?
by Anonymous Monk on Aug 17, 2012 at 22:27 UTC

    By the time you've written the code to tokenise the input, and then recognise the tokens so you can label them for the "parser", one wonders what parsing there is left for the "parser" to do.

    Apparently that is "lexing", and parsing is making sure the tokens are in the allowed order.

      Apparently that is "lexing", and parsing is making sure the tokens are in the allowed order.

      Yes. I am aware of that academic fine distinction. It is all fine and dandy in a nice, theoretical world of whitespace-delimited, single-character tokens, but it doesn't cut it in the real world as far as I'm concerned.

      In many -- arguably, even 'most' -- cases, it is not just a hell of a lot easier to work out where the next token ends if you know what (alternatives) you are expecting; it can be impossible to do so without that information.
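
      A concrete case from Perl itself (my example, not one raised in the thread): no standalone lexer can classify the / character without knowing whether the parser currently expects an operator or a term:

          my $x = $total / 2;          # '/' is the division operator
          my @fields = split /,/, $y;  # '/' begins a regex literal

      The character is identical; only the grammatical context tells you which token it starts.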

      And that means that the hand-written "lexer" you need to write in order to use a Marpa parser has to effectively replicate the state machine that Marpa constructs.

      At which point, what purpose does the parser serve?



        BrowserUk:

        One advantage of splitting the parser and lexer is that rather than having one humongous state machine that has to cover both grammar and lexing, you can split the task into two smaller machines. As you know, state machine complexity tends to grow exponentially, so two state machines, each half the size of a combined one, can be *much* more tractable.

        Another advantage of splitting is that you can use different techniques in different places. You might use a few simple regexes for your tokenizer and some recursive descent for your parser. If necessary, you can switch techniques in one or the other without rewriting *everything*.
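
        As a sketch of that split (my own toy example, not from the thread): a single regex does the tokenising, and a hand-written recursive-descent routine does the parsing; either half could be replaced without touching the other:

            use strict;
            use warnings;

            # Tokenizer: a regex classifies each chunk of "1 + 2 + 3".
            sub tokenize {
                my ($src) = @_;
                my @tokens;
                while ($src =~ /\G\s*(\d+|\+)/gc) {
                    push @tokens, $1 =~ /^\d/ ? [ NUM => $1 ] : [ OP => $1 ];
                }
                return \@tokens;
            }

            # Parser: recursive descent for the rule  Sum ::= NUM ( '+' NUM )*
            sub parse_sum {
                my ($tokens) = @_;
                my $tok = shift @$tokens;
                die "expected a number\n" unless $tok && $tok->[0] eq 'NUM';
                my $value = $tok->[1];
                while (@$tokens && $tokens->[0][0] eq 'OP') {
                    shift @$tokens;    # consume the '+'
                    my $num = shift @$tokens;
                    die "expected a number after '+'\n"
                        unless $num && $num->[0] eq 'NUM';
                    $value += $num->[1];
                }
                return $value;
            }

            print parse_sum(tokenize('1 + 2 + 3')), "\n";    # prints 6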

        The last advantage (that I'm writing about) of having the parser part is that you can more easily tie "meaning" to various locations in your grammar. For example, if you're doing some simple lexing, you might discover a symbol name. But what does the symbol *mean* once you lex it? Is it part of a declaration, a function name, a reference on the LHS, a use on the RHS? Tying the meaning to particular spots in the syntax is a pretty nice feature.
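
        A tiny illustration of that last point (mine, not roboticus's): the same identifier token plays a different role depending on which rule consumes it, and each rule is the natural place to attach that meaning:

            use strict;
            use warnings;

            my %declared;
            for my $stmt ('my x', 'x = 3', 'y = 4') {
                if ($stmt =~ /^my\s+(\w+)$/) {              # Decl   ::= 'my' IDENT
                    $declared{$1} = 1;                      # IDENT means: a new variable
                }
                elsif ($stmt =~ /^(\w+)\s*=\s*(\w+)$/) {    # Assign ::= IDENT '=' Expr
                    warn "use of undeclared '$1'\n"         # IDENT means: an existing one
                        unless $declared{$1};
                }
            }

        The regexes stand in for grammar rules; the point is that the action attached to each rule, not the lexer, knows what the identifier is for.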

        If you're keeping track of enough information in your lexer to be able to know what's going on and whether the syntax is valid, then I'd argue that you haven't written a lexer, you've written a combination lexer/parser. After all, parsing is simply recognizing correct 'statements' of the grammar, so if you can more easily merge the tokenization with the rule checks, then I wouldn't worry about the 'fancy' methods. While there's nothing wrong with that approach, it might become burdensome when the language is large/complex enough.

        By the way: I just started playing with Marpa::R2 this afternoon, after reading this thread, so I know little about it so far. But having said that, I really recommend it over Parse::RecDescent. When I tried Parse::RecDescent, it took me forever to start getting results, and the debugging capabilities really drove me mad. (Reading a Parse::RecDescent trace is *painful*.)

        But my puttering around with Marpa today was much more enjoyable. I got better results *much* more easily, and after a couple hours, I had a good start on a parser for a toy language. (I'll hopefully get to finish the parser tomorrow.) If I can get it into reasonable shape, I'll try to (remember to) post it.

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

        I'd really appreciate a little feedback on the above post. Is my logic wrong?