Re^6: Block-structured language parsing using a Perl module?

by BrowserUk (Pope)
on Aug 17, 2012 at 22:42 UTC


in reply to Re^5: Block-structured language parsing using a Perl module?
in thread Block-structured language parsing using a Perl module?

Apparently that is "lexing", and parsing is making sure the tokens are in the allowed order

Yes. I am aware of that fine academic distinction. It is all fine and dandy in a nice, theoretical world of whitespace-delimited, single-character tokens, but it doesn't cut it in the real world as far as I'm concerned.

In many -- arguably, even 'most' -- cases, it is not just a hell of a lot easier to work out where the next token ends if you know what (alternatives) you are expecting; it can be impossible to do so without said information.
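Perl itself supplies a handy example: a lexer cannot even decide where a token ends when it meets a '/' unless it knows whether the parser currently expects an operator or a term. A minimal illustration (plain Perl; the variable names are mine):

    $x = $n / 2;                # an operand precedes '/': a one-character division operator
    @fields = split /,/, $str;  # a term is expected here: '/' opens a three-character regex literal

The same character begins two different tokens of two different lengths; only the parse context can tell them apart.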

And that means that the hand-written "lexer" you need to write in order to use a Marpa parser has to effectively replicate the state machine that Marpa constructs.

At which point, what purpose does the parser serve?


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?


Downvoters: comments please! (Anonymous is fine.)
by BrowserUk (Pope) on Aug 18, 2012 at 04:18 UTC

    I'd really appreciate a little feedback on the above post. Is my logic wrong?

      Hi, I am the Everything::SockPuppet module. I've been downvoting your post because you were the lucky winner of my PRNG today!
Re^7: Block-structured language parsing using a Perl module?
by roboticus (Canon) on Aug 18, 2012 at 06:43 UTC

    BrowserUk:

    One advantage of splitting the parser and lexer is that rather than having one humongous state machine that has to cover both grammar and lexing, you can split the task into two smaller machines. As you know, state machine complexity tends to grow exponentially, so two state machines half the size of a combined one can be *much* more tractable.

    Another advantage of splitting is that you can use different techniques in different places. You might use a few simple regexes for your tokenizer and some recursive descent for your parser. If necessary, you can switch techniques in one or the other without rewriting *everything*.
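    As a minimal sketch of that split (a toy expression grammar invented here for illustration, not the language from this thread): a couple of regexes do the tokenizing, and one sub per grammar rule does the recursive descent:

        use strict;
        use warnings;

        my @tokens;

        # Tokenizer: one regex alternation covers every token class.
        sub tokenize {
            my ($src) = @_;
            push @tokens, $1 while $src =~ /\G\s*( \d+ | [-+*\/()] )/gcx;
        }

        sub peek { $tokens[0] }
        sub take { shift @tokens }

        sub expr {     # expr ::= term ( ('+'|'-') term )*
            my $v = term();
            while (defined peek() and peek() =~ /^[-+]$/) {
                my $op = take();
                $v = $op eq '+' ? $v + term() : $v - term();
            }
            return $v;
        }

        sub term {     # term ::= factor ( ('*'|'/') factor )*
            my $v = factor();
            while (defined peek() and peek() =~ /^[*\/]$/) {
                my $op = take();
                $v = $op eq '*' ? $v * factor() : $v / factor();
            }
            return $v;
        }

        sub factor {   # factor ::= number | '(' expr ')'
            my $t = take();
            if ($t eq '(') { my $v = expr(); take(); return $v }   # take() eats ')'
            return $t;
        }

        tokenize('2 + 3 * (4 - 1)');
        print expr(), "\n";    # prints 11

    Swapping the regex tokenizer for something smarter, or the descent routines for a table-driven parser, touches only that half of the program.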

    The last advantage (that I'm writing about) of having the parser part is that you can more easily tie "meaning" to various locations in your grammar. For example, if you're doing some simple lexing, you might discover a symbol name. But what does the symbol *mean* once you lex it? Is it part of a declaration, a function name, a reference on the LHS, a use on the RHS? Tying the meaning to particular spots in the syntax is a pretty nice feature.
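    To make that concrete, here are some hypothetical rules (not the grammar I show below): the lexer emits the same ('name', $text) token in all three spots, and the grammar position alone selects which action, and hence which meaning, applies:

        my @rules = (
            [ decl   => [qw( VAR name )],        'declare_symbol' ],  # a declaration
            [ assign => [qw( name OP_EQ expr )], 'assign_lhs'     ],  # assignment target
            [ factor => [qw( name )],            'fetch_value'    ],  # use of its value
        );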

    If you're keeping track of enough information in your lexer to be able to know what's going on and whether the syntax is valid, then I'd argue that you haven't written a lexer, you've written a combination lexer/parser. After all, parsing is simply recognizing correct 'statements' of the grammar, so if you can more easily merge the tokenization with the rule checks, then I wouldn't worry about the 'fancy' methods. While there's nothing wrong with that approach, it might become burdensome when the language is large/complex enough.

    By the way, I just started playing with Marpa::R2 this afternoon, after reading this thread, so I know little about it so far. But having said that, I really recommend it over Parse::RecDescent. When I tried Parse::RecDescent, it took me forever to start getting results, and it was painful, too. The debugging capabilities really drove me mad. (Reading the trace in Parse::RecDescent is *painful*.)

    But my puttering around with Marpa today was much more enjoyable. I got better results *much* more easily, and after a couple hours, I had a good start on a parser for a toy language. (I'll hopefully get to finish the parser tomorrow.) If I can get it into reasonable shape, I'll try to (remember to) post it.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      One advantage of splitting the parser and lexer is that rather than having one humongous state machine that has to cover both grammar and lexing, you can split the task into two smaller machines. As you know, state machine complexity tends to grow exponentially, so two state machines half the size of a combined one can be *much* more tractable.

      This is where I have the problem: I don't see the need for two state machines. Nor even a bigger state machine.

      The state machine built by the grammar compiler knows (should know!) everything needed and has all the state required to perform the tokenising. Nothing extra is required. Unless you force the unnecessary split between the parsing and the lexing, at which point you have to duplicate everything.

      An example. In Perl, there is an unfortunate ambiguity in the parsing of print statements:

      print ( 2 + 3 ) * 4;;
      print (...) interpreted as function at (eval 9) line 1, <STDIN> line 1.
      5

      Note: There is no error relating to the * 4 bit. Why? Because print returns a number (1 on success), which can quite legitimately be multiplied; the result is simply discarded. Wrapping it in a second print makes it visible:

      print print ( 2 + 3 ) * 4;;
      print (...) interpreted as function at (eval 10) line 1, <STDIN> line 2.
      5
      4

      With sufficient knowledge, this ambiguity could be resolved; and the parser should have enough knowledge (state) to know whether the print statement is in a void context, as in the first case above, in which case it should treat the parens as grouping rather than function args; or in a list context, as in the second example, where the latter interpretation is perhaps better.
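      (For what it's worth, the usual manual fix demonstrates the grouping reading: supply parentheses that cover the whole argument list yourself:

      print( ( 2 + 3 ) * 4 );;
      20

      No warning, because the parens now unambiguously delimit print's argument list.)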

      Marpa requires that you label the tokens when the lexer passes them to the parser (using the misnamed ->read( 'label', $value ) method). That means the lexer has to have already recognised what the token is; which means that it needs to maintain the same state as the parser in order to make that determination.
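      For reference, a minimal sketch of driving that interface (assuming a compiled Marpa::R2 grammar already in $G; the token names here are illustrative):

          use Marpa::R2;
          my $recce = Marpa::R2::Recognizer->new( { grammar => $G } );
          $recce->read( 'number', 2 );    # the lexer must already have decided
          $recce->read( 'OP_ADD', '+' );  # each token's grammar name before it
          $recce->read( 'number', 3 );    # can call read()
          my $value_ref = $recce->value;  # reference to the parse result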

      At that point, what's the purpose in passing the token to the parser? Do we need a second opinion?

      It is easy to see how Marpa can make the claim that "Marpa does not use lookahead and never backtracks.": it's because it makes the lexer do all the lookahead and backtracking! (And therefore duplicate all the state required to do so.)


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The start of some sanity?

        BrowserUk:

        I'll have to disagree with you there. I really don't see any need for the lexer to do all the machinations you're worrying about. Traditionally, all that stuff would be in the parser. For example, the lexer for my current Marpa experiments is this:

        #-----------------------------------------------------------------------------
        # LEXER section. Grab tokens and feed to the recognizer
        #
        my %keywords = map { $_=>undef }
            # Keywords for a toy language
            qw( do else end goto if last loop next print program then while until ),
            # Artificial tokens
            qw( number string )
            ;

        # Operators
        $keywords{$$_[1]}='OP_' . uc $$_[0] for (
            [lparen=>'('], [rparen=>')'],  [mult=>'*'],    [div=>'/'],
            [add=>'+'],    [subtr=>'-'],   [EOS=>';'],     [comment=>'#'],
            [EQ=>'='],     [NEQ=>'<>'],    [LT=>'<'],      [LTE=>'<='],
            [GT=>'>'],     [GTE=>'>='],    [EQADD=>'+='],  [EQSUB=>'-='],
            [EQMUL=>'*='], [EQDIV=>'/='],  [COMMA=>','],
        );

        my $FName = shift or die "Missing filename";
        open my $FH, '<', $FName;

        my $token;
        my $cnt=0;
        my $curpos=0;
        my $fl_die=0;

        OUTER: while (<$FH>) {
            s/\s+$//;
            printf "\n% 3u: %s\n", $., $_;
            pos($_)=0;
            while (!$fl_die) {
                /\G\s*/gc;
                $curpos = pos($_);
                last if $curpos>=length($_);
                ++$cnt;
                # last OUTER if $cnt>40;
                if    (/\G([-+\/*]=?|=)/gc) { $token=tk_xform('OP', $1) }
                elsif (/\G([;:,])/gc)       { $token=tk_xform('OP', $1) }
                elsif (/\G(<[=>]?|>=?)/gc)  { $token=tk_xform('OP', $1) }
                elsif (/\G(#.*)/gc)         { $token=['COMMENT', $1] }
                elsif (/\G(".*?")/gc)       { $token=['string', $1] }
                elsif (/\G(\d+)/gc)         { $token=['number', $1] }
                elsif (/\G(\w[_\w]*)/gc)    { $token=tk_xform('name', $1) }
                else {
                    $token=['ERROR','UNEXPECTED INPUT', substr($_,pos($_))];
                    ++$fl_die
                }
                print("ABEND (token #:$cnt\n") && last OUTER if $fl_die;
                next unless defined $token;

                if ($fl_trace) {
                    print "     " . (" " x $curpos) . "^";
                    no warnings;
                    if    ($$token[0] eq 'COMMENT') { print "comment (ignored)" }
                    elsif (!defined $$token[1])     { print $$token[0] }
                    else                            { print "$$token[0]=$$token[1]" }
                    print "\n";
                }
                next if $$token[0] eq 'COMMENT';

                # Feed the token into the parser
                if (@$token < 2) { push @$token, $$token[0]; }
                $P->read(@$token);
                #print "  progress: ", join(", ", map { "(".join(",",@$_).")" } @{$P->progress}), "\n";
                #print "  expected: ", join(", ", @{$P->terminals_expected}), "\n";
                $token=undef;
            }
        }

        As you can see, it's pretty straightforward, and it doesn't worry about tokens other than the current one. It's all of 70 lines, much of which is pretty-printing and diagnostics. It doesn't care a whit about parenthesis rules and such; it just chops the text into convenient tokens and passes them along to the parser.
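        (The snippet is only the LEXER section, so tk_xform, $fl_trace and the recognizer $P are defined elsewhere in the program. A plausible tk_xform, consistent with the %keywords table above, though the real one may differ, would be:

            sub tk_xform {
                my ($default, $text) = @_;
                if (exists $keywords{$text}) {
                    my $op = $keywords{$text};
                    # operators carry an OP_* name; keywords (stored as undef)
                    # become their own uppercased token name
                    return [ defined $op ? $op : uc $text ];
                }
                return [ $default, $text ];    # ordinary token: [ type, value ]
            }

        Single-element tokens then get their own name pushed as the value just before $P->read, per the 'if (@$token < 2)' line above.)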

        All the ambiguities you're wrestling with are bits of the parser. The grammar is where we throw things in to enforce token ordering, assign meaning to statements and such. So far my grammar looks like this (I'm still whacking away at my parser, so it's still in progress):

        my $TG = Marpa::R2::Grammar->new({
            start=>'FILE',
            actions=>'ToyLang',
            default_action=>'swallow',
            unproductive_ok=>[qw( FILE )],
            rules=>[
                # A file contains a PROGRAM and zero or more SUBROUTINES. The
                # subroutine definitions may precede and/or follow the program.
                [ FILE=>[qw( PROGRAM name stmt_list )], 'swallow' ], #PROGRAM FILE2)], ],
                [ FILE=>[qw( COMMENT FILE )], ],                     #stmt_list PROGRAM FILE2)], ],
                # [ FILE=>[qw( PROGRAM name stmt_list PROGRAM FILE2)], ],
                # [ FILE=>[qw( SUB name stmt_list sub FILE)], ],
                # [ FILE2=>[qw( SUB name stmt_list sub FILE2)], ],

                # A statement list consists of zero or more statements followed
                # by END. We don't care whether or not there's an end of
                # statement marker before END.
                [ stmt_list=>[qw( END )], 'discard' ],
                [ stmt_list=>[qw( stmt stmt_list_2 )], 'first_on' ],
                [ stmt_list_2=>[qw( END )], 'discard' ],
                [ stmt_list_2=>[qw( OP_EOS END )] ],
                [ stmt_list_2=>[qw( OP_EOS stmt stmt_list_2 )], 'second_on' ],

                # [ stmt=>[qw( IF expr if_body )], ],
                [ stmt=>[qw( PRINT expr print_body )], ],
                [ stmt=>[qw( WHILE expr DO do_body )], ],
                # [ stmt=>[qw( DO do_body WHILE expr )], ],
                [ stmt=>[qw( name assop expr )], 'binary_op' ],

                [ do_body=>[qw( LOOP )], ],
                [ do_body=>[qw( stmt do_body_2 )], 'first_on' ],
                [ do_body_2=>[qw( LOOP )], ],
                [ do_body_2=>[qw( OP_EOS LOOP )], 'second_arg' ],
                [ do_body_2=>[qw( OP_EOS stmt do_body_2 )], 'second_arg' ],

                [ print_body=>[qw( OP_EOS )], ],
                [ print_body=>[qw( OP_COMMA expr print_body )], 'second_on' ],

                [ expr=>[qw( term )], 'first_arg' ],
                [ expr=>[qw( expr logop expr )], 'binary_op' ],
                [ term=>[qw( term addop term )], 'binary_op' ],
                [ term=>[qw( factor )], 'first_arg' ],
                [ factor=>[qw( factor mulop factor )], 'binary_op' ],
                [ factor=>[qw( name )], 'first_arg' ],
                [ factor=>[qw( number )], 'first_arg' ],
                [ factor=>[qw( string )], 'first_arg' ],

                [ addop=>[qw( OP_ADD )], 'first_arg' ],
                [ addop=>[qw( OP_SUB )], 'first_arg' ],
                [ assop=>[qw( OP_EQ )], 'first_arg' ],
                [ assop=>[qw( OP_EQADD )], 'first_arg' ],
                [ logop=>[qw( OP_NEQ )], 'first_arg' ],
                [ mulop=>[qw( OP_MUL )], ],
                [ mulop=>[qw( OP_DIV )], ],
            ],
        });

        It's a pretty simple grammar, but it's coming along nicely. I'm guessing that when the grammar is done, it'll be about twice this size.
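        (The action names in those rules resolve to subs in package ToyLang, per actions=>'ToyLang'; Marpa::R2 calls each with a per-parse object followed by the rule's child values. The subs aren't shown here, but hypothetical minimal versions, just to indicate their shape, might be:

            package ToyLang;
            sub swallow   { shift; return [ @_ ] }    # keep all child values
            sub discard   { return }                  # contribute nothing
            sub first_arg { return $_[1] }            # value of the first child
            sub binary_op { my (undef, $l, $op, $r) = @_; return [ $op, $l, $r ] }

        The real first_on/second_on/second_arg actions presumably build list values in the same style.)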

        Writing character-by-character grammar rules to recognize numbers, keywords, strings, comments, etc. would be a pain in BNF; I don't look forward to writing a zillion productions just to specify what tokens look like. But that sort of thing is trivial for regexes, so I split the lexer out into a simple bit of regex code to create the tokens, and the grammar stays relatively straightforward too.

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.
