|Problems? Is your data what you think it is?|
A few days ago, Limbic~Region was eagerly hunting all over the place for every bit of information he could find on parsing, in particular, on Yacc-style and recursive descent parsers. He was doubting whether there is any use in trying to come up with an implementation idea of your own, and write your own parser, when such decent toolkits are readily available.
I suddenly recalled a parser that I wrote myself, in the days before I had any formal education about compilers. When later I looked into the Dragon Book, I was rather amazed that the system I had come up with, was not described. There is someting that comes somewhat in the neighbourhood, called "Operator Precedence Parsing" (section 4.6). But, at least, I do not have to invent any new weird math symbols in order to explain how it works. :)
Its purpose is to parse and calculate infix expressions, stuff like "2+3*(4+5)". As is the case with recursive descent parsing, it is easy to implement by hand — actually, easier; in fact it's so easy that I had no trouble at all implementing it in pure Z80 assembler. But, in contrast with recursive descent parsing, precedence (calculation order) rules for the various operators are defined by a precedence table: each operator has a numerical precedence value (the higher the value, the tighter it binds) and an associativity type (left or right). As a result, new infix operators can easily be added without changing any code, but by merely adding their entries to the operators table.
The grammar it parses has the basic form:
where value is something like a number or a variable's name, and op is an infix operator.
The code layout is a bit different from the above spec, and in pseudocode, looks like this:
process() uses a stack to hold intermediate values and operators. In addition, there are a few variables holding the last value that got parsed or calculated ($value), and the last operator that was just seen ($op).
Lexing (tokenizing) the data happens in 2 independent lexers: one for the core values, and one for the operators. Which lexer is used depends on whether an operator or a value is expected. That way you can reuse the same symbol for different roles without conflict, such as using "-" for a subtraction operator, and as an unary minus.
If a value is expected, either at the start of an expression, or following an infix operator, and none is recognized, a parsing error is raised.
If an operator is expected and none is seen, it is merely assumed that the end of the expression has been reached, and the parser returns the calculated value, after completing any postponed operations. The input pointer remains where it is, and you can continue parsing using any other means, where it stopped.
Processing happens like this: the precedence of the last recognized operator is compared to the precedence of the operator on top of the stack. If it's higher, the current value and operator are pushed onto the stack and the next item will be parsed first; if it's lower, the previous postponed calculation will be performed first, possibly repeated for any older values and operators on the stack. Finally, the final result, and the operator, are pushed onto the stack.
You can compare this to the LALR (Yacc style) parsers, where people talk about shift (= push) and reduce (= calculate).
If their precedence is the same, the associativity of the operator (on the stack) determines whether a calculation will be performed first (left), or postponed (right). I strongly recommend against ever using different types associativity for the same precedence level, so it doesn't matter which operator's associativity you use.
That's it. That's the whole parser.
How about nested parens? Simple, use the parsing rule
That is, if you recognize an opening paren, recurse into parsing an expression, end by trying to match a closing paren. (As a closing paren is not recognized as an operator, hence the expression parser will just return when it gets there.)
This way, you can nest parens to any depth.
And now: code. This first implementation stays close to the theoretical description. A lot of the following code is not necessary, as it serves to show what is going on.
As an implementation, it's not ideal: it contains code duplication, the precedence/associativity test is quite verbose, and it's not machine code friendly (low on resources, using as few variables and datastructures as possible), due to the local stacks in the expression parser. That all can be remedied using a few simple tricks.
While processing the data on the stack, you do not have to check the stack depth all the time, as you can use a sentinel value on the stack instead. Just make sure the precedence values of all operators are > 0, and give the sentinel a precedence value of 0, lower than any precedence values for the operators, then every calculation will be performed before parsing ever finishes, while the sentinel just sits safely on the stack. This way, we can even use a global stack, such as the return stack when coding in assembler, in a perfectly safe way.
Associativity can be tackled by pretending a left associative operator on the stack has a higher precedence than the same operator that just got parsed, while a right associative operator has a lower or equal (thus, not higher) precedence on stack than by default, and then the calculation will be postponed. In practice, we just have to make sure the precedence values in the table are even, and increment on the precedence value, just before it gets pushed onto the stack, for a left associative operator.
In the follwoing code, only the sub parse_expr has been changed, but I've posted the whole code to ease copy/run on the program.
Extending the parser
Extending the parser to recognize new syntax is easy enough. For example, if you have it recognize a name followed by an opening paren, you can treat this as a prefix function call:
(You will have to recognize the opening paren before you assume a name is just a variable)
One by one, parse (and calculate) the expressions in the arguments list, putting the results onto a stack, and finally, call the associated function.
If you want to recognize and handle assignments to variables in the form of "a=expr", then you'll have to implement a way to have lvalues. I can think up 2 basic routes from the top of my head. One is to use Perl voodoo, and have a variable return an object that stringifies/nummifies to the variable's current value, but which you can also use to modify its value. The other is to always return a record structure for the values.
Extending the idea of lvalues to pre-/post- -increment/-decrement, you'll probably have to forbid "++" and "--" for any other use, so "--2" is no longer equivalent to 2.
It'd be very nice to recognize advanced syntax structures, such as "1 < x < 2", in a more programmer-friendly rather than computer-friendly way, meaning "x > 1 and x < 2". I think it's doable, as we're comparing precedence of the two "<" operators anyway, but it'll have to do some exceptional processing for that particular operator.
Other special cases, like optional operators, so that "2a" could mean "2*a</>", or "foo bar" meaning "<c>foo AND bar" for a search engine, are possible in principle, at the cost of extra complexity.
Instead of just calculating a value on the fly, the parser can easily be modified to return a parse tree, or produce postfix code instead.
Application and embedding
This is the basic parser, which, because it doesn't have to recognize everything (it just stops at the first thing it doesn't recognize, it's perfectly happy if it just recognized a valid expression as a prefix), can easily be embedded in other parsers. For example, you can write a handcoded parser to recognize basic command statements, control structures, etc. where this (sub)parser handles expressions, which most programmers find the hardest things to tackle, anyway.
Like I said, I invented this parsing method myself (some 15 or more years ago), and you're free to use any code, any derived code, or any reimplementations as you see fit. I'd just like some credit if you do so, so please don't pretend you invented it yourself. (Let's say it's under a BSD-style license.) Just, no patenting, or any other such crap.
update: You might want to take a look at my followup where I show a version that returns a parse tree, in the form of Function objects.