Creating parser for some syntax
by andal (Friar) on Nov 14, 2011 at 12:22 UTC
There are a number of applications out there that use scripting languages like S-Lang, Lua, or even Perl and Python for configuration. Effectively, this style of configuring can be considered RPC, but in practice real RPC very often does not need such a variety of features, and some of them can be dangerous. For example, it is no good to have a tight loop started by some RPC request, or some uncaught run-time exception. Even for configuration, the use of a full-scale scripting language might be overkill.
So, it might be desirable to create a parser for a somewhat limited scripting language. I believe that recursive parsing allows the creation of parsers that are modular and simple to extend and configure. Below is a description of the approach I've taken when creating such a parser.
First of all, let's look at what we would like to have.
Let's say a program is a sequence of expressions. Potentially the expressions can be executed out of sequence, or execution can be stopped. So at the top level the parser shall produce objects that are chained together and provide some method for activating them at run-time. This method may return an indication of a desired jump.
Note. If the development were done in C, one would also have to worry about memory management, but this is easily solved by adding one more method for releasing the object.
In Perl the objects corresponding to expressions need to provide only one method, say "execute". This method performs the desired action and may return the address of the expression where execution shall continue. We just need to define what the "address" returned from "execute" is. In real life, an expression may contain other chains of expressions, so when performing a jump it is important to know not only the address of the desired expression, but also the address of the chain to which this expression belongs. This way, the actions associated with the execution of one chain can be cleanly finished and the execution of another chain started. In practice, this approach makes it impossible to jump from an outer block of expressions into an inner block, but after all the parser is supposed to be simple. Besides, how often in RPC does one really need to jump into inner blocks of code?
So, we may require that the method "execute" of an expression object returns either an empty list or two values: the first identifying the block where execution shall continue, and the second identifying the expression in that block.
Finally, to execute the code we walk through our chain of expressions and call the method "execute" of each object. We walk until some object indicates that a jump is needed. At this point we compare the address of the destination block with our own block. If they are the same, we continue with the specified expression. If they are different, we simply finish walking and pass the jump information to the caller.
As you can see, at this level we don't care what is done inside the expression objects. We also don't care where those objects come from. All we need is support for the described interface.
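The walk described above could be sketched roughly as follows. The package names Expr and Block, the callback-based expression, and the use of an array index as the "address" of an expression are assumptions made for illustration, not part of the original design.

```perl
use strict;
use warnings;

# Hypothetical expression object: runs a callback that may request a jump.
package Expr;
sub new     { my ($class, $code) = @_; return bless { code => $code }, $class }
sub execute { my ($self) = @_; return $self->{code}->() }

# Hypothetical block: an ordered chain of expression objects.
package Block;
use Scalar::Util qw(refaddr);
sub new { my ($class, @exprs) = @_; return bless { exprs => [@exprs] }, $class }

# Walk the chain. A jump is the pair (block, index). A jump into this
# block is handled here; a jump to any other block is passed to the caller.
sub execute {
    my ($self) = @_;
    my $i = 0;
    while ($i < @{ $self->{exprs} }) {
        my ($block, $index) = $self->{exprs}[$i]->execute;
        if (!defined $block) { $i++; next }              # no jump requested
        return ($block, $index)
            if refaddr($block) != refaddr($self);        # not ours: stop walking
        $i = $index;                                     # ours: continue there
    }
    return;                                              # empty list: no jump
}

package main;

my @log;
my $count = 0;
my $block;
$block = Block->new(
    Expr->new(sub { push @log, "start"; return }),
    Expr->new(sub { push @log, "tick " . ++$count; return }),
    # Jump back to expression 1 of the same block until three ticks are logged.
    Expr->new(sub { return $count < 3 ? ($block, 1) : () }),
    Expr->new(sub { push @log, "done"; return }),
);
$block->execute;
print "$_\n" for @log;
```

Since Block itself provides "execute" with the same return convention, a block can be placed inside another chain as an ordinary expression, which is how a jump out of an inner block reaches the outer one.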
Now, let's return to the parser itself. We want it to be simple, so we just imagine that there is a chain of objects that look at the input text and either succeed or fail at producing our expression object. So effectively, our parser just maintains a chain of objects and during parsing activates them in turn until one of them produces the desired result; after that the procedure repeats. The parsing objects simply advance the input past the text they have recognized.
So far, so good. But where is the recursive parsing? Of course, inside the parsers that produce expression objects. Let's look at an expression that outputs a few strings. This expression may look like this
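A minimal sketch of such a top-level loop is shown below. It assumes that each parsing object is a plain code reference that receives a reference to the input and consumes what it recognizes. The Parser package and the two toy rules ("beep;" and "nop;") are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical top-level parser: tries each registered rule in turn.
package Parser;
sub new { my ($class, @rules) = @_; return bless { rules => [@rules] }, $class }

# Each rule takes a reference to the input text. On success it removes the
# recognized text from the input and returns an object; on failure it
# returns undef and leaves the input alone.
sub parse {
    my ($self, $text) = @_;
    my @chain;
    while ($text =~ /\S/) {
        $text =~ s/^\s+//;                      # skip leading whitespace
        my $obj;
        for my $rule (@{ $self->{rules} }) {
            $obj = $rule->(\$text) and last;    # first rule that succeeds wins
        }
        die "Syntax error near: $text\n" unless $obj;
        push @chain, $obj;
    }
    return @chain;
}

package main;

# Two toy rules, assumed only for illustration.
my $beep = sub { my ($in) = @_; return $$in =~ s/^beep\s*;// ? { op => 'beep' } : undef };
my $nop  = sub { my ($in) = @_; return $$in =~ s/^nop\s*;//  ? { op => 'nop'  } : undef };

my @chain = Parser->new($beep, $nop)->parse("nop; beep; beep;");
print join(' ', map { $_->{op} } @chain), "\n";
```

Adding new syntax here means pushing one more rule onto the list; the loop itself does not change.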
Written in BNF
So, the parser shall try to find the word "output" at the current position of the input. If it is not found, no expression is produced. If the word is found, the parser produces the expression object. Internally, the expression executor needs more objects which produce the actual strings to be output, those corresponding to the non-terminal "string-val". Here we recursively repeat everything described above, but this time applied not to "expression" but to "string-val".
At the end of this process we have a parser which simply maintains a few collections of parsing modules that produce objects (opcodes) implementing some non-terminal. So creating some syntax is as simple as adding a parser to the appropriate collection. Each parser is very simple, since in the normal case it only checks for the presence of some entry point in the text and then lets the parsers for other non-terminals produce the objects that serve as its arguments.
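As a rough illustration of this recursion, the sketch below assumes a concrete syntax of the form output "a", "b"; with two kinds of "string-val" (a quoted literal and the keyword newline). That syntax, the %rules table and the parse_nt helper are all assumptions. For brevity the opcodes are closures instead of objects with an "execute" method, and the "output" opcode returns the text instead of printing it.

```perl
use strict;
use warnings;

# Collections of parsing subs, one per non-terminal (names are assumptions).
my %rules = (expression => [], 'string-val' => []);

# Try every parser registered for a non-terminal; return the first opcode.
sub parse_nt {
    my ($nt, $in) = @_;
    $$in =~ s/^\s+//;
    for my $rule (@{ $rules{$nt} }) {
        my $op = $rule->($in);
        return $op if $op;
    }
    return;
}

# string-val: a double-quoted literal without escapes.
push @{ $rules{'string-val'} }, sub {
    my ($in) = @_;
    $$in =~ s/^"([^"]*)"// or return;
    my $str = $1;
    return sub { $str };
};

# string-val: the keyword "newline".
push @{ $rules{'string-val'} }, sub {
    my ($in) = @_;
    $$in =~ s/^newline\b// or return;
    return sub { "\n" };
};

# expression: "output" string-val { "," string-val } ";"
push @{ $rules{expression} }, sub {
    my ($in) = @_;
    $$in =~ s/^output\b// or return;            # entry point not found
    my @args;
    do {
        # Recursion: let the "string-val" parsers produce the arguments.
        my $arg = parse_nt('string-val', $in) or die "string-val expected\n";
        push @args, $arg;
    } while ($$in =~ s/^\s*,//);
    $$in =~ s/^\s*;// or die "';' expected\n";
    return sub { join '', map { $_->() } @args };   # opcode: build the text
};

my $input = 'output "Hello", ", ", "world", newline;';
my $op    = parse_nt('expression', \$input);
print $op->();
```

The "output" parser knows nothing about how a string value is written. It only asks the "string-val" collection for opcodes, so a new kind of string value becomes available to it as soon as its parser is registered.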
Once the API for each of the non-terminals is defined, it becomes very easy to reuse parsing modules to implement different syntaxes. If one adds the requirement that each parsing object provide a method describing the syntax it parses, it also becomes easy to obtain a description of the syntax supported by the resulting parser.
In real life, there are a few other things that have to be taken care of, but all of them can be solved using the above approach. For example, support for variables can, in the simplest case, be implemented by a parser that maintains the names of all variables of a given type. This parser checks whether the input contains a known name and produces an opcode that returns the value from the slot associated with that name. Name scoping, if desired, can be implemented by supporting triggers associated with starting and stopping the parsing of an expression block.
I've created a practical example of such a parser in just a few hours. If anyone would like to look at it, I can send it.
To me, this approach to providing RPC appears much simpler and more effective than, for example, SOAP.
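One way the self-describing requirement might look is sketched here. The module names, the "nonterminal" and "describe" methods, and the syntax strings they return are hypothetical and serve only to show the idea.

```perl
use strict;
use warnings;

# Hypothetical parsing modules: each knows the non-terminal it implements
# and can describe its own syntax (parsing methods omitted for brevity).
package OutputRule;
sub new         { return bless {}, shift }
sub nonterminal { 'expression' }
sub describe    { 'output <string-val> {, <string-val>} ;' }

package LiteralRule;
sub new         { return bless {}, shift }
sub nonterminal { 'string-val' }
sub describe    { '"text"' }

package main;

# Build a BNF-like summary by asking every registered module for its syntax.
sub grammar {
    my (@modules) = @_;
    my %alts;
    push @{ $alts{ $_->nonterminal } }, $_->describe for @modules;
    return join '',
        map { "<$_> ::= " . join(' | ', @{ $alts{$_} }) . "\n" } sort keys %alts;
}

print grammar(OutputRule->new, LiteralRule->new);
```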
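The variable support described above might be sketched like this. The StringVars package, its "declare" method and the identifier pattern are assumptions; the opcode is again a closure for brevity.

```perl
use strict;
use warnings;

# Hypothetical parser for string variables: it knows the declared names
# and keeps one value slot per name.
package StringVars;
sub new { return bless { slots => {} }, shift }

sub declare {
    my ($self, $name, $value) = @_;
    $self->{slots}{$name} = $value;
    return;
}

# Parsing method: succeed only if the input starts with a declared name.
sub parse {
    my ($self, $in) = @_;
    my ($name) = $$in =~ /^([A-Za-z_]\w*)/ or return;
    return unless exists $self->{slots}{$name};
    substr($$in, 0, length($name), '');        # advance past the name
    my $slot = \$self->{slots}{$name};
    return sub { $$slot };                     # opcode: read the slot at run time
}

package main;

my $vars = StringVars->new;
$vars->declare(greeting => 'hello');

my $input = 'greeting, "x"';
my $op    = $vars->parse(\$input);
$vars->declare(greeting => 'goodbye');         # value changes after parsing
print $op->(), "\n";                           # the opcode sees the current value
```

Because the opcode holds a reference to the slot and not a copy of the value, it returns whatever the variable contains at execution time.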