Beefy Boxes and Bandwidth Generously Provided by pair Networks DiBona
Perl Monk, Perl Meditation
 
PerlMonks  

Creating parser for some syntax

by andal (Friar)
on Nov 14, 2011 at 12:22 UTC ( #937933=perlmeditation: print w/ replies, xml ) Need Help??

There are number of applications out there that use scripting languages like S-Lang, Lua, or even Perl and Python for configuration Effectively, this style of configuring can be considered an RPC, but practically, real RPC very often does not need such variety of features, since it can be dangerous. For example, it is no good to have a tight loop initiated by some RPC request, or some uncaught run-time exception. Even for configuration use of full-scale scripting language might be an overkill.

So, it might be desirable to create parser for somewhat limited scripting language. I believe, the recursive parsing allows creation of modular and simple to extend and to configure parsers. Below is the description of approach I've taken when creating such parser.

First of all, let's look at what we would like to have.

Let's say a program is sequence of expressions. Potentially the expressions can be executed out of sequence, or execution can be stopped. So the parser on the top-level shall have objects that are chained together and provide some method for activating them at run-time. This method shall possibly return indication for desired jump.

Note. If the development was done in C, then one would also have to worry about memory management, but this is easily solved by adding one more method for releasing the object.

In perl the objects corresponding to expressions shall provide only one method, say "execute". This method shall do desired action and maybe provide an address for the expression which shall continue execution. We just need to define what is the "address" returned from "execute". In real life, some expression may contain other chains of expressions, so, when performing a jump, it is important to know not only the address of desired expression, but also the address of chain to which this expression belongs. This way, the actions associated with execution of one chain can be cleanly finished and the execution of another chain activated. In practice, this approach makes impossible jumping from outer block of expressions to inner block, but after all we create parser supposed to be simple, plus how often in RPC one really needs to jump into inner blocks of code?

So, we may require, that the method "execute" of an expression object returns either empty list, or 2 values, first value identifying block where the execution shall continue and the second value identifying expression in that block.

Finally, to execute the code we shall walk through our chain of expressions and call method "execute" of each object. We walk until some object indicates that jump is needed. At this point we compare address of destination block with our own block. If they are the same, then we continue with specified expression. If they are different, then we simply finish walking and pass jump information to the caller.

As you can see, at this level we don't care what is done inside of expression objects. We also don't care, where those objects come from. All we need is support for the described interface.

Now, let's return to the parser itself. We want it to be simple, so we just imagine, that there is chain of objects that may look at the input text and either succeed or fail at producing our expression object. So effectively, our parser should just maintain a chain of objects and during parsing simply activate all of them in turn until one of them produces desired result, after that the procedure repeats. The parsing objects should simply advance the input past the text they have recognized.

So far, so good. But where is the recursive parsing? Of course inside of parsers that produce expression objects. Let's look at an expression that outputs few strings. This expression may look like this

output "Hello there", friend_name

Written in BNF

<expression> ::= output <string-val> (',' <string-val>)*

So, the parser shall try to find word "output" at the current position of the input. If it is not found, then no expression is produced. If the word is found then the parser shall produce the expression object. Internally, expression executor needs more objects which shall produce for it actual strings to be output, those corresponding to non-terminal "string-val". Here we shall recursively repeat everything described above, but this time applied not to "expression" but to "string-val".

At the end of this process we shall have parser which simply maintains few collections of parsing modules that produce objects (opcodes) implementing some non-terminal. So creating some syntax shall be as simple as adding a parser to appropriate collection. Each parser shall be very simple since in normal case it shall only check presence of some entry point in the text and then let parsers for other non-terminals to produce objects that serve as arguments for this parser.

Once the API for each of non-terminals is defined it becomes were easy to reuse parsing modules to implement different syntaxes. If one adds here the requirement for each parsing object to provide method describing the syntax it parses, then it becomes easy to obtain the description of syntax supported by resulting parser.

In real life, there are few other things that have to be taken care of. But all of them can be solved using the above approach. For example support for variables, in the simplest case, can be implemented by parser that maintains names of all variables of this type. This parser checks if the input contains appropriate name and produces opcode that would return value from the slot associated with that name. Name scoping, if desired can be implemented by supporting triggers associated with starting/stopping parsing of expressions block.

Practical example of such parser I've created in just few hours. If anyone would like to look at it I can send it.

To me, this approach to providing RPC appears much more simple and effective than for example SOAP.

Comment on Creating parser for some syntax
Select or Download Code
Reaped: Re: Creating parser for some syntax
by NodeReaper (Curate) on Nov 14, 2011 at 12:54 UTC
Re: Creating parser for some syntax
by moritz (Cardinal) on Nov 14, 2011 at 13:32 UTC
    Effectively, this style of configuring can be considered an RPC

    You've lost me there already. Could you please explain what commonalities you see between RPC (which I assume means "Remote procedure call") and configuration by embedding a scripting language?

    And since you mention danger later on, I see a huge difference here. In the RPC situation, both the caller and the callee have to guard against malicious incoming data. In the case of configuration via scripting languages, the attitude is often "if you screw up your program in the configuration, that's your problem", since the one who writes the configuration is usually also the user of the system, or a trusted party (like an administrator).

    Now, let's return to the parser itself. We want it to be simple, so we just imagine, that there is chain of objects that may look at the input text and either succeed or fail at producing our expression object. So effectively, our parser should just maintain a chain of objects and during parsing simply activate all of them in turn until one of them produces desired result, after that the procedure repeats. The parsing objects should simply advance the input past the text they have recognized.

    That sounds rather similar to how Perl 6 regexes are implemented. There each call to a subregex returns a "Cursor" object, which holds a reference to the original string, knows where the last regex call left off, and a few other pieces of data.

    If a regex can match in several ways (ie if the engine can backtrack over that regex), a lazy list of cursors is returned instead of just a cursor. For parsing simple languages you can often get very far without ever using backtracking.

    One thing that your scheme neglects is dealing with whitespace and comments. If you look at your examples again:

    output "Hello there", friend_name
    Written in BNF
    <expression> ::= output <string-val> (',' <string-val>)*

    You'll notice that there's no rule for parsing the whitespace in the input.

    There are three possible solution: Writing a lexer that deal with the whitespace, taking care of whitespace in each rule (clutters the grammar), or having data-driven parsing rules where another piece of code adds code for handling whitespace.

    To me, this approach to providing RPC appears much more simple and effective than for example SOAP.

    You've lost me again. You've described a parser (in the context of RPC), and now you say it's simpler than SOAP. But a parser itself can never replace a complete RPC protocol (unless your previous use of RPC was complete overkill).

    You need a network layer, serialization and deserialization (of which parsing is only a part), error handling, security layers (authentification and authorization, signing request) and so on.

    So, please describe your use case a bit more, because in the general case I have no idea how parsing can replace SOAP (or any RPC really), and I don't see the connection between configuration and RPC either. But I do think you're onto something interesting.

      For parsing simple languages you can often get very far without ever using backtracking.
      Actually, that's also the case for non simple languages. Perl usually isn't qualified as a simple language, yet parsing it doesn't require backtracking. Many languages, including Perl, use a grammar that requires one token to look ahead -- although sometimes perl cheats and scans (but not tokenizes) the stream ahead.

      Limited lookahead parsing means that your grammar is written in such away that the compiler only needs to look at a fixed number of tokens before it can decide which grammar rule it has to apply to continue parsing.

      If I were to make a language, be it an embedded one or a standalone, I would want the compiling phase reasonable fast. I certainly don't want any backtracking, so a full blown regexp would be out. I may use a bunch of simple regexes in the tokenizer, but the main parsing loop should be table driven or be expressable in a state machine. But no backtracking.

      I found it very difficult to understand the OP, but I do believe, the OP is in the process of inventing "Higher-Order Perl"

      Ch2 especially explains a dispatch-table based config file parser

      You've lost me again. You've described a parser (in the context of RPC), and now you say it's simpler than SOAP. But a parser itself can never replace a complete RPC protocol (unless your previous use of RPC was complete overkill).

      Well, you are right, parser does not replace everything involved into RPC. Still, for the network layer we need only serialization/deserialization. The rest of stuff (authentication, encryption and so on) can be part of the RPC code itself. Nothing prevents you from allowing only authentication procedure for the first requests and then extending/replacing parser for the subsequent requests.

      Serialization can be very simple. For example something similar to Chunked Transfer of HTTP. First comes line with length, then specified number of bytes, finally comes the with zero to indicate the end of request/response.

      Please don't consider my post as an attempt to teach how to do GOOD compiler. It was an attempt to describe some way to create SIMPLE parser. This parser will be inefficient, since efficiency usually comes at the cost of complexity. But for one thing, working with SOAP is neither efficient, nor simple.

      SOAP is an attempt to make one thing suitable for all possible usages. I believe that it is more appropriate to create tools fitting specific needs and I'm searching for ways to make this approach simple.

Re: Creating parser for some syntax
by sundialsvc4 (Monsignor) on Nov 15, 2011 at 12:27 UTC

    I admit that I only made it through about three paragraphs before I stopped in confusion and said ... “Parse::RecDescent.”

    Any parser gives you the capability of processing an arbitrarily complex (if well-defined) input stream, using what is not an excessively-obfuscatory program.   (The voodoo magic is in the parser itself, but you don’t have to look at that.)

    So I guess I’m saying (and, mind you, I am saying it very politely), is:   “I don’t get it.   What is your point?   None of this is ‘the undiscovered country.’   There are plenty of parsers out there.   Plenty of good ones.   Nothing to invent here ... move along ... move along ...”

      I don't doubt that there are existing libraries and modules that allow one to create some parser. The question is always "how suitable are those libraries for my needs?". After struggling through documentation for few libraries just to find out that they don't completely fit, I came to conclusion that it is simpler to create my own parser. So if you like it, in my post I was trying to say that creating a parser is not very hard and in certain cases it makes no sense to waste your time on understanding how some existing parsers work, it is better to use it for creating the parser that perfectly suits your needs.

      I hope this clarifies your confusion :)

Reaped: Re: Creating parser for some syntax
by NodeReaper (Curate) on Nov 15, 2011 at 13:33 UTC

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://937933]
Approved by ww
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (4)
As of 2014-04-17 05:00 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (439 votes), past polls