Perl Monk, Perl Meditation | |
PerlMonks |
Re: How would you parse this? (rec-descent)by tye (Sage) |
on Oct 26, 2013 at 08:44 UTC ( [id://1059809]=note: print w/replies, xml ) | Need Help?? |
After reading the replies, I think that I may have a feel for what you are hoping to do. My initial response to that desire was pretty much "Use any of the standard techniques that they taught in 'Compiler Writing' class". So the rest of what I'm writing is based on the supposition that you have never taken such a class. If you have, then I'm pretty sure that I've seriously misunderstood what you are asking for and you should probably just ignore this reply. Though, I end up advising against most of those standard techniques below. My guess is that the approach that would most easily suit you is writing a recursive-descent parser. I wouldn't suggest that you use a framework (such as Parse::RecDescent or marpa). There are technical reasons why such frameworks can have disadvantages (that I find many gloss over in trying to discourage the re-inventing of wheels by those just wanting to move their vehicle and not focused on and perhaps not well suited for actual designing of excellent wheels). Some of the reasons have to do with personality traits that I suspect that you share with me. But the biggest reason is that it just really isn't terribly hard to write a decent rec-descent parser from scratch, especially in Perl. I'm pretty sure that your ample general programming skill is well above that required to succeed at this. There is a chance that there will be certain aspects of the task that will seem foreign, but I don't expect many or significant difficulties greater than that for you. I'm not going to try to teach you (that would surely end badly) or even other readers how to write such a parser. I don't have any handy links to particularly good resources or tutorials that cover this topic (and I suspect such might not be a great route forward for this case anyway). I'm not going to try to write such a parser in this node. But I have recently been looking at JSON::Tiny because it actually comes pretty close to how I would have implemented a JSON parser. Now, JSON is likely trivial compared to what you are hoping to parse. But JSON::Tiny is a decent example of how you build a rec-descent parser. So go browse http://cpansearch.perl.org/src/DAVIDO/JSON-Tiny-0.35/lib/JSON/Tiny.pm (the parts that have to do with parsing, of course). You'll see the basic idea of having a function whose job is to consume the text that represents some structure of your grammar. And that function calls other functions that consume sub-structures. JSON::Tiny's _decode_value(), _decode_array(), _decode_object(), _decode_string(), and decode() are just such subs. And they are simple and working code that parses something that is easy to understand and so I hope they make it easy to "get" the concept of rec-descent parsing. So you'll write _parse_program() which will need to call things like _parse_include() (my guess at what the first few lines in your example represent) and _parse_statement(). For most languages, the structure of each such routine ends up closely matching a BNF definition in a grammar describing your language. So you may end up writing something like BNF as you work out the definition of this language and/or the implementation of your parser. The way that writing working code can also serve to create/refine the definition/design of your grammar is part of why I think this approach may work well for you. When I try to reverse-engineer what are the key markers that indicate that those first few lines are supposed to be the equivalent of "#include" or use lines/statements, I think I'm close to guessing wildly. In C, you know such a thing because the lexer gave the following tokens in exactly this order: start-of-line, '#', 'include'. In Perl, you know such because you were expecting the start of a statement when you found the token "use". These are typical of how such items of syntax are defined and parsed in an LALR(1) language or similar language. "LALR(1)" just means that you can tell what type of broad syntaxy construct you have started parsing by strictly moving forward in the text and only looking at the next 2 tokens (Looking Ahead, Left-to-Right, 1 token past the next). You never have to "back up" more than one token. That's why a C parser needs to know which words represent data types vs. other things and why Perl has prefixing keywords like 'sub'. So you may not be dealing with a language that fits in the LALR(1) world. This is another reason why an existing tutorial or existing framework isn't what I'm recommending. I won't discourage you from "straying" beyond LALR(1). In writing the parser, you'll see where the language is "too ambiguous" and adjust it.
One of the challenges in writing a parser (including rec-descent), is how to implement "consume the text". The classic solution is to write a lexer that returns tokens and give yourself a way to look at the next 2 tokens (without consuming them) or a way to "unconsume" at least the last The easiest solution is what JSON::Tiny does: read the whole document into a buffer and run regexes over it always using /\G.../gc in a scalar context (so pos tracks your progress in "consuming"). You may want to, fairly early in the development, encapsulate the applying of /.../gc to the buffer if you ever expect to parse documents that don't easily fit in RAM. I also recommend completely avoiding capturing parens as they kill the performance of this approach, as can be seen by using Parse::RecDescent. Nevermind! (see replies) At long last, I'll give you some custom concrete code to demonstrate one place JSON::Tiny gets this wrong (IMHO) and how I'd do it better. JSON::Tiny includes:
Disadvantages of that code:
A big improvement, IMHO, is more like:
Then go through parsing, validating, replacing, and complaining about escapes after that. If you decide to deal with "won't fit in memory" documents, then parsing string literals might look more like:
Here's a stab at most of the "consume the document stream" functions used above (nothing tested, in case that wasn't as obvious as I think it would be):
I sincerely hope some of that was helpful. - tye
In Section
Seekers of Perl Wisdom
|
|