Clear questions and runnable code get the best and fastest answer |
|
PerlMonks |
Parsing with Regexes and Beyondby moritz (Cardinal) |
on Jun 06, 2008 at 07:02 UTC ( [id://690624]=perltutorial: print w/replies, xml ) | Need Help?? |
The topic of parsing comes up regularly. Most texts on parsing are either very theoretical or bound to a particular tool. I'll try to introduce parsing techniques that are well suited to be used with perl, and give just a bit of theoretical background. Intended audience: Perl hackers that can already work with regular expressions, but don't have any formal Computer Science eduction, and want to learn a bit more about the computer scientist's view on parsing, and how to parse simple, recursively defined structures. This tutorial was first posted here, and since then has been slightly improved based on the great feedback in that discussion. Theoretical background: ParsingParsing is the process of matching a text against a formal specification. In computer science the most important question is if the text can be derived from the specification. In practice it is equally important to extract information according to the formal specification, like capturing part of a regex, or building a parse tree from a grammar. Theoretical background: Regular ExpressionsThe name "Regular Expression" comes from computer science (I'll call it CS-Regex for disambiguation), and in CS it has a different meaning than in Perl. CS-Regexes consists of concatenation, alternation, repetition (with *) and grouping. Perl regexes add many features that don't fit into the computer scientist's view on Regexes, like lookaround, capturing groups and backreferences, zero width assertions (like \b) and so on. So CS-Regexes are not as powerful as Perl Regexes, but they have a big advantage: it's easy to build a machine that performs a CS-Regex match. Regular expressions are well suited to match strings with short, fixed patterns, but they can't be used for any type of contstructs that uses nesting, like parentheis in arithmetic expressions, nested XML tags and the like. (You can do this with (??{ .. }) constructs in 5.8 and with recursive regexes in 5.10, but both are hard to write and maintain and generally a pain.) CS-Regexes can be matched with a simple state machine called DFA (deterministic finite automaton), which consists of a bunch of states, and for each character from the input it is decided which state should be the next. So matching a string against a regex takes linear time (compared to the length of the string). Context Free LanguagesIf you want to parse something like arithmetic expressions or a programming language, you need one more feature in your description language: recursion. With recursion you can formulate grammars like this:
The first two definitions are recursive, so they are not "regular" any more. The computer scientist calls this a CFL, a "context free language". Symbols like term are called non-terminals, while '*' represents a terminal symbol, i.e. one that appears verbatimely in the input. Suppose our input string is '2*3+4', the derivation goes like this:
You can see that our short grammar matches the given expression, but not quite the way we'd like to have it. In normal notation '*' has tighter precedence than '+', if we would evaluate our parse tree it would first evaluate 3+4, and then multiply the result by 2. That's not very surprising because our grammar didn't contain any precedence information. We can fix this:
Now we can derive '2*3+4' this way:
(Some lines contain a few derivations in parallel, just to save space. Each production that doesn't result in a terminal symbol produces a pair of parentheses to emphasize the resulting tree structure.) The introduction of separate precedence levels for addition and multiplication leads to a correctly nested parse tree. LexingMost compilers don't parse their input in one pass, but first break it up into smaller pieces called "tokens". The process is called "lexing" or "scanning". There are two important reasons for that:
Lexing is normally done with a DFA, but since we're focusing on perl we can just as well use regular expressions:
The token definition consits of a name for the token, a regular expression and optionally a flag that indicates that the token shouldn't appear in the output. The program loops while the end of the string is reached, trying to match all regexes in turn. As soon as a match is found, the token is pushed onto the @tokens array. (In a real compiler it would also store the current position to allow better error messages on wrong input). The /c modifier on the regex takes care that a failed match doesn't reset pos($input). The \G assertion anchors the regex to the end of the previous match. Implementing the parserFor the parser we need two utility functions:
The production rules can be translated very easily to functions:
You can see that branches (like in factor) and repetition (in expression and term) use a lookahead of one token to decide which branch to take. Fortunately we never have to backtrack, we're always in the right branch. Enough talking, let's test it:
Woot, that looks good! The only annoying thing is that the 4 at the end is embedded into a list of its own, instead of being just a number. But that is easily fixed. Replace return \@res; with return @res > 1 ? \@res : $res[0];, in both expression and term, and the tree is as shallow as we want it to be. If you input a string with parenthesis, like "2*(3+4)" you'll notice that our parse tree doesn't contain any parenthesis, but the structure of the parse tree reflects the stucture they enforced:
One minor gotcha still remains: suppose you parse the string 1 2 - oddly it doesn't throw a syntax error. That's because the parser recognizes the 1 as a valid expression, and then stops. So what it also has to do is recognize the end of input (traditionally called "EOF", "end of file"), and parses until it finds EOF. To get that working, simply add the line push @tokens, ['EOF', '']; after the while-loop of the lexer, and then parse the input string like this:
Now we you run it again with 1 2 as input, it tells you Syntax error: expected EOF, got Number. Since we nearly wrote a calculator by now, we can just as well finish it by writing a short sub that evaluates the parse tree:
A bit more theoryThis parser is called a "recursive descending" parser, because it uses recursion to descent into the rules that finally match terminal symbols. It is a predictive, i.e. it guesses what the next token is, and it's a top down parser. Recursive descending parsers can't parse all context free languages, only a subset called LL(k), where k is the number of lookahead tokens (1 in our case). Instead of predicting the next token, we could read all the tokens and push them onto a stack. When the top items on the stack correspond to the right hand side of a grammar rule we can remove them, and push the rule (annoted with the match tokens, of course) onto the stack. That technique is bottom up parsing (or LR parsing), and the languages that can be parsed are called "deterministic context free languages", short DCFL. It is more flexible, but also more complicated. Parser generators like yacc or bison usually produce such LR parsers. More on Recursive Descending ParsersYou will have noticed that the subs expression and term have the same structure, and only differ in the operator and the next function to be called. One way around that duplicate code is a function factory (you'll notice that I've started to read "Higher Order Perl" ;-). Very good stuff!):
Since you need such a rule for every precedence level, that can save save you a lot of typing. But if you write more parsers, you'll notice that a lot of work will still be duplicated. A good alternative is a module like Parse::RecDescent which does all the work that you normally duplicate (though I admit that I've never used it myself). Perl 6 Rules, the replacement for regular expressions, are another alternative. And don't start screaming "but it's not there yet" - it is. You can use the Parrot Grammar Engine (PGE) to write Perl 6-like rules. A Perl 6 Rule comes in three flavours: "regex", "rule" and "token". A regex is just what you think it is: a regex, but not very regular. A "token" is a regex that doesn't backtrack, and a "rule" is a regex that doesn't backtrack and that treats every whitespace in the rule body as an optional whitespace in the input. So this is what our grammar would look like in Perl 6, with the old one as a comparsion:
Note that [...] are non-capturing groups, while char classes are <[...]> now. <name> is a call to another rule, and it's result is stored in a named capture with the name of the rule (which can be change with <capturename=rulename>). In case of a syntax error this regex will just not match, which is usually not the desired result. You can add a bit of error handling code:
Now when a closing parenthesis is missing you'll get a nice error message. Perl 6 will have an even better mechanism for parsing arithmetic expressions: you can specify what an operator and a what an atom is, and specify the precedence and associativity of each operator. Then Perl uses a bottom-up parser to does what you want:
We write a rule for each infix operator so that one can simply inherit from the grammar (like from a class) and override one rule without having to worry about the rest:
SummaryYou've seen that parsing isn't magic, and while it takes a little while to get use it, it's not all that hard if you can stick to simple grammars. In practice a recursive descending parser will get you very far, and it's not hard to write one. You can either do it manually, or use a module or Perl 6's fabolous rules. There's much more to say about parsing, for example about error handling and recovery, associativity and many other topics, but it's a start ;-) (And as usual I couldn't resist to introduce some Perl 6 stuff. You can see that from Perl 5 to Perl 6 we jump from "regular" to context free languages, which is quite a step in the Chomsky hierarchy.) Update: incorporate ikegami's suggestion on how to handle EOF detection, and to do it at all. Second update: added missing tokens for ( and ), hopefully clarified placement of EOF pushing
Back to
Tutorials
|
|