Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

RFC: Parsing with perl - Regexes and beyond

by moritz (Cardinal)
on Apr 03, 2008 at 08:31 UTC ( #678119=perlmeditation: print w/ replies, xml ) Need Help??

The topic of parsing comes up regularly. Most texts on parsing are either very theoretical or bound to a particular tool. I'll try to introduce parsing techniques that are well suited to be used with perl, and give just a bit of theoretical background.

Intended audience: Perl hackers that can already work with regular expressions, but don't have any formal Computer Science eduction, and don't really know how to parse things that are too complex for regular expressions.

Theoretical background: Parsing

Parsing is the process of matching a text against a formal specification.

In computer science the most important question is if the text can be derived from the specification.

In practice it is equally important to extract information according to the formal specification, like capturing part of a regex, or building a parse tree from a grammar.

Theoretical background: Regular Expressions

The name "Regular Expression" comes from computer science (I'll call it CS-Regex for disambiguation), and in CS it has a different meaning than in Perl.

CS-Regexes consists of concatenation, alternation, repetition (with *) and grouping. Perl regexes add many features that don't fit into the computer scientist's view on Regexes, like lookaround, capturing groups and backreferences, zero width assertions (like \b) and so on.

So CS-Regexes are not as powerful as Perl Regexes, but they have a big advantage: it's easy to build a machine that performs a CS-Regex match.

Regular expressions are well suited to match strings with short, fixed patterns, but they can't be used for any type of contstructs that uses nesting, like parentheis in arithmetic expressions, nested XML tags and the like. (You can do this with (??{ .. }) constructs in 5.8 and with recursive regexes in 5.10, but both are hard to write and maintain and generally a pain.)

CS-Regexes can be matched with a simple state machine called DFA (deterministic finite automaton), which consists of a bunch of states, and for each character from the input it is decided which state should be the next. So matching a string against a regex takes linear time (compared to the length of the string).

Context Free Languages

If you want to parse something like arithmetic expressions or a programming language, you need one more feature in your description language: recursion.

With recursion you can formulate grammars like this:

term -> atom operator term term -> '(' term ')' term -> atom operator -> '*' | '/'| '+' | '-' atom -> \d+

The first two definitions are recursive, so they are not "regular" any more. The computer scientist calls this a CFL, a "context free language".

Symbols like term are called non-terminals, while '*' represents a terminal symbol, i.e. one that appears verbatimely in the input.

Suppose our input string is '2*3+4', the derivation goes like this:

term -> atom operator term '2' operator term '2' '*' term '2' '*' (atom operator term) '2' '*' ('3' operator term) '2' '*' ('3' '+' term) '2' '*' ('3' '+' (atom)) '2' '*' ('3' '+' ('4'))

You can see that our short grammar matches the given expression, but not quite the way we'd like to have it. In normal notation '*' has tighter precedence than '+', if we would evaluate our parse tree it would first evaluate 3+4, and then multiply the result by 2.

That's not very surprising because our grammar didn't contain any precedence information. We can fix this:

expression -> (term add_op)* term term -> (factor mul_op)* factor factor -> atom | '(' expression ')' add_op -> '+' | '-' mul_op -> '*' | '/' atom -> \d+

Now we can derive '2*3+4' this way:

expression term add_op term term '+' (factor) term '+' ('4') (factor mulop factor) '+' ('4') ((atom) mulop (atom)) '+' ('4') (('2') '*' ('3') ) '+' ('4')

(Some lines contain a few derivations in parallel, just to save space. Each production that doesn't result in a terminal symbol produces a pair of parentheses to emphasize the resulting tree structure.)

The introduction of separate precedence levels for addition and multiplication leads to a correctly nested parse tree.

Lexing

Most compilers don't parse their input in one pass, but first break it up into smaller pieces called "tokens". The process is called "lexing" or "scanning".

There are two important reasons for that:

  1. Efficiency. Parsing CFLs takes O(n) time, scanning is done in linear time. By reducing parsing to tokens instead of characters the n decreases.
  2. Handling whitespaces and comments: The lexer can just throw away whitespceas and comments, that way we don't have to clutter up the grammar with rules how to handle whitespaces.

Lexing is normally done with a DFA, but since we're focusing on perl we can just as well use regular expressions:

#!/usr/bin/perl use strict; use warnings; use Data::Dumper; my @token_def = ( [Whitespace => qr{\s+}, 1], [Comment => qr{#.*\n?$}m, 1], [AddOp => qr{[+-]} ], [MulOp => qr{[*/]} ], [Number => qr{\d+} ], [OpenParen => qr{\(} ], [CloseParen => qr{\)} ], ); my $input = $ARGV[0] || "2 * 3 + 4 # and a comment\n"; my @tokens; pos($input) = 0; while(pos($input) < length $input){ my $matched = 0; for my $t (@token_def){ my ($name, $re, $ignore_flag) = @$t; if ($input =~ m/\G($re)/gc){ $matched = 1; next if $ignore_flag; push @tokens, [$name, $1]; next; } } die "Syntax error at postion " . pos($input) unless $matched } print Dumper \@tokens; __END__ $VAR1 = [ [ 'Number', '2' ], [ 'MulOp', '*' ], [ 'Number', '3' ], [ 'AddOp', '+' ], [ 'Number', '4' ] ];

The token definition consits of a name for the token, a regular expression and optionally a flag that indicates that the token shouldn't appear in the output.

The program loops while the end of the string is reached, trying to match all regexes in turn. As soon as a match is found, the token is pushed onto the @tokens array. (In a real compiler it would also store the current position to allow better error messages on wrong input).

The /c modifier on the regex takes care that a failed match doesn't reset pos($input). The \G assertion anchors the regex to the end of the previous match.

Implementing the parser

For the parser we need two utility functions:

sub match { my $expected_token = shift; if ($tokens[0][0] eq $expected_token){ my $current = shift @tokens; return $current->[1]; } else { die "Syntax error: expected $expected_token, got $tokens[0][0] +\n"; } } sub lookahead { my @expected = @_; no warnings 'uninitialized'; for (0 .. $#expected){ return 0 if $tokens[$_][0] ne $expected[$_]; } return 1; }

The production rules can be translated very easily to functions:

sub expression { my @res = (term()); while (lookahead('AddOp')){ push @res, match('AddOp'); push @res, term(); } return \@res; } sub term { my @res = (factor()); while (lookahead('MulOp')){ push @res, match('MulOp'); push @res, factor(); } return \@res; } sub factor { if (lookahead('OpenParen')){ match('OpenParen'); my $res = expression(); match('CloseParen'); return $res; } else { atom(); } } sub atom { match('Number'); }

You can see that branches (like in factor) and repetition (in expression and term) use a lookahead of one token to decide which branch to take. Fortunately we never have to backtrack, we're always in the right branch.

Enough talking, let's test it:

my $parse_tree = expression(); print Dumper $parse_tree; __END__ $VAR1 = [ [ '2', '*', '3' ], '+', [ '4' ] ];

Woot, that looks good!

The only annoying thing is that the 4 at the end is embedded into a list of its own, instead of being just a number. But that is easily fixed. Replace return \@res; with

return @res > 1 ? \@res : $res[0];
, in both expression and term, and the tree is as shallow as we want it to be.

If you input a string with parenthesis, like "2*(3+4)" you'll notice that our parse tree doesn't contain any parenthesis, but the structure of the parse tree reflects the stucture they enforced:

# parse tree for '2 * (3+4)': [ '2', '*', [ '3', '+', '4' ], ];

Since we nearly wrote a calculator by now, we can just as well finish it by writing a short sub that evaluates the parse tree:

sub execute { my $tree = shift; return $tree unless ref $tree; my %ops = ( '*' => sub { $_[0] * $_[1] }, '/' => sub { $_[0] / $_[1] }, '+' => sub { $_[0] + $_[1] }, '-' => sub { $_[0] - $_[1] }, ); my ($val, @rest) = @$tree; $val = execute($val); while (@rest){ my $op = shift @rest; my $next = execute(shift @rest); $val = $ops{$op}->($val, $next); } return $val; } print execute($parse_tree), "\n";

A bit more theory

This parser is called a "recursive descending" parser, because it uses recursion to descent into the rules that finally match terminal symbols.

It is a predictive, i.e. it guesses what the next token is, and it's a top down parser.

Recursive descending parsers can't parse all context free languages, only a subset called LL(k), where k is the number of lookahead tokens (1 in our case).

Instead of predicting the next token, we could read all the tokens and push them onto a stack. When the top items on the stack correspond to the right hand side of a grammar rule we can remove them, and push the rule (annoted with the match tokens, of course) onto the stack.

That technique is bottom up parsing (or LR parsing), and the languages that can be parsed are called "deterministic context free languages", short DCFL. It is more flexible, but also more complicated.

Parser generators like yacc or bison usually produce such LR parsers.

More on Recursive Descending Parsers

You will have noticed that the subs expression and term have the same structure, and only differ in the operator and the next function to be called.

One way around that duplicate code is a function factory (you'll notice that I've started to read "Higher Order Perl" ;-). Very good stuff!):

sub make_rule { my ($operator, $next) = @_; return sub { my @res = ($next->()); while (lookahead($operator)){ push @res, match($operator); push @res, $next->(); } return @res > 1 ? \@res : $res[0]; }; } *term = make_rule('MulOp', \&factor); *expression = make_rule('AddOp', \&term);

Since you need such a rule for every precedence level, that can save save you a lot of typing.

But if you write more parsers, you'll notice that a lot of work will still be duplicated.

A good alternative is a module like Parse::RecDescent which does all the work that you normally duplicate (though I admit that I've never used it myself).

Perl 6 Rules, the replacement for regular expressions, are another alternative. And don't start screaming "but it's not there yet" - it is. You can use the Parrot Grammar Engine (PGE) to write Perl 6-like rules.

A Perl 6 Rule comes in three flavours: "regex", "rule" and "token". A regex is just what you think it is: a regex, but not very regular. A "token" is a regex that doesn't backtrack, and a "rule" is a regex that doesn't backtrack and that treats every whitespace in the rule body as an optional whitespace in the input.

So this is what our grammar would look like in Perl 6, with the old one as a comparsion:

grammar Arithmetic { # expression -> (term add_op)* term rule expression { [ <term> <mul_op> ]* <term> } # term -> (factor mul_op)* factor rule term { [ <factor> <mul_op> ]* <factor> } # factor -> atom | '(' expression ')' rule factor { | <atom> | '(' <expression> ')' } # add_op -> '+' | '-' token add_op { <[+-]> # that's a char class } # mul_op -> '*' | '/' token mul_op { <[*/]> } token atom { \d+ } } # match it: $input ~~ Grammar.expression;

Note that [...] are non-capturing groups, while char classes are <[...]> now. <name> is a call to another rule, and it's result is stored in a named capture with the name of the rule (which can be change with <capturename=rulename>).

In case of a syntax error this regex will just not match, which is usually not the desired result. You can add a bit of error handling code:

rule factor { | <atom> | '(' <expression> [ ')' || <panic: Expected ')'> ] }

Now when a closing parenthesis is missing you'll get a nice error message.

Perl 6 will have an even better mechanism for parsing arithmetic expressions: you can specify what an operator and a what an atom is, and specify the precedence and associativity of each operator. Then Perl uses a bottom-up parser to does what you want:

# this example mixes official Perl 6 syntax an PGE syntax, # I don't really know if they are 100% compatible # see https://svn.perl.org/parrot/trunk/docs/pct/pct_optable_guide.pod # and http://perlcabal.org/syn/S05.html rule factor { | <atom> | '(' <expression> ')' } token infix:<*> { <sym> }; # <sym> stands for a literal * here token infix:</> is equiv('infix:*') { <sym> }; token infix:<+> is looser('infix:*') { <sym> }; token infix:<-> is equiv('inifx:+') { <sym> }; # put parser in bottom-up mode: rule expression is optable { ... }; # yes, '...' is valid Perl 6 ;-) proto 'term:' is tighter('infix:*') is parsed(&factor) { ... };

We write a rule for each infix operator so that one can simply inherit from the grammar (like from a class) and override one rule without having to worry about the rest:

ModernArithmetic is Arithmetc { token infix:</> { '' } # different symbol, same rule name }

Summary

You've seen that parsing isn't magic, and while it takes a little while to get use it, it's not all that hard if you can stick to simple grammars.

In practice a recursive descending parser will get you very far, and it's not hard to write one. You can either do it manually, or use a module or Perl 6's fabolous rules.

There's much more to say about parsing, for example about error handling and recovery, associativity and many other topics, but it's a start ;-)

(And as usual I couldn't resist to introduce some Perl 6 stuff. You can see that from Perl 5 to Perl 6 we jump from "regular" to context free languages, which is quite a step in the Chompsky hierarchy.)

Update: planetscape noted that this is Tutorials material, so regards this as an RFC.

Second update: added missing tokens for OpenParen and CloseParen, again after feedback from planetscape++.

Comment on RFC: Parsing with perl - Regexes and beyond
Select or Download Code
Re: Parsing with perl - Regexes and beyond
by BrowserUk (Pope) on Apr 03, 2008 at 08:51 UTC

    How would Chomsky categorise this? (Update: Corrected c&p typo per Erez post below.)

    #! perl -slw use strict; $|++; my %ops = ( '+' => sub{ $_[ 0 ] + $_[ 1 ] }, '-' => sub{ $_[ 0 ] - $_[ 1 ] }, '*' => sub{ $_[ 0 ] * $_[ 1 ] }, '/' => sub{ $_[ 0 ] / $_[ 1 ] }, '**'=> sub{ $_[ 0 ] ** $_[ 1 ] }, ); my @presedence = ( qr[\*\*], qr[\*|/], qr[\+|-], ); my $reVar = qr[[a-z]+]; my $reConst = qr[ [+-]? (?:\d+\.)? \d+ (?: [eE] [+-]? \d+ )? ]x; my $reArg = qr[$reVar|$reConst]; my $reOps = qr[@{[ join '|', map{ quotemeta } keys %ops ]}]; my $reTokenise = qr[\s*($reArg)(?:\s*($reOps))?]; sub parseEvalExpr { my $expr = shift; if( $expr =~ m[$reOps \s+ $reArg \s+ $reOps]x ) { for my $opset ( @presedence ) { return "($expr)" if $expr =~ s[ ( $reArg \s+ $opset \s+ $reArg ) ]{($1) +}x; } } my @tokens = $expr =~ m[$reTokenise]g; pop @tokens unless defined $tokens[ $#tokens ]; while( @tokens > 1 ) { ( my( $arg1, $op, $arg2 ), @tokens ) = @tokens; unshift @tokens, $ops{ $op }->( $arg1, $arg2 ); } return $tokens[ 0 ]; } while( <DATA> ) { chomp; my $testResult = eval; printf "'$_' = "; warn "Unbalanced parens '$_'" and next unless tr[(][] == tr[)][]; while( m[[()\s]] ) { s[ \( ( [^()]+ ) \) ]{ parseEvalExpr( $1 ) }xe while m[[()]]; $_ = parseEvalExpr( $_ ); } print; printf STDERR "*** Discrepancy! Eval gets: %s\n", $testResult unless $_ eq $testResult; } __DATA__ 1 + 2 2 - 1 2 * 1 1 / 2 (((7 + 5) * (9 + 13)) / ((4 + 3) * (17 - 2 + 3))) 23 ** 2 1.1e10 ** -10 1 + 2 * 3 3 + 2 ** 2 ** 2 * 3

    Output

    c:\test>expr_parser.pl '1 + 2' = 3 '2 - 1' = 1 '2 * 1' = 2 '1 / 2' = 0.5 '(((7 + 5) * (9 + 13)) / ((4 + 3) * (17 - 2 + 3)))' = 2.0952380952381 '23 ** 2' = 529 '1.1e10 ** -10' = 3.85543289429532e-101 '1 + 2 * 3' = 7 '3 + 2 ** 2 ** 2 * 3' = 51

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      How would Chompsky categorise this?

      He wouldn't, because you didn't specify a grammar. You implemented one in a Turing machine (aka perl).

      The implicit, underlying grammar for parsing is not in RE. It's in CFL, even in DCFL. In this case it's LL(1) (you need one character lookahead to distinguish * and ** tokens).

      Ironically you usually think of LL(1) in terms of top down parsers, that is "all grammars that can be matched by recursive descending parsers with one token lookahead" (not an exact definition, of course), but you use a bottom up technique.

      The technique is none of the standard techniques I know, which all read the input string (or stream) from left to right, while you perform multiple passes (with while( m[[()\s]] )).

      Update:

      The computation of the result doesn't work in CFL, of course. If you're really a tough guy you can evaluate such things in context sensitive grammars (but it's much work), and for that you need, in general, a linearly bounded Turing machine (LTM).

      In the normal computation model you can't modify the input tape, and the size of a supplemental, rw-tape is the size the "linear" refers to. And since the algorithm above uses a complete copy of the input string, it also uses (at least) a LBM ;-)

      Is Chompsky a play on Noam Chomsky and chomp, or we're talking about a different person here?

      Software speaks in tongues of man.
      Stop saying 'script'. Stop saying 'line-noise'.
      We have nothing to lose but our metaphors.

        Maybe it's just a typo? At least it was from me ;-)

        But it's indeed Noam Chomsky we're talking about.

Re: RFC: Parsing with perl - Regexes and beyond
by sundialsvc4 (Monsignor) on Apr 03, 2008 at 14:22 UTC

    You may or may not find that “recursive descent” parsing is right for you. Recursive-descent is very appropriate when there's a lot to do at each point, and the structure of the input-language is very consistent and predictable. Cases where the input is, if you will, “heavy and full of meaning.”

    But there is another approach that is sometimes quite useful:   it's a more grammar-driven, state-table-driven approach, the one used by bison or yacc, in which a grammar-following engine is chugging through its internal description of the language, building trees (“hashes and lists of hashes and lists”) along the way, and calling-out to your code as pure subroutines along the way.

    In this scenario, that “grammar-following engine” is driving the bus for the entire time. It calls-out to you at the prescribed times, passing your routines the information they're supposed to receive, and expecting each routine to mind its own business. The complexities of the language being processed are not represented in (nor should they be represented in) the discrete subroutines that you write... unlike recursive-descent, where more-or-less the opposite is true.

    Which is “better?” That question can't be answered categorically: “it depends.” It's enough to know that in Perl both approaches are available to you. Each approach is distinct, and each approach is distinct for a to-it appropriate reason. That's why you get to choose.

    This is, or can be, a very deep pool ... but don't go any deeper than you actually need to go, unless you're curious and you've brought your life-jacket with you.

    I applaud this piece because ... parsers are an extremely powerful technology that you need to become aware-of. I'm not quite sure why they get relegated to the cloistered halls of graduate-school. They're actually relatively straightforward these days. So... if you're wrestling your way through “regular expression hell,” remember that ... especially in Perl-land ... There's More Than One Way To Do It.™

Re: RFC: Parsing with perl - Regexes and beyond
by blokhead (Monsignor) on Apr 03, 2008 at 15:25 UTC
    When talking about lexing, you say:
    Parsing CFLs takes O(n) time.
    This is a little misleading. LALR parsers take linear time (where the size of its lookup tables is taken to be a constant, since it is independent of the input string to be parsed), while recursive descent can in principle take exponential time.

    I'm guessing you are thinking of the CKY algorithm, which does take cubic time. This algorithm's importance is strictly theoretical, since it can be used to efficiently parse completely arbitrary grammars. In practice, we exploit the extra structure possesed by most practical grammars to obtain faster parsing than CKY provides.

    So I don't think lexing is a matter of running-time efficiency. Its benefit is that if you blindly parse a character stream without lexing, you would have a separate leaf in the parse tree for every character (including whitespace!). Tokens like /\d+/, if expressed directly in the context-free grammar and parsed one character at a time, would then lead to very unwieldy parse trees like this:

    "123456" --> [1, [2, [3, [4, [5, 6]]]]]
    Lexing "collects" all these nasty bottom sections of the parse tree, which would otherwise be everywhere, into chunks that reflect how the parse tree will be used. Of course this could also be cleaned up with special cases in the parser, but lexing separates out this step in an elegant way.

    PS: "Chompsky" (with a p) is a name commonly given to pet dogs by linguists... either as an insult or homage to Chomsky, depending on which side of the great linguistic divide you fall ;)

    blokhead

      This is a little misleading.

      Well, it's the worst case runtime. In the normal case it's not that bad, but the fact remains that in general lexing is more efficient than parsing.

      You're right that most grammars can be parsed in linear time (actually many grammars have their current shape to facilitate parsing).

      When I was a teen, there was a chimpanzee named Nim Chimpsky that was far better known among my peers than any dog named Chompsky or any linguist named Chomsky.

      They used the animal for sign language experiments.

Re: RFC: Parsing with perl - Regexes and beyond
by pKai (Priest) on Apr 03, 2008 at 21:21 UTC
    The first two definitions are recursive, so they are not "regular" any more.

    You can argue that the first production is "tail recursive" which does not pose a problem with respect to regularity. Like a tail recursive function can be transformed into a loop.

    The non-regularity comes with the 2nd production.

    term        -> '(' term ')'

    The point is that term can grow to arbitrary length and we still should keep the correspondence between the two parentheses, which is not possible with (CS)-Regexes ("pumping lemma").

    This last observation is the border where you need a CFL-parser and would be lost with Regexes alone.

Re: RFC: Parsing with perl - Regexes and beyond
by goibhniu (Hermit) on Apr 04, 2008 at 18:37 UTC

    I've read about 1/3 of this and fully intend to finish it. Overall, I've given it ++. The following is intended as a constructive comment in the "RFC" sense.

    I think I self-identify with your stated audience, Perl hackers that can already work with regular expressions, but don't have any formal Computer Science eduction, and don't really know how to parse things that are too complex for regular expressions. I'm currently trying my hand at an applied parsing problem (Reversible parsing (with Parse::RecDescent?)). I got as far as I did searching CPAN for "parse" and reading the doc. I'd say I got to a practical (in the parse direction, not in the reverse direction), albeit simple solution. I learned alot more from ikegami's reply. I'm also learning from the documentation for Parse::Marpa

    The criticism is that I'm not sure I could have gotten to that level of practical results from yout tutorial. I think your tutorial has aimed a little high for the stated target audience. Terms like "deterministic finite automaton", "linear time", "Context Free Languages" and the whole "A bit more theory" section slow me down a little and make it a dense read. Having gone through the work I already went through, your tutorial is helpful and instructive, especially in defining some terms that the other sources have been throwing around. This gives me good academic material to compliment the quick-n-dirty applicable stuff I got from Parse::RecDescent's documentation, et al..

    Just to be clear, I find this to be immensely informative, and am enjoying working my way through it. The academic terms are defined as you use them for the most part, and I'm learning alot. I think the only edit I'd suggest (partly becasue the whole article seems a little above me for me to be making significant content suggestions) is to restate the intended audience. This is an intermediate-to-advanced text that fills a niche between Perl hackers that can already work with regular expressions, but don't have any formal Computer Science eduction and the heavily technical papers that are intended for academia. Said Perl Hackers can use this to improve themselves, but I found it a little beyond (though certainly comlementary to) "practical".


    #my sig used to say 'I humbly seek wisdom. '. Now it says:
    use strict;
    use warnings;
    I humbly seek wisdom.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://678119]
Approved by Corion
Front-paged by planetscape
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (8)
As of 2014-09-02 17:34 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite cookbook is:










    Results (25 votes), past polls