An adventure with 5.10 regular expressions

Shortly after perl 5.9.5 was released I was impressed by this presentation by Yves Orton, particularly grammars. When 5.10 was released I thought I'd play around a bit and learn how to use the new features, especially recursion and grammars. After a couple of false starts, mainly caused by not understanding the documentation, I decided on the goals.

It had to be simple
I could express the grammar as a Backus Normal Form
I wanted to use recursion
I would have fun

The obvious choice was a regular expression to match and evaluate Reverse Polish Notation (RPN). I still pine for my HP41C;-)

An easy start

The BNF is fairly straight forward.

expr      := number
           | expr ws unary_operator 
           | expr ws expr ws binary_operator
number    := digit {digit} [ '.' {digit} ]
           | '.' digit {digit}
digit     := '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
unary_op  := "chs" | "abs" | "sqr" | "sqrt" | "sin" | "cos" | "tan"
binary_op := '+' | '-' | '*' | '/' | '^'
ws        := ' ' | '\t' | '\r' | '\n'

Note: I use "chs" as shorthand for 'c' 'h' 's'
[download]

Translation from a BNF production to a perl regular expression is not difficult. Production rules have names and refer to each other by name, perl regexes now have named captures and they can fill the same role. For example

digit  -> /(?<digit> \d )/msx
number -> /(?<number> (?&digit)+ (?:[.](?&digit)*)
                |     [.](?&digit)+ )/msx
[download]

in practice digit and ws are so simple I just inlined them.

This is the complete regular expression generated from that BNF to match an arbitrary RPN expression.

qr{
(?(DEFINE)
    (?<expr>
        (?&number)
    |   (?&expr) \s+ (&unary_op)
    |   (?&expr) \s+ (?&expr) \s+ (&binary_op)
    )
    (?<number>
        \d+ (?: [.] \d* )?
    |   [.] \d+
    )
    (?<unary_op>
        (?: chs | abs | sqr | sqrt | sin | cos | tan )
    )
    (?<binary_op>
        [-+/*^]
    )
)
    \A \s* (?&expr) \s* \z
}msx;
[download]

Feeling fairly pleased with myself I started to test it and second try, matching against '12 34 +', I got Infinite recursion in regex at rpg.pl line 136. I tinkered a bit sticking print statements into the regex. The regex gets stuck at position 0 and I couldn't figure out how to break the recursion. The problem is that I need some way to force the expr clause to never backtrack if the number alternative matches. For example when matching '12 chs' the number alternative of expr matches 12 but the whole fails because the end of the string hasn't been reached, it then tries expr ws unary_op, recurses back to expr and number grabs 12 but then fails because it hasn't got to the end of the string, so it backtracks out and it now attempts expr again, and again, and...

I probably should have paid more attention when the lecturer talked about top down parsers and LLR grammars. If any monk can shed light I will be most happy. As an aside, I discovered that say isn't recognized in a regex code block as I got this error

String found where operator expected at (re_eval 10) line 1, near "say
+ "matching at ""
    (Do you need to predeclare say?)
syntax error at (re_eval 10) line 1, near "say "matching at ""
Compilation failed in regexp at rpg.pl line 137.
[download]

It also gives an error for given/when, I didn't test state as that is too bizarre.

Backtracking regexes

It looked like I needed to rethink my approach. If I were writing an RPN evaluator from scratch, I'd

parse the string from the start
pick out the tokens
check the first token is a number
push numbers onto a stack
apply unary operators to the top of the stack
pop the top element and apply a binary operator to it and the top of the stack
check the stack had exactly one number at the end of the input

This still looks like a good match for regular expressions and I can reuse some of the definitions from the failed attempt.

My next regex was this

(?(DEFINE)
    (?<expr>
        (?&number)
        (?:
            \s*
            (?:
                (?&number)
            |   (?&unary_op)
            |   (?&binary_op)
            )
        )*
    )
    (?<number>
        (?> \d+ (?: [.] \d* )? | [.] \d+ )
        (?{ $^R + 1})
    )
    (?<unary_op>
        (?i: chs | abs | sqr | sqrt | sin | cos | tan ) 
        (?![a-zA-Z])
    )
    (?<binary_op>
        [-+/*^]
        (?{ $^R - 1 })
    )
)

    \A \s* (?&expr) \s* \z
    (?(?{ $^R != 1 }) (*FAIL) ) 
}msx
[download]

The special variable $^R contains the value of the last (?{ code }) block executed. Is localized so that it's value is restored on backtracking. I use it to keep track of the number of numbers seen. This did indeed pass my tests.

A working calculator

Having got this far it would seem a simple matter to extract the tokens but this paragraph from perlre points out the difficulties ahead.

An example of how this might be used is as follows:
  /(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
      (?(DEFINE)
          (?<NAME_PAT>....)
          (?<ADRESS_PAT>....)
      )/x
[download]
Note that capture buffers matched inside of recursion are not accessible after the recursion returns, so the extra layer of capturing buffers is necessary. Thus $+{NAME_PAT} would not be defined even though $+{NAME} would be.

The entire RPN expression can be matched with /($RPN)/ but you can't access any of the internal matches made during a recursive regex. I wanted to somehow capture the internal matches and use them to evaluate the expression. There is a bit of work to do. Some points to remember

Lexical variables don't reliably work with recursive regexes, need to use globals
Using local when modifying a variable restores the value when backtracking occurs
Global variables need to be in a known good state before starting the match, local helps here as the original value is restored on regex exit. You could reset them before using the regex but it seems better practice to let the regex do it.

To keep track of numbers I used @stack and modified the number rule by adding capturing parentheses, $^N is the value of the last capture and the code section pushes it onto the stack.

    (?<number>
        ( (?> \d+ (?: [.] \d* )? | [.] \d+ ) )
        (?{ local @stack = ( @stack, $^N ) })
    )
[download]

I made similar transformations of unary_op and binary_op. This is the complete regex.

qr{
(?(DEFINE)
    (?<expr>
        (?&number)
        (?:
            \s+
            (?:
                (?&number)
            |   (?&unary_op)
            |   (?&binary_op)
            )
        )*
    )
    (?<number>
        ( \d+ (?: [.] \d* )? | [.] \d+ )
        (?{ local @stack = ( @stack, $^N ) })
    )
    (?<unary_op>
        ( chs | abs | sqr | sqrt | sin | cos | tan )
        (?(?{ @stack < 1 }) (*FAIL) )
        (?{ local @stack = @stack; $operators{$^N}->($stack[-1]) })
    )
    (?<binary_op>
        ( [-+/*^] )
        (?(?{ @stack < 2 }) (*FAIL) )
        (?{ local @stack = @stack; $operators{$^N}->($stack[-2], pop @
+stack) })
    )
)

    \A \s* (?&expr) \s* \z
    (?(?{ @stack != 1 }) (*FAIL) )
    (?{ $result = $stack[0] })
    |
    (?{ $result = undef }) (*FAIL)

}msx
[download]

Note the use of (*FAIL) clauses to check that the stack has the expected number of elements and, combined with alternation, to reset $result to undef if no match occurred.

For the sake of brevity (although I feel I can't be accused of that) I have omitted error checking. A divide by zero or square root of a negative number will cause the regex to die.

On reflection

It occurred to me later that I could eliminate recursion entirely, without recursion I didn't need the extra level of capturing parentheses as I could use $+{foo} instead. Also I could put the number | unary_op | binary_op clause in (?> ). Without backtracking I no longer need to localize @stack. I did have to ensure that @stack was initialized to () at the start of the regex. $result = undef was also moved to the start. I think it better to initialize variables before use rather than relying on them being put back after the last time (I have small children, I know it doesn't happen;)

qr{
    (?{ @stack  = ();
        $result = undef;
    })
    \A \s*
    (?<start_number> \d+ (?: [.] \d* )? | [.] \d+ )
    (?{ push @stack, $+{start_number} })
    (?>
        \s+
        (?:
            (?<number> \d+ (?: [.] \d* )? | [.] \d+ )
            (?{ push @stack, $+{number} })
        |
            (?<unary_op> chs | abs | sqrt | sqr | sin | cos | tan )
            (?(?{ @stack < 1 }) (*FAIL) )
            (?{ $operators{ $+{unary_op} }->($stack[-1]) })
        |
            (?<binary_op> [-+/*^] )
            (?(?{ @stack < 2 }) (*FAIL) )
            (?{ $operators{ $+{binary_op} }->($stack[-2], pop @stack) 
+})
        )
    )*
    \s* \z
    (?(?{ @stack != 1 }) (*FAIL) )
    (?{ $result = $stack[0] })
}msx,
[download]

A small gotcha was that any expression with sqrt in it failed. The regex matched sqr and could not backtrack to try sqrt, the solution I used was to reverse the order of the two words, I could also have combined them as sqrt?.

Emma Chissit

A popular clash of Australian and English English occurred in 1964 when Monica Dickens, an English writer, was signing her latest book in Sydney. As told by the Sydney Morning Herald, 30 November 1964, the conversation between one of the female Australians in the queue and the author was as follows:
Aussie: Emma Chissit
Author, thinking that was her name, wrote "To Emma Chissit" in the book cover.
Aussie (speaking deliberately): No, emma chissit?

The australian was, of course, speaking strine and 'How much is it?' was the question.

To see what is the overhead of recursion I used Benchmark.

my $input = join ' ', (1) x 1_001, ('+') x 1_000;

cmpthese(
    100,
    {
        recursive => sub { $input1 =~ $RPN1 },
        iterative => sub { $input2 =~ $RPN2 },
    },
);
            s/iter recursive      flat
recursive     1.35        --      -99%
iterative 1.47e-02     9103%        --
[download]

Recursion was more expensive than I had expected, Admittedly this is a pathalogical case where the stack becomes huge. Where the stack has at most 2 elements the difference is considerably less.

# my $input = join ' ', 1, ('1 +') x 1_000;
            Rate recursive iterative
recursive 40.5/s        --      -40%
iterative 67.6/s       67%        --
[download]

Even so a relatively small input and stack suffers.

# my $input = join ' ', (1) x 4, ('+') x 3;
             Rate recursive iterative
recursive 14306/s        --      -17%
iterative 17271/s       21%        --
[download]

Just to see how much the evaluation of the RPN expression affected the result I ran the a benchmark against the same data but with non evaluating regexes that use $^R to keep track of the number of numbers seen. That is the first working regex and this one adapted from the iterative version.

qr{
    \A \s*
    (?<start_number> \d+ (?: [.] \d* )? | [.] \d+ )
    (?{ 1 })
    (?>
        \s+
        (?:
            (?<number> \d+ (?: [.] \d* )? | [.] \d+ )
            (?{ $^R + 1})
        |
            (?<unary_op> chs | abs | sqrt | sqr | sin | cos | tan )
        |
            (?<binary_op> [-+/*^] )
            (?{ $^R - 1 })
        )
    )*
    \s* \z
    (?(?{ $^R != 1 }) (*FAIL) )
}msx,
[download]

# my $input = join ' ', (1) x 1_001, ('+') x 1_000;
            Rate recursive iterative
recursive 83.6/s        --      -80%
iterative  415/s      396%        --

# my $input = join ' ', 1, ('1 +') x 1_000;
            Rate recursive iterative
recursive 82.2/s        --      -78%
iterative  376/s      357%        --

# my $input = join ' ', (1) x 4, ('+') x 3;
             Rate recursive iterative
recursive 28249/s        --      -69%
iterative 92593/s      228%        --
[download]

As you can see the difference is quite marked.

Here is a script that tests the regex.

#!//net/perl/5.10.0/bin/perl
use warnings;
use strict;

use 5.010_000;

our @stack;
our $result;

my %operators = (
    '+' => sub { $_[0] += $_[1] },
    '-' => sub { $_[0] -= $_[1] },
    '/' => sub { $_[0] /= $_[1] },
    '*' => sub { $_[0] *= $_[1] },
    '^'  => sub { $_[0]**= $_[1] },
    chs  => sub { $_[0] = -$_[0] },
    abs  => sub { $_[0] = abs $_[0] },
    sqr  => sub { $_[0] *= $_[0] },
    sqrt => sub { $_[0] = sqrt $_[0] },
    sin  => sub { $_[0] = sin $_[0] },
    cos  => sub { $_[0] = cos $_[0] },
    tan  => sub { $_[0] = sin $_[0] / cos $_[0] },
);

my $RPN = qr{
    (?{ @stack = () })
    (?{ $result = undef })
    \A \s*
    (?<start_number> \d+ (?: [.] \d* )? | [.] \d+ )
    (?{ push @stack, $+{start_number} })
    (?>
        \s+
        (?:
            (?<number> \d+ (?: [.] \d* )? | [.] \d+ )
            (?{ push @stack, $+{number} })
        |
            (?<unary_op> chs | abs | sqrt | sqr | sin | cos | tan )
            (?(?{ @stack < 1 }) (*FAIL) )
            (?{ $operators{ $+{unary_op} }->($stack[-1]) })
        |
            (?<binary_op> [-+/*^] )
            (?(?{ @stack < 2 }) (*FAIL) )
            (?{ $operators{ $+{binary_op} }->($stack[-2], pop @stack) 
+})
        )
    )*
    \s* \z
    (?(?{ @stack != 1 }) (*FAIL) )
    (?{ $result = $stack[0] })
}msx,

my %input = (
    'fail'                => undef,
    '1 fail'              => undef,
    '123'                 => 123,
    '123.456'             => 123.456,
    '123.456 +'           => undef,
    '.456'                => .456,
    '123.'                => 123.,
    '123 /'               => undef,
    ' 123 '               => 123,
    '123 chs'             => -123,
    '123 chs chs'         => 123,
    '123 chs chs chs'     => -123,
    '123 chs chs chs chs' => 123,
    '4 2 ^'               => 16,
    '4 2 /'               => 2,
    '4 2 +'               => 6,
    '4 2 *'               => 8,
    '4 2 -'               => 2,
    '12 34+'              => undef,
    '12 34'               => undef,
    '12 34 + +'           => undef,
    '12 34 56 + +'        => 102,
    '12 34 + 56 +'        => 102,
    '123 chs 34 *'        => -4_182,
    '123 34 * chs'        => -4_182,
    '24'                  => 24,
    '24 abs'              => 24,
    '24abs'               => undef,
    '24abssqrt'           => undef,
    '1 24abs+sqrt'        => undef,
    '1 24 abs + sqrt'     => 5,
    '24 chs abs'          => 24,
    '24 chs'              => -24,
    '25 abs sqrt'         => 5,
    '25 chs abs sqrt'     => 5,
    '24 4 + chs sqr'      => 784,
    '34 2 / 2 ^'          => 289,
    '3 34 2 / 2 ^ *'      => 867,
);

foreach my $line ( keys %input ) {
    my ($expression) = $line =~ /($RPN)/;
    if ( !defined $result && !defined $input{$line} ) {
        say "OK: $line is invalid, got no result";
    }
    elsif ( !defined $result ) {
        say "FAILED: $line expected $input{$line}, got no result";
    }
    elsif ( !defined $input{$line} ) {
        say "FAILED: $line got $result, expected no result";
    }
    elsif ( $result == $input{$line} ) {
        say "OK: $expression = $result";
    }
    else {
        say "FAILED: $line got $result, expected $input{$line}";
    }
}
[download]

Final thoughts

regextidy would make my day.
A correct BNF is useful even if you can't turn it into a working regular expression.
The new regular features are powerful.
I don't yet completely understand how to use them.
Capturing matches during recursion requires hand crafting.
Practice makes, if not perfect, then better.
Recursion is expensive.
I had fun;-)
regextidy would really make my day.

Postscript

I started by looking for a good introdutory tutorial for the new features and, when I couldn't find anything, then to write a simple tutorial showing most of them. It didn't turn out that way but at the least I warn others of the pitfalls and, I hope, show a field worth exploring. With help I think I can write such an article so all criticism, tips and references will be appreciated. This is my first long posting so any suggestions regarding presentation are especially welcome.

References

Comment on An adventure with 5.10 regular expressions Select or Download Code

Replies are listed 'Best First'.
Re: An adventure with 5.10 regular expressions by moritz (Cardinal) on Feb 05, 2008 at 13:58 UTC
Very nice reading, ++. After Perl 5.10 release I started to write a regex that parses XML, and ran into similar problems. What I learned during crafting that expression: nested regexes are complicated They are still buggy (I found two bugs in the re engine in two days) There's a reason we want Perl 6 regexes/rules `(?{print "matched so far: $&\n"})` blocks really help while debugging Use atomic groups wherever possible. Not only for performance reasons, but also to avoid confusing, multiple execution of embedded closures	[reply] [d/l]
Re: An adventure with 5.10 regular expressions by JStrom (Pilgrim) on Feb 07, 2008 at 02:33 UTC
This might clear things up. There's a slight problem with your grammar: `expr := number \| expr ws unary_operator \| expr ws expr ws binary_operator` [download] The first part of an expr can be another rule. This means the parser can't know which choice is the correct one until it follows that rule and hits an error. Usually one will manipulate the grammer so that a token always upfront. Otherwise you run the risk of choosing the wrong rule and having to backtrack. For example: `RULE1 = RULE2 \| RULE3 RULE2 = A B RULE3 = A C -- becomes -- RULE1x = A ( RULE2x \| RULE3x ) RULE2x = B RULE3x = C` [download] In this case it's especially nasty since the rule recurses and you get something like: `Parse "1 2 +" as an expr Try 'number' Parse 'number', left with "2 +" Parse 'end', PANIC! Try 'expr unary' Parse "1 2 +" as an expr Try 'number' Parse 'number', left with "2 +" Parse 'end', PANIC! Try 'expr unary' ...` [download] The problem with RPN here is that you can't know how many exprs to parse for until you've seen the operators. If you somehow managed to prevent the recursion, you'd still see performance like: `Parse "1 2 +" as an expr Try 'number' Parse 'number', left with "2 +" Parse 'end', PANIC! Try 'expr unary' Parse 'number', left with "2 +" Parse 'unary', PANIC! Try 'expr binary' Parse 'number', left with "2 +" Parse 'number', left with "+" Parse 'binary', left with "" Parse 'end', done!` [download] Imagine the backtracking with a more complex expression! Don't give up hope though. You need to know the operators at the end of the RPN expression to parse via the BNF. The simple solution is to do just that: my %ops = ( '+' => sub { $_[0] + $_[1] }, '-' => sub { $_[0] - $_[1] }, '' => sub { $_[0] $_[1] }, '/' => sub { $_[0] / $_[1] }, '^' => sub { $_[0] ** $_[1] }, 'nat' => sub { sin($_[0]) / cos($_[0]) }, 'soc' => sub { cos($_[0]) }, 'nis' => sub { sin($_[0]) }, 'trqs' => sub { sqrt($_[0]) }, 'rqs' => sub { $_[0] * $_[0] }, 'sba' => sub { abs($_[0]) }, 'shc' => sub { -$_[0] }, ); $RPN3 = qr# (?(DEFINE) (?<expr> (?> ((?&number)) (?{ push @stack, scalar reverse $^N }) ) \| (?: (?> ((?&unary)) ) \s+ (?> (?&expr) ) (?{ push @stack, +$ops{$^N}->( pop @stack ) }) ) \| (?: (?> ((?&binary)) ) \s+ (?> (?&expr) ) \s+ (?> (?&expr) + ) (?{ push @stack, $ops{$^N}->( pop @stack, pop @stack ) }) ) ) (?<number> ( (?> \d* ) (?> [.] ) (?> \d+ ) \| (?> \d+ ) (?> [.] ) \| \d ++ ) ) (?<unary> ( shc \| sba \| rqs \| trqs \| nis \| soc \| nat ) ) (?<binary> ( [-+/^] ) ) ) (?{ @stack = (); $result = undef; }) ^ \s (?&expr) (?{ $result = $stack[0] }) \s* $ #sx; (reverse $line) =~ $RPN3; [download] Since the only time this regular expression will backtrack is a failure, I can save a few processing steps here and there. For one, every token is now encased in a (?> ... ) construct (just like Perl6 tokens only more verbose! (speaking of Perl6, its longest token rule would have solved your problem with sqr and sqrt)). If the needed expression isn't there, no amount of backtracking is going to create it. Also since I'm not worried about backtracking all those local()s are gone. I see about a 110% speed increase over the forwards method.	[reply] [d/l] [select]
Re^2: An adventure with 5.10 regular expressions by hipowls (Curate) on Feb 07, 2008 at 08:14 UTC
Brilliant! I knew I was overlooking something simple (that doesn't mean obvious). My only criticism is that the regex should be `$NPR = qr{...}` ;)	[reply] [d/l]
Re: An adventure with 5.10 regular expressions by ikegami (Patriarch) on Feb 07, 2008 at 14:44 UTC
I haven't read everything yet, but I thought I'd post a link to a tutorial of mine before I come back to this. Operator Associativity and Eliminating Left-Recursion in Parse::RecDescent was written for Parse::RecDescent, but the principles apply to other LL parsers too. Quickly, I think `expr : expr expr binop { [ BINOP => [ $item[3], $item[1], $item[2] ] ] } \| term` [download] can be transformed into `expr : term { push($item[1]) } expr_ \| term expr_ : expr binop { [ BINOP => [ $item[2], pop(), $item[1] ] ] }` [download] Factoring out common prefixes for speed: `expr : term { push($item[1]) } expr_1 expr_1 : expr_2 \| # Automatic match { pop() } expr_2 : expr binop { [ BINOP => [ $item[2], pop(), $item[1] ] ] }` [download] Inlining expr_2 for simplicity: `expr : term { push($item[1]) } expr_ expr_ : expr binop { [ BINOP => [ $item[2], pop(), $item[1] ] ] } \| # Automatic match { pop() }` [download] In short, keep looking for terms until you meet something that isn't (an operator), then eat up the last two terms that have been met. I'll come back later to verify this and put it into 5.10 regex syntax. Update: Fixed link	[reply] [d/l] [select]
Re^2: An adventure with 5.10 regular expressions by hipowls (Curate) on Feb 09, 2008 at 09:49 UTC
Thanks for the link, it is great article. I've only skimmed it but it gets to the nub of the problem I had. I'll reread it and explore further, this time with a map;)	[reply]
Re^3: An adventure with 5.10 regular expressions by ikegami (Patriarch) on Feb 09, 2008 at 17:17 UTC
The `push` and `pop` aren't accurate. I was trying to emulate P::RD's args. At the very least, the `push` needs to undo on backtracking (which is easy to implement even in pre-5.10). I need to read up on Perl 5.10 regexps before I can deliver a regexp version of that grammar, but right now, I'm out of town on a small screen. It'll have to wait.	[reply] [d/l] [select]
Re: An adventure with 5.10 regular expressions by Gavin (Archbishop) on Feb 05, 2008 at 14:33 UTC
Fork Handles! or perhaps but not PC Tendjewberrymud	[reply]
Re: An adventure with 5.10 regular expressions by ikegami (Patriarch) on Nov 06, 2008 at 12:25 UTC
I finally got back to this thread! A couple of quick comments first: I wanted to use recursion [...] The obvious choice was a regular expression to match and evaluate Reverse Polish Notation (RPN). That's actually a very poor choice. The advantage of RPN is that it can be parsed with neither backtracking nor recursion. Parsing RPN can be simplified to checking the size of a stack. It occurred to me later that I could eliminate recursion entirely, What should have occurred to you is that you had already eliminated recursion entirely. I feel you left the tracks you intended to visit. You did so when you replaced the recursion with stack checks (e.g. `(?(?{ @stack < 1 }) (FAIL) )`). At that point, you stopped using the regexp engine as a parser since you moved all but the tokenizing back into Perl. Since the intent was to develop a parser as a regexp pattern, let's revisit the recursive method. First, let's go back to the EBNF grammar. `expr := number \| expr ws unary_op \| expr ws expr ws binary_op number := digit {digit} [ '.' {digit} ] \| '.' digit {digit} digit := '0' \| '1' \| '2' \| '3' \| '4' \| '5' \| '6' \| '7' \| '8' \| '9' unary_op := "chs" \| "abs" \| "sqr" \| "sqrt" \| "sin" \| "cos" \| "tan" binary_op := '+' \| '-' \| '' \| '/' \| '^' ws := ' ' \| '\t' \| '\r' \| '\n'` [download] If you tried to build a regexp from that, you'd get `Infinite recursion in regex` [download] When something other than a number is found while trying to match an `expr`, it tries to match an `expr`. It finds something other than a number, so it tries to match an `expr`. It finds something other than a number, so it tries to match an `expr`. ... This type of parser is known as a LL parser. Left recursion ("rule : rule ..." or something that reduces to that) must be eliminated from the rules of a LL parser. This is to what I was alluding in my earlier post. `expr := number \| expr ws unary_op \| expr ws expr ws binary_op` [download] becomes `expr := number expr_ expr_ := ws expr ws binary_op expr_ \| ws unary_op expr_ \|` [download] At this point, building the regexp is straightforward: qr{ (?(DEFINE) (?<expr> (?&number) (?&expr_) ) (?<expr_> (?: \s++ (?: (?&expr) \s++ (?&binary_op) (?&expr_) \| (?&unary_op) (?&expr_) ) )? ) (?<number> ( (?&NUMBER) ) (?{ local @stack = @stack; push @stack, $^N; }) ) (?<unary_op> ( (?&IDENT) ) (?(?{ !exists($unary_ops{$^N}) }) (FAIL) ) (?{ local @stack = @stack; $unary_ops{$^N}->($stack[-1]); }) ) (?<binary_op> ( [-+/^] ) (?{ local @stack = @stack; $binary_ops{$^N}->($stack[-2], pop(@stack)); }) ) # Tokens. # Backtracking can't cause a token # rule to return something different. (?<NUMBER> \d++ (?: [.] \d+ )?+ \| [.] \d++ ) (?<IDENT> [_A-Za-z][_A-Za-z0-9]+ ) ) (?{ local @stack = (); }) \A \s* (?&expr) \s* \z (?{ $result = pop(@stack) }) }sx; [download] Notice the complete lack of stack size check? The grammar enforces the right number of arguments, preventing us from having an invalid stack. By the way, getting an identifier and checking if it's one of the operators is a much more robust then just looking for the operator names. It also results in better error messages. It isn't strictly needed here, but this is an educational venture, right? I made some small changes to the Perl interface, so the harness needs to be changed slightly: #!/usr/bin/perl use 5.010_000; # re features use warnings; use strict; our @stack; our $result; my %unary_ops = ( chs => sub { $_[0] = -$_[0] }, abs => sub { $_[0] = abs $_[0] }, sqr => sub { $_[0] = $_[0] }, sqrt => sub { $_[0] = sqrt $_[0] }, sin => sub { $_[0] = sin $_[0] }, cos => sub { $_[0] = cos $_[0] }, tan => sub { $_[0] = sin $_[0] / cos $_[0] }, ); my %binary_ops = ( '+' => sub { $_[0] += $_[1] }, '-' => sub { $_[0] -= $_[1] }, '/' => sub { $_[0] /= $_[1] }, '' => sub { $_[0] = $_[1] }, '^' => sub { $_[0]= $_[1] }, ); my $RPN = qr{ ... }; my %input = ( ... ); foreach my $line ( keys %input ) { local $result; my ($expression) = $line =~ /($RPN)/; ... } [download] Results: OK: .456 = .456 OK: 123. = 123. OK: 4 2 + = 6 OK: 123 / is invalid, got no result OK: 24 abs = 24 OK: 1 24 abs + sqrt = 5 OK: 12 34 is invalid, got no result OK: 123 34 chs = -4182 OK: 4 2 - = 2 OK: 34 2 / 2 ^ = 289 OK: 24 4 + chs sqr = 784 OK: 123.456 + is invalid, got no result OK: 123 chs 34 * = -4182 OK: fail is invalid, got no result OK: 24 chs abs = 24 OK: 25 abs sqrt = 5 OK: 12 34+ is invalid, got no result OK: 12 34 + + is invalid, got no result OK: 123 chs chs chs = -123 OK: 1 24abs+sqrt is invalid, got no result OK: 12 34 + 56 + = 102 OK: 123 chs = -123 OK: 24 = 24 OK: 1 fail is invalid, got no result OK: 12 34 56 + + = 102 OK: 4 2 ^ = 16 OK: 123 chs chs chs chs = 123 OK: 3 34 2 / 2 ^ * = 867 OK: 25 chs abs sqrt = 5 OK: 123.456 = 123.456 OK: 123 = 123 OK: 123 chs chs = 123 OK: 24 chs = -24 OK: 123 = 123 OK: 24abs is invalid, got no result OK: 4 2 / = 2 OK: 24abssqrt is invalid, got no result OK: 4 2 * = 8 [download]	[reply] [d/l] [select]
Re: An adventure with 5.10 regular expressions by starX (Chaplain) on Feb 07, 2008 at 14:11 UTC
I hate to be one of those "me-too"ers, but this is truly excellent. I regret that I have but one ++ to give. Especially your use of BNF, that was a nice trip down memory lane. --starX www.axisoftime.com	[reply]
Re: An adventure with 5.10 regular expressions by casiano (Pilgrim) on Sep 08, 2009 at 18:34 UTC
Here is a minimal example of infix to postfix translation preserving the correct associativity. For the sake of simplicity I have reduced the grammar to the minus operator: pl@nereida:~/Lperltesting$ cat calc510withactions.pl #!/usr/local/lib/perl/5.10.1/bin//perl5.10.1 use v5.10; # Infix to postfix translator using 5.10 regexp # original grammar: # exp -> exp '-' digits # \| digits # # Applying left-recursion elimination we have: # exp -> digits rest # rest -> '-' rest # \| # empty # my @stack; my $regexp = qr{ (?&exp) (?(DEFINE) (?<exp> (?&digits) \s* (?&rest) ) (?<rest> \s* - (?&digits) \s* (?{ push @stack, '-' }) (?&r +est) \| # empty ) (?<digits> \s* (\d+) (?{ push @stack, $^N }) ) ) }xms; my $input = <>; chomp($input); if ($input =~ $regexp) { say "matches: $&\nStack=(@stack)"; } else { say "does not match"; } [download]	[reply] [d/l]


Perl Monk, Perl Meditation
	PerlMonks