The obvious choice was a regular expression to match and evaluate Reverse Polish Notation (RPN). I still pine for my
An easy start
The BNF is fairly straight forward.
expr := number
| expr ws unary_operator
| expr ws expr ws binary_operator
number := digit {digit} [ '.' {digit} ]
| '.' digit {digit}
digit := '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
unary_op := "chs" | "abs" | "sqr" | "sqrt" | "sin" | "cos" | "tan"
binary_op := '+' | '-' | '*' | '/' | '^'
ws := ' ' | '\t' | '\r' | '\n'
Note: I use "chs" as shorthand for 'c' 'h' 's'
Translation from a BNF production to a perl regular expression is not difficult. Production rules have names and refer to each other by name, perl regexes now have named captures and they can fill the same role. For example
digit -> /(?<digit> \d )/msx
number -> /(?<number> (?&digit)+ (?:[.](?&digit)*)
| [.](?&digit)+ )/msx
in practice digit and ws are so simple I just inlined them.
This is the complete regular expression generated from that BNF to match an arbitrary RPN expression.
qr{
(?(DEFINE)
(?<expr>
(?&number)
| (?&expr) \s+ (&unary_op)
| (?&expr) \s+ (?&expr) \s+ (&binary_op)
)
(?<number>
\d+ (?: [.] \d* )?
| [.] \d+
)
(?<unary_op>
(?: chs | abs | sqr | sqrt | sin | cos | tan )
)
(?<binary_op>
[-+/*^]
)
)
\A \s* (?&expr) \s* \z
}msx;
Feeling fairly pleased with myself I started to test it and second try, matching against
'12 34 +', I got
Infinite recursion in regex at rpg.pl line 136. I tinkered a bit sticking print statements into the regex. The regex gets stuck at position 0 and I couldn't figure out how to break the recursion. The problem is that I need some way to force the
expr clause to never backtrack if the
number alternative matches. For example when matching
'12 chs' the
number alternative of
expr matches 12 but the whole fails because the end of the string hasn't been reached, it then tries
expr ws unary_op, recurses back to
expr and
number grabs 12 but then fails because it hasn't got to the end of the string, so it backtracks out and it now attempts
expr again, and again, and...
I probably should have paid more attention when the lecturer talked about top down parsers and LLR grammars. If any monk can shed light I will be most happy. As an aside, I discovered that say isn't recognized in a regex code block as I got this error
String found where operator expected at (re_eval 10) line 1, near "say
+ "matching at ""
(Do you need to predeclare say?)
syntax error at (re_eval 10) line 1, near "say "matching at ""
Compilation failed in regexp at rpg.pl line 137.
It also gives an error for given/when, I didn't test state as that is too bizarre.
Backtracking regexes
It looked like I needed to rethink my approach. If I were writing an RPN evaluator from scratch, I'd
- parse the string from the start
- pick out the tokens
- check the first token is a number
- push numbers onto a stack
- apply unary operators to the top of the stack
- pop the top element and apply a binary operator to it and the top of the stack
- check the stack had exactly one number at the end of the input
This still looks like a good match for regular expressions and I can reuse some of the definitions from the failed attempt.
My next regex was this
(?(DEFINE)
(?<expr>
(?&number)
(?:
\s*
(?:
(?&number)
| (?&unary_op)
| (?&binary_op)
)
)*
)
(?<number>
(?> \d+ (?: [.] \d* )? | [.] \d+ )
(?{ $^R + 1})
)
(?<unary_op>
(?i: chs | abs | sqr | sqrt | sin | cos | tan )
(?![a-zA-Z])
)
(?<binary_op>
[-+/*^]
(?{ $^R - 1 })
)
)
\A \s* (?&expr) \s* \z
(?(?{ $^R != 1 }) (*FAIL) )
}msx
The special variable $^R contains the value of the last (?{ code }) block executed. Is localized so that it's value is restored on backtracking. I use it to keep track of the number of numbers seen. This did indeed pass my tests.
A working calculator
Having got this far it would seem a simple matter to extract the tokens but this paragraph from perlre points out the difficulties ahead.
An example of how this might be used is as follows:
/(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
(?(DEFINE)
(?<NAME_PAT>....)
(?<ADRESS_PAT>....)
)/x
Note that capture buffers matched inside of recursion are not accessible after the recursion returns, so the extra layer of capturing buffers is necessary. Thus $+{NAME_PAT} would not be defined even though $+{NAME} would be.
The entire RPN expression can be matched with /($RPN)/ but you can't access any of the internal matches made during a recursive regex. I wanted to somehow capture the internal matches and use them to evaluate the expression. There is a bit of work to do. Some points to remember
- Lexical variables don't reliably work with recursive regexes, need to use globals
- Using local when modifying a variable restores the value when backtracking occurs
- Global variables need to be in a known good state before starting the match, local helps here as the original value is restored on regex exit. You could reset them before using the regex but it seems better practice to let the regex do it.
To keep track of numbers I used @stack and modified the number rule by adding capturing parentheses, $^N is the value of the last capture and the code section pushes it onto the stack.
(?<number>
( (?> \d+ (?: [.] \d* )? | [.] \d+ ) )
(?{ local @stack = ( @stack, $^N ) })
)
I made similar transformations of unary_op and binary_op.
This is the complete regex.
qr{
(?(DEFINE)
(?<expr>
(?&number)
(?:
\s+
(?:
(?&number)
| (?&unary_op)
| (?&binary_op)
)
)*
)
(?<number>
( \d+ (?: [.] \d* )? | [.] \d+ )
(?{ local @stack = ( @stack, $^N ) })
)
(?<unary_op>
( chs | abs | sqr | sqrt | sin | cos | tan )
(?(?{ @stack < 1 }) (*FAIL) )
(?{ local @stack = @stack; $operators{$^N}->($stack[-1]) })
)
(?<binary_op>
( [-+/*^] )
(?(?{ @stack < 2 }) (*FAIL) )
(?{ local @stack = @stack; $operators{$^N}->($stack[-2], pop @
+stack) })
)
)
\A \s* (?&expr) \s* \z
(?(?{ @stack != 1 }) (*FAIL) )
(?{ $result = $stack[0] })
|
(?{ $result = undef }) (*FAIL)
}msx
Note the use of (*FAIL) clauses to check that the stack has the expected number of elements and, combined with alternation, to reset $result to undef if no match occurred.
For the sake of brevity (although I feel I can't be accused of that) I have omitted error checking. A divide by zero or square root of a negative number will cause the regex to die.
On reflection
It occurred to me later that I could eliminate recursion entirely, without recursion I didn't need the extra level of capturing parentheses as I could use $+{foo} instead. Also I could put the number | unary_op | binary_op clause in (?> ). Without backtracking I no longer need to localize @stack. I did have to ensure that @stack was initialized to () at the start of the regex. $result = undef was also moved to the start. I think it better to initialize variables before use rather than relying on them being put back after the last time (I have small children, I know it doesn't happen;)
qr{
(?{ @stack = ();
$result = undef;
})
\A \s*
(?<start_number> \d+ (?: [.] \d* )? | [.] \d+ )
(?{ push @stack, $+{start_number} })
(?>
\s+
(?:
(?<number> \d+ (?: [.] \d* )? | [.] \d+ )
(?{ push @stack, $+{number} })
|
(?<unary_op> chs | abs | sqrt | sqr | sin | cos | tan )
(?(?{ @stack < 1 }) (*FAIL) )
(?{ $operators{ $+{unary_op} }->($stack[-1]) })
|
(?<binary_op> [-+/*^] )
(?(?{ @stack < 2 }) (*FAIL) )
(?{ $operators{ $+{binary_op} }->($stack[-2], pop @stack)
+})
)
)*
\s* \z
(?(?{ @stack != 1 }) (*FAIL) )
(?{ $result = $stack[0] })
}msx,
A small gotcha was that any expression with sqrt in it failed. The regex matched sqr and could not backtrack to try sqrt, the solution I used was to reverse the order of the two words, I could also have combined them as sqrt?.
Emma Chissit
A popular clash of Australian and English English occurred in 1964 when Monica Dickens, an English writer, was signing her latest book in Sydney. As told by the Sydney Morning Herald, 30 November 1964, the conversation between one of the female Australians in the queue and the author was as follows:
Aussie: Emma Chissit
Author, thinking that was her name, wrote "To Emma Chissit" in the book cover.
Aussie (speaking deliberately): No, emma chissit?
The australian was, of course, speaking strine and 'How much is it?' was the question.
To see what is the overhead of recursion I used Benchmark.
my $input = join ' ', (1) x 1_001, ('+') x 1_000;
cmpthese(
100,
{
recursive => sub { $input1 =~ $RPN1 },
iterative => sub { $input2 =~ $RPN2 },
},
);
s/iter recursive flat
recursive 1.35 -- -99%
iterative 1.47e-02 9103% --
Recursion was more expensive than I had expected, Admittedly this is a pathalogical case where the stack becomes huge. Where the stack has at most 2 elements the difference is considerably less.
# my $input = join ' ', 1, ('1 +') x 1_000;
Rate recursive iterative
recursive 40.5/s -- -40%
iterative 67.6/s 67% --
Even so a relatively small input and stack suffers.
# my $input = join ' ', (1) x 4, ('+') x 3;
Rate recursive iterative
recursive 14306/s -- -17%
iterative 17271/s 21% --
Just to see how much the evaluation of the RPN expression affected the result I ran the a benchmark against the same data but with non evaluating regexes that use $^R to keep track of the number of numbers seen. That is the first working regex and this one adapted from the iterative version.
qr{
\A \s*
(?<start_number> \d+ (?: [.] \d* )? | [.] \d+ )
(?{ 1 })
(?>
\s+
(?:
(?<number> \d+ (?: [.] \d* )? | [.] \d+ )
(?{ $^R + 1})
|
(?<unary_op> chs | abs | sqrt | sqr | sin | cos | tan )
|
(?<binary_op> [-+/*^] )
(?{ $^R - 1 })
)
)*
\s* \z
(?(?{ $^R != 1 }) (*FAIL) )
}msx,
# my $input = join ' ', (1) x 1_001, ('+') x 1_000;
Rate recursive iterative
recursive 83.6/s -- -80%
iterative 415/s 396% --
# my $input = join ' ', 1, ('1 +') x 1_000;
Rate recursive iterative
recursive 82.2/s -- -78%
iterative 376/s 357% --
# my $input = join ' ', (1) x 4, ('+') x 3;
Rate recursive iterative
recursive 28249/s -- -69%
iterative 92593/s 228% --
As you can see the difference is quite marked.
Here is a script that tests the regex.
#!//net/perl/5.10.0/bin/perl
use warnings;
use strict;
use 5.010_000;
our @stack;
our $result;
my %operators = (
'+' => sub { $_[0] += $_[1] },
'-' => sub { $_[0] -= $_[1] },
'/' => sub { $_[0] /= $_[1] },
'*' => sub { $_[0] *= $_[1] },
'^' => sub { $_[0]**= $_[1] },
chs => sub { $_[0] = -$_[0] },
abs => sub { $_[0] = abs $_[0] },
sqr => sub { $_[0] *= $_[0] },
sqrt => sub { $_[0] = sqrt $_[0] },
sin => sub { $_[0] = sin $_[0] },
cos => sub { $_[0] = cos $_[0] },
tan => sub { $_[0] = sin $_[0] / cos $_[0] },
);
my $RPN = qr{
(?{ @stack = () })
(?{ $result = undef })
\A \s*
(?<start_number> \d+ (?: [.] \d* )? | [.] \d+ )
(?{ push @stack, $+{start_number} })
(?>
\s+
(?:
(?<number> \d+ (?: [.] \d* )? | [.] \d+ )
(?{ push @stack, $+{number} })
|
(?<unary_op> chs | abs | sqrt | sqr | sin | cos | tan )
(?(?{ @stack < 1 }) (*FAIL) )
(?{ $operators{ $+{unary_op} }->($stack[-1]) })
|
(?<binary_op> [-+/*^] )
(?(?{ @stack < 2 }) (*FAIL) )
(?{ $operators{ $+{binary_op} }->($stack[-2], pop @stack)
+})
)
)*
\s* \z
(?(?{ @stack != 1 }) (*FAIL) )
(?{ $result = $stack[0] })
}msx,
my %input = (
'fail' => undef,
'1 fail' => undef,
'123' => 123,
'123.456' => 123.456,
'123.456 +' => undef,
'.456' => .456,
'123.' => 123.,
'123 /' => undef,
' 123 ' => 123,
'123 chs' => -123,
'123 chs chs' => 123,
'123 chs chs chs' => -123,
'123 chs chs chs chs' => 123,
'4 2 ^' => 16,
'4 2 /' => 2,
'4 2 +' => 6,
'4 2 *' => 8,
'4 2 -' => 2,
'12 34+' => undef,
'12 34' => undef,
'12 34 + +' => undef,
'12 34 56 + +' => 102,
'12 34 + 56 +' => 102,
'123 chs 34 *' => -4_182,
'123 34 * chs' => -4_182,
'24' => 24,
'24 abs' => 24,
'24abs' => undef,
'24abssqrt' => undef,
'1 24abs+sqrt' => undef,
'1 24 abs + sqrt' => 5,
'24 chs abs' => 24,
'24 chs' => -24,
'25 abs sqrt' => 5,
'25 chs abs sqrt' => 5,
'24 4 + chs sqr' => 784,
'34 2 / 2 ^' => 289,
'3 34 2 / 2 ^ *' => 867,
);
foreach my $line ( keys %input ) {
my ($expression) = $line =~ /($RPN)/;
if ( !defined $result && !defined $input{$line} ) {
say "OK: $line is invalid, got no result";
}
elsif ( !defined $result ) {
say "FAILED: $line expected $input{$line}, got no result";
}
elsif ( !defined $input{$line} ) {
say "FAILED: $line got $result, expected no result";
}
elsif ( $result == $input{$line} ) {
say "OK: $expression = $result";
}
else {
say "FAILED: $line got $result, expected $input{$line}";
}
}
Final thoughts
- regextidy would make my day.
- A correct BNF is useful even if you can't turn it into a working regular expression.
- The new regular features are powerful.
- I don't yet completely understand how to use them.
- Capturing matches during recursion requires hand crafting.
- Practice makes, if not perfect, then better.
- Recursion is expensive.
- I had fun;-)
- regextidy would really make my day.
Postscript
I started by looking for a good introdutory tutorial for the new features and, when I couldn't find anything, then to write a simple tutorial showing most of them. It didn't turn out that way but at the least I warn others of the pitfalls and, I hope, show a field worth exploring. With help I think I can write such an article so all criticism, tips and references will be appreciated. This is my first long posting so any suggestions regarding presentation are especially welcome.
References