You suggest that the Perl regex engine is as powerful as a pushdown
automaton. This assumes that we start breaking into use re 'eval'
territory with (?{}) and (??{}).
Then we are not only as powerful as a PDA but also Turing complete. If we
use the possibility to include Perl code it is trivially Turing complete:
just use a TURING_START tag and then run the Turing machine in the Perl
code.
Thus for the purpose of this little exercise we will not allow ourselves
to evaluate anything using (?{}). Only recursive regexes in (??{}) blocks.
First of all, if we allow ourselves the /g modifier combined with while
we could always do:
my @strings = qw(aaabbbccc abc aaaaabbbbbccccc abbccc);
print "\nAttempt 1\n";
foreach my $str ( @strings ) {
print "$str: ";
$_ = $str;
while (s/^a(a*)b(b*)c(c*)$/$1$2$3/g) {}   # strip one a, one b and one c per pass
print length($_) ? "Rejected" : "Accepted", "\n";
}
It should be noted that this recognizes the language
{a^n b^n c^n | n >= 1}
which is not context-free and hence beyond the capability of a CFG.
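To see why the loop works, it can help to print the string after each pass. The following is only a small sketch; the names $s and @trace are made up for the illustration and are not part of the code above.

```perl
use strict;
use warnings;

my $s     = 'aaabbbccc';
my @trace = ($s);
# Record the string after every successful pass of the substitution.
push @trace, $s while $s =~ s/^a(a*)b(b*)c(c*)$/$1$2$3/;
print join(' -> ', map { length($_) ? $_ : '(empty)' } @trace), "\n";
# prints: aaabbbccc -> aabbcc -> abc -> (empty)
```

Each pass removes exactly one a, one b and one c, so the string only shrinks to nothing when the three counts are equal.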
Note that if we allow ourselves to move into "perlspace" we can just say:
print "\nAttempt 2\n";
my $count;
my $re2 = qr/^ (?{ $count = 0})
(a (?{ $count++ }))*
(??{"b{".$count."}"})
(??{"c{".$count."}"})
$ /x;   # anchor the end, otherwise trailing characters would be accepted
foreach my $str ( @strings ) {
print "$str: ";
print ($str =~ m/$re2/ ? "Accepted" : "Rejected","\n");
}
Although this says nothing, since Perl is Turing complete (since
Acme::Ook exists, this is known through the proof that brainfuck is :))
But what if we don't want to allow ourselves this? We would still use
(??{}), but allow only a simple recursive regex inside it and not a full
Perl statement, once again to avoid moving the parser into
"perl-space". Let us make an attempt:
1. A pushdown automaton equivalence
In this case we should settle for trying to establish an equivalence
with a pushdown automaton. To do this it is enough to show that we are
at least as powerful as a CFG, since CFGs and PDAs accept the same
languages. This means that we should be able to parse any context-free
grammar G. A formal definition (1):
G = (V,T,R,S) where: (T is \Sigma in (1)).
V - An alphabet (finite set)
T - Terminals (subset of V)
R - Rules, a finite subset of (V-T) x V*, and
S - Start symbol, an element of (V-T)
2. Encoding of a general CFG into a Perl regex.
2.1 Assumptions
Assume w.l.o.g. that every symbol in V has a name matching \w+, so that it
can be used as part of a Perl variable name.
2.2 Terminals
2.2.1 Creating symbols for terminals
Terminals T are actual 'strings' and can be encoded as trivial
regexps. They should be named $alpha_T so that for instance the
terminal 'a' becomes $alpha_a = qr/a/ and 'b' becomes $alpha_b = qr/b/.
$alpha_T = qr/ T /x;
2.3 Non-terminals
2.3.1 Rules
Each rule r_i should be viewed as the tuple (N,L), where N \in (V-T) and
L \in V*. To create the rule regex $rule_i you should juxtapose the
letters v1..vn of L like this:
$rule_i = qr/ (??{$alpha_v1}) (??{$alpha_v2}) ... (??{$alpha_vn}) /x;
2.3.2 Creating symbols for non-terminals
This is the set (V-T) and they are represented by the rules R. For
each non-terminal N you should collect all rules (r1..rn) whose
first element is N and then construct the alternating rule:
$alpha_N = qr/ (??{$rule_1}) | (??{$rule_2}) | ... | (??{$rule_n}) /x;
2.4 Start symbol
One of the non-terminals should be named the start symbol and gets a
special encoding:
$START = $alpha_N;
making the start symbol an alias for it.
2.5 The final regexp
The regexp G accepting the language is simply
$G = qr/ ^ (??{$START}) $ /x;
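As a quick sanity check of the construction, here is a minimal sketch for a tiny grammar that is my own choice for illustration: S -> a S b | a b, which generates {a^n b^n | n >= 1}. The variables are declared with our so that strict is satisfied while the (??{}) blocks still look them up at match time; that declaration style is an assumption of the sketch, not part of the scheme.

```perl
use strict;
use warnings;

# Hypothetical example grammar: S -> a S b | a b
our ($alpha_a, $alpha_b, $rule_1, $rule_2, $alpha_S, $START, $G);

# Terminals (2.2)
$alpha_a = qr/ a /x;
$alpha_b = qr/ b /x;

# Rules (2.3.1)
$rule_1 = qr/ (??{$alpha_a}) (??{$alpha_S}) (??{$alpha_b}) /x;
$rule_2 = qr/ (??{$alpha_a}) (??{$alpha_b}) /x;

# Non-terminals (2.3.2)
$alpha_S = qr/ (??{$rule_1}) | (??{$rule_2}) /x;

# Start symbol and final regexp (2.4, 2.5)
$START = $alpha_S;
$G     = qr/ ^ (??{$START}) $ /x;

for my $str (qw(ab aabb aaabbb aab abb ba)) {
    printf "%-7s %s\n", $str, $str =~ $G ? "Accepted" : "Rejected";
}
```

The first three strings should be accepted and the last three rejected. Note that $rule_1 can mention $alpha_S before it is assigned, because the block is only evaluated when the match reaches it.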
3 An example language
From (1), pg 116.
Note the implied \s* after each terminal in T: by English convention,
words are separated by spaces. The alphabet is called W here because V is
already used for a non-terminal.
W = {S,A,N,V,P} \union T
T = {Jim, big, green, cheese, ate}
R = { P -> N, P -> AP, S -> PVP, (Rules 1-3)
A -> big, A -> green, (Rules 4-5)
N -> cheese, N -> Jim, V -> ate } (Rules 6-8)
Encodes into:
# Terminals
$alpha_Jim = qr/ Jim \s* /x;
$alpha_big = qr/ big \s* /x;
$alpha_green = qr/ green \s* /x;
$alpha_cheese = qr/ cheese \s* /x;
$alpha_ate = qr/ ate \s* /x;
# Rules
$rule_1 = qr/ (??{$alpha_N}) /x;
$rule_2 = qr/ (??{$alpha_A}) (??{$alpha_P})/x;
$rule_3 = qr/ (??{$alpha_P}) (??{$alpha_V}) (??{$alpha_P})/x;
$rule_4 = qr/ (??{$alpha_big}) /x;
$rule_5 = qr/ (??{$alpha_green}) /x;
$rule_6 = qr/ (??{$alpha_cheese}) /x;
$rule_7 = qr/ (??{$alpha_Jim}) /x;
$rule_8 = qr/ (??{$alpha_ate}) /x;
# Non-terminals
$alpha_P = qr/ (??{$rule_1}) | (??{$rule_2}) /x;
$alpha_S = qr/ (??{$rule_3}) /x;
$alpha_A = qr/ (??{$rule_4}) | (??{$rule_5}) /x;
$alpha_N = qr/ (??{$rule_6}) | (??{$rule_7}) /x;
$alpha_V = qr/ (??{$rule_8}) /x;
# Start and the language G defined
$START = $alpha_S;
$G = qr/ ^ (??{$START}) $ /x;
#Tests
@strings = ('Jim ate cheese','big Jim ate green cheese',
'big cheese ate Jim',
'big cheese ate green green big green big cheese',
'ate cheese','Jim ate ate', 'Jim ate big big big big cheese');
foreach (@strings) {
print "Try: $_ -- ", /$G/ ? "Accepted" : "Rejected", "\n";
}
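The hand translation above is mechanical, so it can be generated. What follows is a rough sketch only: grammar_to_regex, %alpha and the hash layout are names I made up, it assumes a reasonably recent Perl (5.18 or later), and because the patterns are assembled at run time it needs use re 'eval'. It keeps its state in a single package variable, so it handles one grammar at a time.

```perl
use strict;
use warnings;
use re 'eval';    # needed: the patterns are assembled at run time and contain (??{})

our %alpha;       # symbol name => compiled regex; a package variable so the
                  # run-time code blocks can reach it as %main::alpha

# grammar_to_regex is a hypothetical helper. $rules maps each non-terminal to
# a list of right-hand sides; every symbol that is not a key of %$rules is
# treated as a terminal and matched literally, followed by optional spaces.
sub grammar_to_regex {
    my ($rules, $start) = @_;
    %alpha = ();

    # Section 2.2: one regex per terminal.
    for my $rhs (map { @$_ } values %$rules) {
        for my $sym (grep { !exists $rules->{$_} } @$rhs) {
            $alpha{$sym} = qr/ \Q$sym\E \s* /x;
        }
    }

    # Section 2.3: juxtapose the symbols of a rule, alternate over the rules.
    my $ref = sub { sprintf q{(??{ $main::alpha{'%s'} })}, $_[0] };
    for my $nt (keys %$rules) {
        my $alts = join ' | ',
                   map { join ' ', map { $ref->($_) } @$_ } @{ $rules->{$nt} };
        $alpha{$nt} = qr/ (?: $alts ) /x;
    }

    # Sections 2.4 and 2.5: anchor the start symbol.
    my $top = $ref->($start);
    return qr/ ^ $top $ /x;
}

my $G2 = grammar_to_regex({
    S => [ [qw(P V P)] ],
    P => [ [qw(N)], [qw(A P)] ],
    A => [ [qw(big)], [qw(green)] ],
    N => [ [qw(cheese)], [qw(Jim)] ],
    V => [ [qw(ate)] ],
}, 'S');

for my $str ('Jim ate cheese', 'big Jim ate green cheese', 'ate cheese', 'Jim ate ate') {
    print "Try: $str -- ", ($str =~ $G2 ? "Accepted" : "Rejected"), "\n";
}
```

The rule regexes are folded directly into the alternation for each non-terminal instead of getting their own $rule_i variables; the resulting language should be the same.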
Conclusion
This is not a complete proof. I think the mapping is complete, but I do
not have the time at the moment to prove it. The mapping is also
fairly "trivial", but maybe someone will be amused, if by nothing else
then by how much spare time I seem to have.
(1) Elements of the Theory of Computation, Lewis and Papadimitriou