http://www.perlmonks.org?node_id=308283

timothy has asked for the wisdom of the Perl Monks concerning the following question:

Hi,I'm interested in Random Text Generation using Context Free Grammar with Perl. I'm struggling to find any relevant literature or research to date. Also are there any codes and programs around that I can look at. Can anyone point me in the right direction?

Replies are listed 'Best First'.
Re: Perl and Context Free Grammar
by Abigail-II (Bishop) on Nov 19, 2003 at 13:02 UTC
    Most of the CFG stuff you'll see in relation to Perl is about parsing, not generating. And then, most of the time, CF will not be mentioned. Regular expression can be used to parse CFG's although it's not clear (yet?) whether they are powerful enough to parse all CFGs (I believe they do, but I've not even started proving it yet). And then there's Parse::RecDescent. Also for parsing, but if you use the right "code blocks", you might be able to do some generating stuff as well.

    Abigail

      Please make a distinction between regular expressions as known in computer science literature and Perl regular expressions. I know someone like you knows, but many people could get very wrong ideas from a statement such as the above.

      Regular expressions in the computer science sense have a subset of the operators Perl regular expressions have, i.e. concatenation, union, kleene star.

      Regular expressions describe regular languages, context free grammars describe context free languages and it is known (and fairly easy to prove) that reguular languages are a proper subset of context free languages. Hence you can't parse context free language with a regular expression unless it happens to be a regular language.

      Perl regular expressions are more powerful than computer science regular expressions since they've features such as capturing and \1, zero-width assertions and code embedding. It is indeed an open problem what the precise expressive power is.

      Sorry for this piece of pedantry, but IMHO it's an important point to make when addressing a very general audience.

      Just my 2 cents, -gjb-

      A push-down automata <sp?> (or stack machine) is required to read a Context Free Grammer. The Perl "Regular expression engine" is a stack machine due to the way it processes expressions in () and (?...). I would argue (and take the time to flesh out my logic) that the Perl RE Engine is actually a CFG engine.
        It's not able to parse CFG grammers due to it being a stack machine so it can do backreferencing. I believe it can parse CFGs due to its ability to do "delayed regexes" - the (??{ }) construct. 5.004 for instance also uses a stack to parse regexes, but there's no way you can match balanced parens with a 5.004 regex. But you can make a CFG to match balanced parens.

        Abigail

      You suggest that the perl re-engine is as powerful as a pushdown automata. This assumes that we start breaking into use re 'eval' territory with (?{}) and (??{}). Then we are not only PDA but also turing complete. If we use the possibility to include perl code it is trivially turing complete, just use a TURING_START tag and then run the turing machine in the perl code. Thus for the purpose of this little exercise we will not allow ourself to evaluate anything using (?{}). Only recursive regexs in ??{} blocks. First of all, if we allow ourself the /g modifier combined with while we could always do:
      my @strings = qw(aaabbbccc abc aaaaabbbbbccccc abbccc); print "\nAttempt 1\n"; foreach my $str ( @strings ) { print "$str: "; $_ = $str; while (s/^a(a*)b(b*)c(c*)$/$1$2$3/g) {}; print ($_ ? "Rejected" : "Accepted","\n"); }
      It should be noted that this parses the grammar {a^nb^nc^n | n >= 1) which is beyond the capability of a CFG. Note that if we allow ourselfs to move into "perlspace" we can just say:
      print "\nAttempt 2\n"; my $count; my $re2 = qr/^ (?{ $count = 0}) (a (?{ $count++ }))* (??{"b{".$count."}"}) (??{"c{".$count."}"}) /x; foreach my $str ( @strings ) { print "$str: "; print ($str =~ m/$re2/ ? "Accepted" : "Rejected","\n"); }
      Altough this says nothing, since perl is turing complete (since Acme::Ook exists this is known through the proof that brainfuck is :)) But what if we don't want to allow ourself this? This would still use ??{}. Assuming we allow only a simple recursive regex and not a full perl statement, once again to avoid moving the parser into "perl-space". Let make an attempt:

      1. A pushdown automata equivalence

      In this case we should settle for trying to establish an equivalence with a pushdown automata. To do this it would be enough to establish ourselfs as equivalents of a CFG. This means that we should be able to parse any Context-free grammar G. A formal definition(1):
       G = (V,T,R,S)  where: (T is \Gamma).
          V - An alphabet (finite set)
          T - Terminals (subset of V)
          R - Rules a subset of (V-T)xV* and
          S - Startsymbol, an element of (V-T)
      

      2. Encoding of a general PDA into a perl regex.

      2.1 Assumptions

      Assume w.l.o.g that V = \w+.

      2.2 Terminals

      2.2.1 Creating symbols for terminals

      Terminals T are actual 'strings' and can be encoded as trivial regexps. They should be named $alpha_T so that for instance the terminal 'a' becomes $alpha_a = qr/a/ and 'b' becomes $alpha_b = qr/b/. $alpha_T = qr/ T /x;

      2.3 Non-terminals

      2.3.1 Rules

      Each rule r_i should be viewed as the tuple (N,L), where L \subset V*. To create the rule regex $rule_n you should juxtapose the letters v1..vn in L like this:
      $rule_i = qr/ (??{$alpha_v1}) (??{$alpha_v2}) ... (??{$alpha_vn}) /x;

      2.3.2 Creating symbols for non-terminals

      This is the set (V-T) and they are represented by the rules R. For each non-terminal N you should connect all rules (r1..rn) where the first element is N and then construct the alternating rule:
      $alpha_N = qr/ (??{$rule_1}) | (??{$rule_2}) | ... | (??{$rule_n}) /x +;

      2.4 Start symbol

      On of the non-terminals should be named the start-symbol and get special encoding :
      $START = $alpha_N;
      making the start symbol an alias for it.

      2.5 The final regexp

      The regexp G accepting the language is simply $G = qr/ ^ (??{$START}) $ /x;

      3 An example language

      From (pg 116 in (1)):
      Note here the implied \s* after each terminal in T due to a convention in English, you put spaces between words.
       W = {S,A,N,V,P) \union T
       T = {Jim, big, green, cheese, ate}
       R = { P -> N, P -> AP, S -> PVP,     (Rules 1-3)
             A -> big, A->green,            (Rules 4-5) 
             N -> cheese, N-> jim, V-> ate} (Rules 6-8)
      
      Encodes into:

      # Terminals

      $alpha_Jim = qr/ Jim \s* /x; $alpha_big = qr/ big \s* /x; $alpha_green = qr/ green \s* /x; $alpha_cheese = qr/ cheese \s* /x; $alpha_ate = qr/ ate \s* /x;

      # Rules

      $rule_1 = qr/ (??{$alpha_N}) /x; $rule_2 = qr/ (??{$alpha_A}) (??{$alpha_P})/x; $rule_3 = qr/ (??{$alpha_P}) (??{$alpha_V}) (??{$alpha_P})/x; $rule_4 = qr/ (??{$alpha_big}) /x; $rule_5 = qr/ (??{$alpha_green}) /x; $rule_6 = qr/ (??{$alpha_cheese}) /x; $rule_7 = qr/ (??{$alpha_Jim}) /x; $rule_8 = qr/ (??{$alpha_ate}) /x;

      # Non-terminals

      $alpha_P = qr/ (??{$rule_1}) | (??{$rule_2}) /x; $alpha_S = qr/ (??{$rule_3}) /x; $alpha_A = qr/ (??{$rule_4}) | (??{$rule_5}) /x; $alpha_N = qr/ (??{$rule_6}) | (??{$rule_7}) /x; $alpha_V = qr/ (??{$rule_8}) /x;

      # Start and the language G defined

      $START = $alpha_S; $G = qr/ ^ (??{$START}) $ /x;

      #Tests

      @strings = ('Jim ate cheese','big Jim ate green cheese', 'big cheese ate Jim', 'big cheese ate green green big green big cheese', 'ate cheese','Jim ate ate', 'Jim ate big big big big chees +e'); foreach (@strings) { print "Try: $_ -- ", /$G/ ? "Accepted" : "Rejected", "\n"; }

      Conclusion

      This is not a complete proof. I think the mapping is complete but I do not have the time at the moment to prove this. The mapping is also fairly "trivial", but maybe someone will be amused. If nothing else at how much spare time I seem to be having.

      (1) Elements of the theory of computation, Lewis and Papadimitriou

        Ah yes, and I was actually mad enough to make a webpage out of this. http://www.dunata.com/science/compsci/perl-re.php. Amusing might be my little program that parses BNF-like syntax and creates a program that accepts strings belonging to that syntax. Oh the things we do to avoid working.
Re: Perl and Context Free Grammar
by blokhead (Monsignor) on Nov 19, 2003 at 14:11 UTC
    If you're dealing with CFGs in the strict theoretic sense, then there's an obvious naïve way to generate random strings: start with the starting nonterminal, and apply random productions until you are down to all terminals. The drawback being that with some grammars this may take forever (literally or figuratively) to remove all non-terminals. Here's one that generates strings with the same number of a's and b's:
    my %cfg = ( S => [qw/aB bA/], A => [qw/aS bAA a/], B => [qw/bS aBB b/] ); my $string = 'S'; my $regex = join "|" => keys %cfg; print "$string\n" while $string =~ s/($regex)/ $cfg{$1}[ rand @{$cfg{$1}} ] /e; print "$string\n";
    There are better algorithms out there -- A quick googling found this paper which does a good job of explaining why the above method isn't that great, and presents a much better algorithm. Either of these may make a good starting point for your project.

    blokhead

Re: Perl and Context Free Grammar
by gjb (Vicar) on Nov 19, 2003 at 14:56 UTC

    I just remembered that I actually have Perl code doing this. It's from a research project, so there's not much comment and it's dashed off quickly so I'm sure it can be optimized a lot.

    The code is intended to generate all strings described by the CFG upto a given length. In your case, you'll probably want to include some probabilities when choosing alterantives.

    Hope this helps, -gjb-

Re: Perl and Context Free Grammar
by Molt (Chaplain) on Nov 19, 2003 at 13:36 UTC
    You may want to have a look at Parse::RandGen. Not used it myself so can't comment on it's reliability etc but from the documentation it looks ideal for your purpose.
Text generation with Perl
by TheDamian (Vicar) on Nov 19, 2003 at 19:22 UTC
Re: Perl and Context Free Grammar
by mattr (Curate) on Nov 20, 2003 at 06:32 UTC
    Hello,

    Certainly the links other people have posted are probably what you wanted, but maybe there is some more information in here.

    Unfortunately I don't have a degree in computational linguistics, though lately I think it would be nice to have! I've been scouring the web for natural language processing tools to use along with perl for real-world application. Possibly these links may be of use, at least in my limited understanding some links I saw which say they use lexical functional grammars in text generation may be pertinent.

    A list of projects for algorithmic sentence generation I came across yesterday. Some detective work required.

    A number of links under shallow and deep text generation (not from a CFG exactly) at the Natural Language Software Registry. FGW (Functional Grammar Workbench). functional grammar homebase and links to functional discourse grammar.

    FUF / SURGE system as mentioned on thispage. From Ben Gurion, and listed in the NLSR above.

    A short, basic page for other people, with pseudocode discussing the generation of strings of terminal symbols from a CFG. Matt R.