http://www.perlmonks.org?node_id=494297

japhy has asked for the wisdom of the Perl Monks concerning the following question:

Perhaps some monks here have done something similar to what I'm looking to do. I know there are natural language parsing modules on CPAN, but I want to do the opposite -- I want to take a description of a sentence (like "NOUN.SINGULAR VERB.SINGULAR NOUN.PLURAL", although I know that's probably not specific enough) and, given a table defining words and their parts of speech, construct a random sentence. The goal is to produce a random sentence generator that is sensical; the word bank will be comprised of the words and word phrases found in the "The Onion" Refrigerator Magnet Headline Kit.

Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart

Replies are listed 'Best First'.
Re: Natural Language Sentence Production
by ikegami (Patriarch) on Sep 22, 2005 at 21:19 UTC
Re: Natural Language Sentence Production
by GrandFather (Saint) on Sep 22, 2005 at 21:23 UTC

    Haven't you just pretty much described the algorithm? Build tables of words for the various parts of speech, then use a template to choose which table supplies the word for each position in the sentence. If you want to get clever, allow the templates to be defined recursively.
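    A minimal sketch of that table-plus-template idea, not anyone's actual code from this thread. The word lists and the template names (SENTENCE, SUBJECT, OBJECT) are made up for illustration; the part-of-speech labels follow the style used in the question.

```perl
use strict;
use warnings;

# Hypothetical word tables keyed by part of speech; swap in the magnet kit's words.
my %words = (
    'NOUN.SINGULAR' => [ 'senator', 'area man', 'study' ],
    'NOUN.PLURAL'   => [ 'scientists', 'voters', 'kittens' ],
    'VERB.SINGULAR' => [ 'baffles', 'denies', 'enjoys' ],
);

# Templates may name word tables or other templates, which makes them recursive.
my %templates = (
    SENTENCE => ['SUBJECT VERB.SINGULAR OBJECT'],
    SUBJECT  => ['NOUN.SINGULAR'],
    OBJECT   => [ 'NOUN.PLURAL', 'NOUN.SINGULAR' ],
);

# Return one random element of an array reference.
sub pick {
    my ($list) = @_;
    return $list->[ rand @$list ];
}

# Expand a symbol: templates recurse, word tables yield a random word.
sub expand {
    my ($symbol) = @_;
    if ( my $alternatives = $templates{$symbol} ) {
        return join ' ', map { expand($_) } split ' ', pick($alternatives);
    }
    my $table = $words{$symbol} or die "Unknown symbol '$symbol'\n";
    return pick($table);
}

print ucfirst( expand('SENTENCE') ), "\n";
```

    Keeping templates and word tables in separate hashes means a new sentence shape only needs a new template entry; the expansion code stays the same.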

    A Super Search may help. If you have absolutely no idea how to go about this, Re: Perl can do it, take 1 (sentence generation) may help.


    Perl is Huffman encoded by design.
      Ah, that code is pretty much what I envisioned doing before I considered using a language parsing module. I guess I'll go that route.

      Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
Re: Natural Language Sentence Production
by merlyn (Sage) on Sep 22, 2005 at 22:58 UTC
      Interestingly there is a tool that does exactly the same thing as spew - except with graphics. The terminals are shapes (SQUARE, CIRCLE, etc) rather than character sequences.

      It picks a random weighted path through the grammar to draw the drawing.

      Check it out here: Context Free Design Grammar

      Some of the output is downright spooky

      -Andrew.


      Andrew Tomazos  |  andrew@tomazos.com  |  www.tomazos.com
Re: Natural Language Sentence Production
by Nitsuj (Hermit) on Sep 22, 2005 at 22:19 UTC
    So, there's a lot of literature on natural language... Assuming that you want a bit quicker ramp-up than a few months of reading, I'd do a search on citeseer for data on this, since someone has probably written a paper.

    Failing that, you could scan corpora for sentences that contain the words on the magnets, or you could generate a model from corpora (a Bayesian one, for example), constrain the model to sentences containing only those words, and choose the sentences with the highest probability.
      I agree with Nitsuj and I'll do a search on Scirus as well...

      First, sentence generation usually distinguishes between 'syntax'/'grammar' and 'semantics'. The first constraint is that your auto-generated sentences are grammatically correct; the second is that they actually mean something sensible. From what I gather from your post, you only need them to be grammatically correct, not meaningful.

      In fact, your initial corpus (the magnet elements) needs to be mapped (using XML, for instance) so that each element is described by several attributes (noun/adjective/verb, singular/plural, or even gender if the language you are using distinguishes masculine, feminine, and neuter). Some words, especially prepositions and prepositional verbs, can create a lot of trouble in how they link to other elements. You'll find such descriptive grammar models in publications, for example here: Yet Another Head Driven Generator of Natural Language
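      A rough sketch of such an attribute mapping, using a plain Perl data structure in place of XML. The entries, the attribute names (pos, number), and the helper words_matching are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical lexicon: each entry carries a part of speech and a grammatical number.
my @lexicon = (
    { text => 'senator',    pos => 'noun', number => 'singular' },
    { text => 'scientists', pos => 'noun', number => 'plural' },
    { text => 'denies',     pos => 'verb', number => 'singular' },
    { text => 'deny',       pos => 'verb', number => 'plural' },
);

# Return every entry whose attributes match all of the requested key/value pairs.
sub words_matching {
    my (%want) = @_;
    return grep {
        my $entry = $_;
        !grep { ( $entry->{$_} // '' ) ne $want{$_} } keys %want;
    } @lexicon;
}

# Pick a subject, then a verb that agrees with it in number.
my @nouns   = words_matching( pos => 'noun' );
my $subject = $nouns[ rand @nouns ];
my @verbs   = words_matching( pos => 'verb', number => $subject->{number} );
my $verb    = $verbs[ rand @verbs ];
print "$subject->{text} $verb->{text}\n";
```

      Agreement is handled here by looking up the verb with the subject's own number attribute, so adding gender or tense would only mean adding more keys to each entry.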

      EJ
Re: Natural Language Sentence Production
by sauoq (Abbot) on Sep 22, 2005 at 22:30 UTC

    I don't think this will be any help to you, but I did something like this in Lisp once. Only, it worked interactively, the vocabulary was built up from input sentences, and the generation rules were derived essentially by building tables describing word proximity in the input sentences. I called it "sputter". It was originally based on a program called "henley", which appeared in ANSI Common Lisp by Paul Graham, a really good book if you want to learn Lisp.

    Come to think of it, it was nothing like what you describe. It was a whole lot of fun though.

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: Natural Language Sentence Production
by newroz (Monk) on Sep 23, 2005 at 06:39 UTC
    Sean Burke's Chomsky Bot.
    It shows how to construct sentences and paragraphs from provided chunks of grammatical elements.
Re: Natural Language Sentence Production
by quinkan (Monk) on Sep 23, 2005 at 02:48 UTC
    Well, there's Dev::Bollocks.... Oh. You wanted sensicals. Pity. It gives you management language. Completely different thing.
Re: Natural Language Sentence Production
by DrHyde (Prior) on Sep 23, 2005 at 09:16 UTC
Re: Natural Language Sentence Production
by Anonymous Monk on Sep 22, 2005 at 21:10 UTC
    I know there are natural language parsing modules on CPAN, but I want to do the opposite
    The easiest solution? Guess and check. Generate a sentence of random words and reject those which don't pass muster according to your CPAN parsing modules.
      The easiest solution? Guess and check. Generate a sentence of random words and reject those which don't pass mu

      Sounds like a bogosort to me.

      To sort a deck of cards:

      1. Shuffle deck randomly.
      2. If not sorted, go to 1.

      Efficiency in the order of O(lots and lots and lots and....) -Andrew.



        Close, but not quite. It depends on the ratio of "sensible" vs. "nonsensical" phrases, which in turn depends on how much leeway you're willing to give the program. If you want to mimic a rational human, then yes, you're right. But if you're willing to take something less stringent - say, your average irrational person, or worse, a suit - then you might be able to pull it off this way.

        I did something like this with a music-composing program. I randomly generated tones, intervals, and durations, and filtered out the obvious dissonances and silly combinations. What was left was OK - not exactly music, but close to Muzak. It would have passed muster in an elevator, but not at a concert.

      Er. That probably could have been clearer. Don't generate a long random sentence and then check it; build it up a piece at a time.

      while (...) {
          my @sentence = ();
          for (1 .. $random_sentence_length) {
              my $next_word;
              # Redraw until the candidate word fits after the words chosen so far.
              do {
                  $next_word = random_word_generator();
              } while (not grammar_correct(@sentence, $next_word));
              push @sentence, $next_word;
          }
          print "@sentence\n";
      }

      Check out Dave "Pragmatic" Thomas's blog entry Kata14 where you'll be challenged to generate sensical text by using "trigrams"...

      I've tried his solution, and with a varied enough text base to feed it, it generates some impressive texts with very little effort... I guess you can improve it to check longer strings as well, but you'd end up with a Markov Chain (de|re)generator, which could be interesting in itself, but...
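      A small sketch of the trigram idea, not PragDave's own solution: map each pair of adjacent words to the words seen after it, then walk that table at random. The function names build_trigrams and generate are invented here, and the sample text is only a placeholder for a real corpus.

```perl
use strict;
use warnings;

# Build a table mapping each pair of adjacent words to the words seen after it.
sub build_trigrams {
    my ($text) = @_;
    my @w = split ' ', $text;
    my %next;
    push @{ $next{"$w[$_] $w[$_+1]"} }, $w[ $_ + 2 ] for 0 .. $#w - 2;
    return \%next;
}

# Walk the table from a starting pair, picking a random follower at each step.
# Stops at $max_words or when the current pair was never followed by anything.
sub generate {
    my ( $next, $first, $second, $max_words ) = @_;
    my @out = ( $first, $second );
    while ( @out < $max_words ) {
        my $followers = $next->{"$out[-2] $out[-1]"} or last;
        push @out, $followers->[ rand @$followers ];
    }
    return join ' ', @out;
}

my $table = build_trigrams('I wish I may I wish I might');
print generate( $table, 'I', 'wish', 10 ), "\n";
```

      Followers are stored with repeats, so a word that follows a pair more often in the source text is proportionally more likely to be picked.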

      (suggestion: the Gutenberg Project is a great place to get good base texts for feeding the database :-)

      Link: PragDave's Kata Fourteen

      Good luck,

      --
      our $Perl6 is Fantastic;

Re: Natural Language Sentence Production
by mattr (Curate) on Sep 24, 2005 at 18:39 UTC
    You might like to search CPAN for Lingua (like Lingua::EN::Inflect) and WordNet (like WordNet::SenseRelate::Tools or WordNet::Similarity). The field itself is pretty large: it is natural language processing (NLP), or computational linguistics, and you are talking about "sentence generation". But it sounds like you don't really want to get that deeply into it. If you are careful to limit what can be selected into each field, it may sound realistic. Incidentally, you might be interested in ALICE.
Re: Natural Language Sentence Production
by casiano (Pilgrim) on Jan 14, 2009 at 09:58 UTC
    Just in case it helps people trying to solve a similar problem:

    yagg is probably the right tool for that.

    Though Parse::Eyapp was conceived for parsing, versions 1.137 and later provide support for building a phrase generator from a grammar specification. If you want to know more, read the tutorial Parse::Eyapp::datagenerationtut. The example used there produces sequences of assignment statements:

    Parse-Eyapp/examples/generator$ ./Generator.pm
    # result: -710.2
    I=(3*-8+7/5);
    R=2+8*I*4+5*2+I/I

    To specify the language we write a yacc-like grammar, but instead of writing the classical lexer, i.e. one that scans the input to produce the next token, we write a token generator: each time our lexical analyzer is called, it checks the list of expected tokens (available via the method YYExpect) and produces one of them, following some probability distribution. This is the grammar for the calculator:

    Parse-Eyapp/examples/generator$ cat Generator.eyp
    # file: Generator.eyp
    # compile with: eyapp -b '' Generator.eyp
    # then run: ./Generator.pm
    %strict
    %token NUM VARDEF VAR

    %right '='
    %left '-' '+'
    %left '*' '/'
    %left NEG
    %right '^'

    %defaultaction {
      my $parser = shift;

      return join '', @_;
    }

    %{
    use base q{Parse::Eyapp::TokenGen};
    use base q{GenSupport};
    %}

    %%

    stmts:
        stmt
          { # At least one variable is defined now
            $_[0]->deltaweight(VAR => +1);
            $_[1];
          }
      | stmts ';' { "\n" } stmt
    ;

    stmt:
        VARDEF '=' exp
          {
            my $parser = shift;
            $parser->defined_variable($_[0]);
            "$_[0]=$_[2]";
          }
    ;
    exp:
        NUM
      | VAR
      | exp '+' exp
      | exp '-' exp
      | exp '*' exp
      | exp '/' exp
      | '-' { $_[0]->pushdeltaweight('-' => -1) }
        exp %prec NEG {
          $_[0]->popweight();
          "-$_[3]"
        }
      | exp '^' exp
      | '(' { $_[0]->pushdeltaweight(
                '(' => -1,
                ')' => +1,
                '+' => +1,
              );
            }
        exp
        ')'
          {
            $_[0]->popweight;
            "($_[3])"
          }
    ;

    %%

    unless (caller) {
      __PACKAGE__->main(@ARGV);
    }

    The difficult part is the management of the probability distribution to produce reasonable phrases and to avoid very long statements. The generation of tokens and its attributes uses Test::LectroTest::Generator. The support subroutines have been isolated in the module GenSupport.pm (see http://cpansearch.perl.org/src/CASIANO/Parse-Eyapp-1.137/examples/generator/GenSupport.pm ).
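
    To illustrate the kind of weighted choice a token generator has to make, here is a generic sketch in plain Perl. It does not use the Parse::Eyapp::TokenGen or Test::LectroTest::Generator interfaces; the function weighted_pick and the weights in %expected are hypothetical.

```perl
use strict;
use warnings;

# Pick one key of %$weights with probability proportional to its weight.
# Tokens whose weight is zero or negative are never chosen.
sub weighted_pick {
    my ($weights) = @_;
    my @tokens = grep { $weights->{$_} > 0 } sort keys %$weights;
    die "No token with positive weight\n" unless @tokens;

    my $total = 0;
    $total += $weights->{$_} for @tokens;

    # Subtract each weight from a random point in [0, $total) until it goes negative.
    my $r = rand $total;
    for my $token (@tokens) {
        return $token if ( $r -= $weights->{$token} ) < 0;
    }
    return $tokens[-1];    # guard against floating-point rounding
}

# Hypothetical weights for the tokens expected at some point in a derivation.
my %expected = ( NUM => 3, VAR => 1, '(' => 1, '-' => 0 );
print weighted_pick( \%expected ), "\n";
```

    Lowering a token's weight while inside a construct, as the grammar does for '(' and '-', is one way to keep nested expressions from growing without bound.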