
natural language sentence construction

by thpfft (Chaplain)
on Jun 17, 2001 at 15:03 UTC (#89148=perlquestion)
thpfft has asked for the wisdom of the Perl Monks concerning the following question:

Most of my work involves interface-building, and I've found that once the logic and presentation are reasonably together, the biggest factor in the success or failure of the system is the tone and clarity with which it addresses the user.

So i've become obsessed with writing ui scripts that use proper colloquial english. People react so much better to a page that tells them what's going on in a conversational way that i almost don't mind bloating the scripts with sentence construction code and obfuscating the templates with grammatical conditionals.

Here's a very simple example, dug out of the middle of a script i'm updating at the moment. The end result is a sentence in the form:

We found 27 campaign updates and case studies relevant to children and young people, death penalty and the Americas.

except sometimes it's only one type of document, or no restriction at all, or only one keyword, only two results, and so on. There are dozens of permutations, and the proper way of describing the situation is different each time in small but vital ways. This excerpt is as close as I've come without spelling everything out:

    my @records = qw(1 2 3 4 5 6);
    @input::id = qw(id1 id2 id3 id4);
    @input::type = qw(document person);
    print summarise(\@input::id, \@input::type, scalar(@records));
    exit;

    sub summarise {
        my ($ids,$types,$matches) = @_;
        my $sentence = 'We found ';
        $sentence .= $matches || 'no';
        if (@$types) {
            foreach my $i (0..$#$types) {
                if ($i && $i == $#$types) {
                    $sentence .= ($matches > 1) ? ' and ' : ' or ';
                } elsif ($i) {
                    $sentence .= ', ';
                }
                # document types are in the database
                # with singular and plural forms of their title
                # but i've skipped that part here
                $sentence .= qq| <a href="link">$$types[$i]</a>|;
            }
        } else {
            # leading space keeps the count and the noun apart ("6 items")
            $sentence .= " item";
            # plural for everything except exactly one match ("no items")
            $sentence .= "s" unless ($matches == 1);
        }
        $sentence .= ' relevant to ';
        $sentence .= 'both ' if (@$ids == 2);
        $sentence .= 'all of ' if (@$ids > 2);
        foreach my $i (0..$#$ids) {
            if ($i && $i == $#$ids) {
                $sentence .= ' and ';
            } elsif ($i) {
                $sentence .= ', ';
            }
            # keyword titles also looked up from the database really.
            $sentence .= qq|<a href="link">$$ids[$i]</a>|;
        }
        return $sentence;
    }

If anyone is interested enough to make this more elegant - or just play golf with it - i'd be much obliged.

but my main question: is there a module or project that'll do some of this work for me? CPAN yields a lot of stemming and other mechanisms designed to make words more friendly to computers, but not much designed to make them more friendly to people.

If there isn't any such module, i'd like to start building one. I imagine something extensibly rule-based with a relatively small number of abstract construction mechanisms for common sentence forms, and a vocabulary of prepositions and articles and so on. Ideally swappable into languages other than English, one day. Any views about feasibility or functionality?


Re: natural language sentence construction
by Arguile (Hermit) on Jun 17, 2001 at 15:41 UTC

    While I don't know much about natural language parsing -- many people will tell you *I* need a better grammatical parser unit -- you might want to check out Alice. Alice won the 2000 Loebner Prize, a competition built around the Turing test.

    The basis of a Turing test is that a subject, conversing (currently by text) with both a human and a computer, cannot tell which is which. While the majority of the code would be of only academic interest to you, the AI must intelligently construct complex yet grammatically correct sentences, so it follows that it would have some sort of "reverse parser". I don't know how tightly coupled the grammatical ruleset and sentence constructors are to the rest of the code base, but you might want to give it a look -- even if only for ideas. The code is freely available here.

    More on Alan Turing, sometimes dubbed "The Founder of Computer Science"
Re: natural language sentence construction
by ariels (Curate) on Jun 17, 2001 at 16:05 UTC
    Conway's Lingua::EN::Inflect deals with pluralising (and the `a'/`an' distinction) for English.
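The count-and-noun agreement that a module like this automates can be sketched by hand. What follows is a minimal stand-in and not that module's API: `count_noun` and the `%PLURAL` table are hypothetical names, and the table imitates the question's remark that singular and plural titles are stored side by side in the database.

```perl
use strict;
use warnings;

# Hypothetical table of irregular plurals; anything missing just gets an "s".
my %PLURAL = (
    'case study' => 'case studies',
    'person'     => 'people',
);

# Build "no items", "1 item", "27 items" and so on.
sub count_noun {
    my ($count, $singular) = @_;
    my $noun = $count == 1
        ? $singular
        : ($PLURAL{$singular} // $singular . 's');
    return ($count || 'no') . ' ' . $noun;
}

print count_noun(0,  'item'), "\n";
print count_noun(1,  'person'), "\n";
print count_noun(27, 'case study'), "\n";
```

A real vocabulary needs far more than a fallback "s", which is the argument for reaching for an existing inflection module rather than growing this table.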

    I'm not sure how you'd internationalise your code to make it "swappable into languages other than English". You'd need a very good understanding of linguistics; some parts of the sentence that remain fixed in English could change drastically for other languages.

Re: natural language sentence construction
by dimmesdale (Friar) on Jun 17, 2001 at 20:10 UTC
    My view is that this is near impossible. Well, depending on what you want to do. Parsing natural language and attempting to make it more "friendly" to your users (a) is highly specific to your particular application, and (b) requires many, *many* special cases. There is a CPAN module for making a word plural, and that can be useful... but as for parsing an English sentence, that is near impossible with current technology, if only because language is ambiguous, even to native speakers at times (how often do people ask others to repeat what they said?). There was a node on this a short time ago, I believe, when someone asked if there was a module for grammar checking: called English/Grammer or something similar, if I remember. As for helping you with your code, there is one area where I see it could be made clearer:
        foreach my $i (0..$#$types) {
            if ($i && $i == $#$types) {
                $sentence .= ($matches > 1) ? ' and ' : ' or ';
            } elsif ($i) {
                $sentence .= ', ';
            }
            # ...
        }
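    This can be written more clearly (I hope) as:
        for (@$types) {
            # caveat: unlike the original, this also adds a separator before
            # the first element, and it compares by value, so a repeated
            # type would match before the end of the list
            if ($_ eq $types->[-1]) {
                $sentence .= $matches > 1 ? ' and ' : ' or ';
            } else {
                $sentence .= ', ';
            }
            # ...
        }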
    Maybe it's just me, but it took a couple looks to see why you were doing what you were doing. This way, hopefully, the -1 signifies that you're checking it against the last element. But, there's more than one way to do it.
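Another way to avoid the index bookkeeping altogether is to split off the last element and join the two parts. This is only a sketch: `join_list` is a hypothetical helper that does not appear in the original script.

```perl
use strict;
use warnings;

# Join items as "a", "a and b" or "a, b and c".
# The conjunction defaults to 'and'; pass 'or' for the other case.
sub join_list {
    my ($conjunction, @items) = @_;
    $conjunction ||= 'and';
    return ''        unless @items;
    return $items[0] if @items == 1;
    my $last = pop @items;
    return join(', ', @items) . " $conjunction $last";
}

my $matches = 6;
print 'We found ', $matches, ' ',
      join_list($matches > 1 ? 'and' : 'or', qw(documents people)), "\n";
```

Because the separator logic lives in one place, the same helper serves both the document types and the keyword list in the question's `summarise` routine.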
Re: natural language sentence construction
by toma (Vicar) on Jun 17, 2001 at 22:31 UTC
    Syntax analysis is still an area of active research in linguistics. The language used is often Lisp, but Perl is probably also a reasonable choice. Your problem is slightly easier than the corresponding 'grand challenge' problem, which is understanding natural language. In general, it is easier to transmit than to receive.

    Tree data structures are often used in natural language work. Chapter 8 of Mastering Algorithms in Perl is about tree data structures, which are handled by the graph modules.
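As a rough illustration of how a tree can drive sentence generation, the sketch below stores a phrase as nested array references and flattens it depth-first. The node layout and the `realise` function are assumptions made for this example; they are not taken from the book or from any module.

```perl
use strict;
use warnings;

# Each node is [ label, children... ]; a leaf is a plain string.
my $tree =
    [ S =>
        [ NP => 'We' ],
        [ VP => 'found',
            [ NP => '27', 'campaign updates' ],
            [ PP => 'relevant to', [ NP => 'the Americas' ] ],
        ],
    ];

# Walk the tree depth-first and collect the leaves in order.
sub realise {
    my ($node) = @_;
    return $node unless ref $node;      # leaf: a word or fixed phrase
    my ($label, @children) = @$node;    # the label is not used in this sketch
    return join ' ', map { realise($_) } @children;
}

print realise($tree), ".\n";
```

The labels do nothing here, but they are where agreement rules (number, article choice) could hook in if the tree were built from data rather than written out by hand.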

    It looks like there is an AI dictionary that could be helpful in creating a useful syntax tree. You might be able to design a data structure that works like a subset of this dictionary.

    Look at Damian Conway's wonderful Coy module, a haiku generator designed to create human-friendly messages. Although you aren't trying to generate haiku, you may be able to use similar programming techniques.

    I think it is justified to attempt to solve such large, open problems, even if the chance of a breakthrough is small. It is a great learning experience, and you just might make a real contribution. You can also think about potential technology advances to look for in the future, so that you might be the first person to apply a newly-available technique to a difficult problem.

    It should work perfectly the first time! - toma

      Thanks for the links. Especially coy. delightful.

      I agree completely with your quixotic implication: my efforts in this direction so far have proceeded on two fronts: recursive sentence-building routines based on Chomskyan deep structure rules, all of which have failed terribly, and really simple special case routines like the one above, which makes a perfectly readable sentence in a very dull way.

      My background is philosophy of language rather than straight linguistics, but i've got enough of a grasp to see the scale of the problem. I don't think it's completely hopeless, as long as it's properly constrained. The content-management systems that i'm trying to make more articulate are a good place to start: they have a very limited world, their utterances fall into a few well-defined categories, they're almost always declarative, and the goal is transparency, not lyricism.

      to start with, i'd like to identify a small set of phrases that people use all the time. that's why i used the results example above, which everyone must need at some point. Other examples might include pagination links, error messages and confirmation questions, but i'm hoping people will make suggestions. Then i'd like to implement that limited set in as general a way as possible, and take it from there.
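One possible shape for such a limited set, sketched under assumptions: a table with one small routine per category of utterance. The category names, the wording and the `say_it` wrapper are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical phrase table: one small routine per message category.
my %PHRASE = (
    results => sub {
        my ($n) = @_;
        return 'We found nothing.'  if $n == 0;
        return 'We found one item.' if $n == 1;
        return "We found $n items.";
    },
    pagination => sub {
        my ($page, $pages) = @_;
        return $pages == 1 ? 'This is the only page.'
                           : "This is page $page of $pages.";
    },
    confirm => sub {
        my ($action, $thing) = @_;
        return "Are you sure you want to $action $thing?";
    },
);

# Look up a category and build its sentence from the remaining arguments.
sub say_it {
    my ($kind, @args) = @_;
    my $builder = $PHRASE{$kind} or die "No phrase for '$kind'";
    return $builder->(@args);
}

print say_it(results    => 27), "\n";
print say_it(pagination => 2, 5), "\n";
print say_it(confirm    => 'delete', 'this case study'), "\n";
```

Keeping each category behind one entry point means the special cases stay out of the templates, and a category can later be swapped for something more general without touching its callers.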

      So it's a fairly limited ambition, really. The fact that i'm using it as a spur both to learn OOP properly and finally read the algorithm book should give you some idea of how long it's likely to take :(

      updated: silly typo

Re: natural language sentence construction
by John M. Dlugosz (Monsignor) on Jun 18, 2001 at 05:26 UTC
    I hated my COMMAND.COM shell for saying something like "42 file(s)", and patched my copy to just say "files". Would not be hard to conditionally print "file" as needed! So as for constructing proper language in a User Interface, great job!


Re: natural language sentence construction
by mattr (Curate) on Jun 18, 2001 at 12:35 UTC
    I think you will have more success if you aim at clearly defining a limited domain and a single use-case, especially if you stick to text report generation and stay out of interactive feedback, which requires language parsing. You don't need parsing at all if you just want to be able to express system status in English.

    Besides defining nouns for system objects and working out the kinds of messages that can be provided, you also need to work on prioritizing what gets told to the user so they don't end up with ten pages of unimportant information when there are one or two really important things that need to be told (emailed) to them. More difficult heuristics are just that, way more difficult.. but if you can provide even a small very minimal AI engine it would be quite useful to people.
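The simplest form of that prioritising can be sketched as a sort and a cut-off. The message records, the numeric `priority` field and the `top_messages` name are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical message records: a higher priority means more important.
my @messages = (
    { priority => 1, text => 'The cache was refreshed.' },
    { priority => 9, text => 'The disk is almost full.' },
    { priority => 5, text => 'Three imports were skipped.' },
    { priority => 8, text => 'The backup did not run last night.' },
);

# Keep only the $limit most important messages, most important first.
sub top_messages {
    my ($limit, @all) = @_;
    my @sorted = sort { $b->{priority} <=> $a->{priority} } @all;
    $#sorted = $limit - 1 if @sorted > $limit;   # truncate the tail
    return map { $_->{text} } @sorted;
}

print "$_\n" for top_messages(2, @messages);
```

Anything smarter than a fixed number per message is where the harder heuristics begin.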

    Perhaps AI::Fuzzy would be useful in figuring out the best way to express values which are not clearly defined.

    You may know about this from CPAN:
    Lingua::Wordnet - Perl extension for accessing and manipulating Wordnet databases.

    As far as parsing goes, there are also some projects like these from
    - a natural language parser
    - another parser
    - Linguistic Data Corp has released open source code

    But maybe what you really need is CLIPS, an expert system used by NASA and other government institutions. You can get the source code.. a Perl front end to this for even a limited problem space would be very cool! (Hint, Hint..)

    "CLIPS is a productive development and delivery expert system tool which provides a complete environment for the construction of rule and/or object based expert systems... CLIPS provides a cohesive tool for handling a wide variety of knowledge with support for three different programming paradigms: rule-based, object-oriented and procedural. Rule-based programming allows knowledge to be represented as heuristics, or "rules of thumb," which specify a set of actions to be performed for a given situation. Object-oriented programming allows complex systems to be modeled as modular components (which can be easily reused to model other systems or to create new components). The procedural programming capabilities provided by CLIPS are similar to capabilities found in languages such as C, Pascal, Ada, and LISP."

    Edit 2001-06-18 13:08 ar0n Made the links clickable
