darksym has asked for the wisdom of the Perl Monks concerning the following question:

Hi, monks, yeanling & bantling:

After looking over CPAN and doing some WAITing, I'm wondering if I missed out on the big blinking sign that read "English stuff is here!", or something. Is there a unified module that offers a wide variety of English primitives and transforms for Natural Language Processing? For instance, is there something like Text::English, but more extensive? If not, I may be interested in adding my code to somewhere appropriate in the tree as a starting point. So far no word from the author of Text::English.

I'm doing a small contract that requires some auto-correlation and such...

Text::English::stem has been invaluable. Thanks Martin Porter, implementors, and others! I've also been thinking of taking advantage of some of the lists at to hammer out some facilities for future English nightmares.

On an unrelated note, did you know that only a few special places on the web contain the following word sequence, according to Google: "Bring King Ling ring Bing Ding Sing spring swing"? (Wow: The Phonosemantics of Nasal-Stop Clusters, and other music hits.) Can you think of the longest m/[a-z]+ing/ match (where length > 5) that will presumably trip up Porter's stemmer? The common thread is that in the ugly duckling word the -ing is not a removable suffix: it is supposed to cling to the word, unlike the -ing in 'spelling'.
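To make the -ing puzzle concrete, here is a rough sketch of the check I have in mind. It is not Porter's algorithm, only an approximation of the "strip -ing when the stem contains a vowel" rule, and the sub name non_suffix_ing, the tiny %lexicon, and the sample words are all made up for illustration. A real run would need a proper wordlist.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the -ing words (length > 5) whose -ing is probably NOT a suffix,
# longest first. $lexicon is a hypothetical hash ref of known base words.
sub non_suffix_ing {
    my ($lexicon, @words) = @_;
    my @hits;
    for my $word (@words) {
        next unless length($word) > 5 && $word =~ /^([a-z]+)ing$/;
        my $stem = $1;
        # Rough stand-in for Porter's condition: only strip after a vowel.
        next unless $stem =~ /[aeiouy]/;
        # Undo a doubled final consonant: "runn" -> "run".
        (my $undoubled = $stem) =~ s/(.)\1$/$1/;
        # If the stem, stem + e, or the undoubled stem is a known word,
        # the -ing is a real suffix and the word is not interesting here.
        next if grep { $lexicon->{$_} } $stem, "${stem}e", $undoubled;
        push @hits, $word;
    }
    return sort { length($b) <=> length($a) || $a cmp $b } @hits;
}

# Toy data, assumed for the example only.
my %lexicon = map { $_ => 1 } qw(spell cling make run);
my @words   = qw(spelling clinging making running
                 duckling dumpling darling spring bring);

print join(' ', non_suffix_ing(\%lexicon, @words)), "\n";
```

With the toy data above, 'spelling', 'clinging', 'making' and 'running' are dropped because a base word is found, 'spring' is dropped because 'spr' has no vowel, and 'bring' is too short. What is left is the list of candidates that a suffix stripper would mangle.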

Please help me find wordlists that detail English word relationships, or other cool language algorithms (I'm no linguist). Thanks, my darlings... (And don't go flinging your dumplings at the poor cageling! =] )

P.S. See: Martin's Official PorterStemmer page, for more info on stemmers.

Replies are listed 'Best First'.
Re: Status of English modules...
by cjf (Parson) on Mar 30, 2002 at 07:44 UTC

    A search for Lingua on CPAN turns up a lot of language-related modules.

    The module that led me to do the search a while ago was TheDamian's Lingua::EN::Inflect, which looks really neat. The various language modules are quickly becoming some of my favorites on CPAN.

      Oh yeah.. Lingua::EN.. well, now I feel dumb; maybe I wasn't searching for specifics. Still, reading about these modules, I'm inspired to help out. I had a few ideas as far as word comparison goes. I'm not academically qualified as a linguist, but there seem to be two different types of word roots: ones that are in our common lexicon and ones that go deeper into the etymological roots of the language.

      Now, the problem is how to find the roots without going too deep, and how to distinguish between combining forms and regular suffixes/prefixes, since it isn't an exact science (at least not in its simplest form)... quasi- is a combining form, not a prefix; mis- and anti- are prefixes. Then there are noun and verb combining forms vs. suffixes.

      What a mess, who designed this language anyway? Gee, thanks a lot human history.. for providing the clean cut and well thought out natural languages that we have today. Talk about taking a legacy to the extreme! Sheesh. :)
Re: Status of English modules...
by ejf (Hermit) on Mar 30, 2002 at 16:29 UTC

    Please have a look at WordNet and the various Perl modules for using it (WordNet::QueryData, Lingua::Wordnet::Analysis, ...). There are LOTS of relationships in there, and you can do some pretty cool things with it. It's also a great resource if English isn't your native language :)