Comment by gods on Feb 11, 2000 at 00:06 UTC
As you can tell from my home node, I've taken an interest in this kind of thing. Most of my work is in phonology, so you'll have to excuse me if things point in that direction. I'm sure you've already thought of most of this, but I wanted to lay it out.
The main thing that occurs to me is the danger of assuming 1) that the root is the longest element of a word, and 2) that there is only one root and one affix.
There are plenty of languages (e.g. Basque, Russian, probably even English, though I can't think of examples) that have non-root morphemes with more sounds than the root. I'll find some examples later when I have all my dictionaries around me :)
There are plenty of languages (every one that I can think of) that allow compounding of roots, and much prepending/appending of affixes. Basque in particular allows many morphemes to be attached to a given word (I think it can get up to 6).
I hope you don't have to deal with this, but you may have to consider circumfixes (one single morpheme that has parts before and after the root, like German past tenses e.g. 'ge-mach-t') and infixes (morphemes inserted into the root, the only example I can think of being the old Fish Called Wanda 'unbe-f**n-lievable').
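To make the circumfix case concrete, here is a minimal sketch in Perl, assuming the German 'ge-...-t' pattern is treated as one morpheme wrapped around the root. The function name `strip_ge_t` is made up for illustration. It is deliberately naive: it would also "strip" a word like 'geht', where 'ge' and 't' are not a circumfix at all, which is exactly why this kind of thing needs more than a single pattern.

```perl
use strict;
use warnings;

# Naive sketch: treat ge-...-t as a single circumfix and return the
# material between the two halves, or undef if the word does not fit.
# (strip_ge_t is a hypothetical helper, not part of any library.)
sub strip_ge_t {
    my ($word) = @_;
    return $word =~ /\Age(.+)t\z/ ? $1 : undef;
}

print strip_ge_t('gemacht'), "\n";   # prints "mach"
```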
This leads me to encourage supplying the engine with a many-to-many set of words. Use the same root with different affixes, but also use the same affixes with different roots.
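As a rough sketch of what a many-to-many input set could look like, the snippet below pairs every root with every suffix, so each root shows up with several affixes and each affix with several roots. The word lists and the name `cross_product` are assumptions chosen for illustration; real data would come from a corpus and would not be this regular.

```perl
use strict;
use warnings;

# Hypothetical toy data, picked so that plain concatenation gives real words.
my @roots    = qw(walk talk jump);
my @suffixes = ('', 's', 'ed', 'ing');

# Build every root x suffix pairing.
sub cross_product {
    my ($roots, $suffixes) = @_;
    my @words;
    for my $root (@$roots) {
        for my $suffix (@$suffixes) {
            push @words, $root . $suffix;
        }
    }
    return @words;
}

my @training = cross_product(\@roots, \@suffixes);
print join(' ', @training), "\n";
```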
Of course, your problem set is probably reduced to a single family of languages, so maybe you won't have to take all this into consideration, but these are the sorts of questions I had immediately.
This is definitely a well-studied problem, and though I think it can be solved for small situations and small data sets with relative ease, I'd encourage research into what Carnegie Mellon, the University of Edinburgh, and the University of Texas have done in this direction.
Finally, if you're going to work with English, you'll need to write everything phonetically. For example, the silent 'e' that gets deleted when adding a suffix that begins with a vowel may become a problem ('believ-able'). Once again, if you've done linguistics for ten minutes you know what a chore anything in English is.
Hope this isn't too much. Good luck with this. I'd like to hear more about it if you get some good stuff working.
update: Turkish! Cool! IPA is definitely the way to go, but the problem is: which IPA? Can you get the stuff to work in Unicode? If you can, you can do all sorts of normal pattern matching (regex) using Perl 5.6. If you only use SIL, I'm sure there's still a way to do it, but it may be more difficult. That's one of the principal things I'm working on (a bridge between SIL and Unicode), but I haven't quite finished it yet.
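For a sense of what Unicode pattern matching over IPA might look like, here is a small sketch. It assumes a reasonably recent Perl (the output layer used here is not a Perl 5.6 feature), and both the transcription of 'believe' and the vowel inventory in the character class are illustrative assumptions, not a complete IPA vowel set. Code points are written as escapes so the source file itself stays plain ASCII.

```perl
use strict;
use warnings;

# One possible IPA transcription of "believe" (an assumption):
# b, U+026A small capital I, U+02C8 primary stress, l, i, U+02D0 length mark, v
my $ipa = "b\x{026A}\x{02C8}li\x{02D0}v";

# Count characters that fall in a small, hand-picked set of vowel symbols.
sub count_vowels {
    my ($s) = @_;
    my $n = () = $s =~ /[aeiou\x{026A}\x{0259}\x{025B}\x{028A}\x{0254}\x{00E6}]/g;
    return $n;
}

binmode(STDOUT, ':encoding(UTF-8)');   # emit wide characters as UTF-8
print "$ipa has ", count_vowels($ipa), " vowel symbols\n";
```

Stress and length marks are separate code points here, so they are simply skipped by the character class rather than needing special handling.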
If I ever end up getting all my stuff done, we may be able to correspond on some of this stuff. Hope I wasn't overly cautionary there.