justinNEE has asked for the wisdom of the Perl Monks concerning the following question:
Data:
    baSlar,heads
    BaSlarimiz,our heads
    baSimda,in my head

Would return something like:
    baS,head
    -lar,inflectional:plural
    -imiz,our -
    -imda,in my -

(or instead of 'in my -' it would return a description.) These observations may not be true for the language, but they are true for the data that we have. When rules contradict each other, the program might look at the data more closely to see if the rule is more complex, or it might decide that since the rule occurs only once out of x times, it is an exception, or that since two rules occur 50% of the time each, they are both acceptable.

The word lists would generally be around 100-200 entries... I'll try to get a bigger sample to play with tomorrow. I read the article in TPJ #17 and while it was interesting, I still don't know where to start...
Replies are listed 'Best First'.
Re: Perl and Morphology
by ChOas (Curate) on Mar 14, 2002 at 11:27 UTC
Think this might be a start? Only tested on your data: ;))
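The code from this post is not shown here. Purely as an illustration (this is not ChOas's code), a FindRoot-style sub could be sketched as below, assuming the root is simply the longest prefix shared by every word. The name `find_root` and the lower-casing of the first letter are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: longest prefix shared by every word in the list.
sub find_root {
    my @words = @_;
    return '' unless @words;
    my $root = shift @words;
    for my $word (@words) {
        # Shorten the candidate until it is a prefix of this word.
        chop $root while length($root) && index($word, $root) != 0;
    }
    return $root;
}

# Assumption: the capital B in 'BaSlarimiz' is incidental, so only the first
# letter is lower-cased ('S' stays distinct from 's' in this transliteration).
my @words = map { lcfirst($_) } qw(baSlar BaSlarimiz baSimda);
print find_root(@words), "\n";
```

A longest-common-prefix root only makes sense while every word in the list shares one root; mixed roots would shrink the result to nothing.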
It's probably not the cleanest code, but it was fun to do, so I thought I'd post it :)))

I figure you can use this as a start to really solve your problem. Once you've got the root, it shouldn't be too hard to take the substring out of the hash and start looking for the next one (using that FindRoot sub), which would give you 'lar', making the next root 'BasLar' ...

Hmmmmm... actually I think I got that to work here... Looking forward to more test-data :))

GreetZ!,
print "profeth still\n" if /bird|devil/;
Re: Perl and Morphology
by ViceRaid (Chaplain) on Mar 14, 2002 at 17:57 UTC
update: Realised this looks long. It is quite long. The poster above has given you a very fine answer, but it's intended for quite a limited set of cases, all from the same root. I've tried to go a little bit deeper to identify smaller morphological elements in the words, abstractly, hence this is wordy.

++ interesting question. I've been playing around with it on the sly for an hour or two, but a project manager keeps coming over and asking me why his website's feature boxes are still broken, so I haven't got a complete answer for you, just some suggestions, which might be helpful or otherwise.

To start with, I think you might need to give your programme a few more hints to try and get it to analyse your data. At the moment, you're giving it a bare English translation, and expecting it to be able to identify grammatical elements that might correspond to morphological elements in the original. Instead, you might make it a lot easier if you pre-analyse each instance into the grammatical parts that make it up. What I'm thinking you might end up with is a data structure that looks like this:
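A minimal sketch of such a structure, assuming a hash keyed by word. The key names `gloss` and `features`, and the feature labels themselves, are invented for this example and are not taken from the original post.

```perl
use strict;
use warnings;

# Hypothetical pre-analysed word list: each entry pairs a word with its
# English gloss and the grammatical features that the gloss implies.
my %analysis = (
    baSlar     => { gloss => 'heads',      features => ['plural'] },
    BaSlarimiz => { gloss => 'our heads',  features => ['plural', 'possessor:1pl'] },
    baSimda    => { gloss => 'in my head', features => ['possessor:1sg', 'locative'] },
);

for my $word (sort keys %analysis) {
    my $entry = $analysis{$word};
    printf "%-11s %-11s %s\n",
        $word, $entry->{gloss}, join(', ', @{ $entry->{features} });
}
```

Listing the features in a fixed order is a convenience here; nothing in the data guarantees that the morphemes appear in that order.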
You could identify any number of syntactic features in a given word this way: verbal moods or aspects, nominal cases, numbers or genders. I'm not sure whether each data set you're working on will come from the same root or not, but I'll assume they do (it's not a huge problem if they don't, though**). Then, extract the root using something like the subroutine supplied by the previous poster, or whatever:
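As a rough sketch of that step, assuming the root 'baS' is already known: the helper name `strip_root` is hypothetical, and lower-casing only the first letter is an assumption made so that 'BaSlarimiz' lines up with the other two words.

```perl
use strict;
use warnings;

# Hypothetical helper: strip a known root from the front of each word and
# return a hash reference of word => remaining suffix string.
sub strip_root {
    my ($root, @words) = @_;
    my %suffix_for;
    for my $word (@words) {
        # Assumption: a capitalised first letter is incidental; 'S' elsewhere
        # stays distinct from 's' in this transliteration.
        my $rest = lcfirst $word;
        # Words that do not start with the root are skipped.
        $suffix_for{$word} = $rest if $rest =~ s/^\Q$root\E//;
    }
    return \%suffix_for;
}

my $suffix_for = strip_root('baS', qw(baSlar BaSlarimiz baSimda));
print "$_ => -$suffix_for->{$_}\n" for sort keys %$suffix_for;
```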
This will give you a set of strings that are groups of morphemes (the words without the roots). For each of these strings, you know it's got to contain a set number of individual morphemes representing the grammatical features. Eg.
Assuming each of the grammatical elements is represented by a non-null string morpheme, there's a limited number of ways that 'imda' can indicate 'in my head'. You could generate all these permutations (this is where I got hassled and had to stop coding ... so this is broken:)
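One way to enumerate the segmentations, sketched under the assumption stated above that every morpheme is a non-empty string: the sub name `splits` is hypothetical. It returns every way of cutting a string into a given number of consecutive pieces.

```perl
use strict;
use warnings;

# Hypothetical helper: all ways to cut $string into $parts non-empty,
# consecutive pieces. Each result is an array reference of pieces.
sub splits {
    my ($string, $parts) = @_;
    return () if $parts < 1;
    if ($parts == 1) {
        return length($string) ? ([$string]) : ();
    }
    my @results;
    # Leave at least one character for each of the remaining pieces.
    for my $len (1 .. length($string) - ($parts - 1)) {
        my $head = substr($string, 0, $len);
        for my $tail (splits(substr($string, $len), $parts - 1)) {
            push @results, [ $head, @$tail ];
        }
    }
    return @results;
}

print join('|', @$_), "\n" for splits('imda', 2);
```

For 'imda' and two pieces this yields i|mda, im|da and imd|a. The number of segmentations grows quickly with word length, which is one reason the 100-200 entry lists mentioned in the question are a comfortable size for this approach.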
And then you'd have a set of guesses at the ways in which the suffix could be representing the grammatical form:
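A hedged sketch of what that set of guesses might look like for the two-feature case 'imda': each cut point is paired with each ordering of the features. The feature labels are assumptions, not part of the original data.

```perl
use strict;
use warnings;

my $suffix   = 'imda';
my @features = ('possessor:1sg', 'locative');   # labels are assumptions

# Pair every two-way cut of the suffix with both orderings of the features.
my @guesses;
for my $cut (1 .. length($suffix) - 1) {
    my @pieces = (substr($suffix, 0, $cut), substr($suffix, $cut));
    for my $order ([0, 1], [1, 0]) {
        my %guess;
        @guess{ @features[@$order] } = @pieces;
        push @guesses, \%guess;
    }
}

for my $guess (@guesses) {
    print join(', ', map { "$_ = -$guess->{$_}" } sort keys %$guess), "\n";
}
```

Three cut points and two orderings give six guesses for this one word; it is the comparison with other words that narrows them down.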
You should then be able to cross-reference all the different cases that you have for 'locative' or 'plural' or 'possessed by me/us', and see which permutations are true for all the different cases.

Of course, this is a slightly 'brute force' method of approaching this problem, and the results are still likely to need some interpretation; however, it could save a lot of manual guessing. Having some knowledge of the phonemics of the language, or knowing one or two of the morphemes in advance, is likely to make it A LOT easier.

Of course, all this assumes that your morphemes are all suffixes, not prefixes, and that there isn't anything tricksy like sandhi taking place between suffixes. But it might be a start for you.

Have fun
/=\

**update: it occurs to me that it doesn't really matter if you pre-strip the root at all. Instead, you could skip that step altogether, and just identify the root as another grammatical element, eg:
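A rough sketch of that idea combined with the cross-referencing step, under two loudly stated assumptions: the features are listed in the same order as their morphemes, and 'BaSlarimiz' has been normalised to 'baSlarimiz'. The names `%features_for` and `candidates`, and the feature labels, are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical data: the root is listed as just another feature.
my %features_for = (
    baSlar     => ['root:head', 'plural'],
    baSlarimiz => ['root:head', 'plural', 'possessor:1pl'],
    baSimda    => ['root:head', 'possessor:1sg', 'locative'],
);

# Every substring of $word that could be piece number $i (0-based) when the
# word is cut into $k non-empty, consecutive pieces.
sub candidates {
    my ($word, $k, $i) = @_;
    my $len = length $word;
    my %seen;
    my @starts = $i == 0 ? (0) : ($i .. $len - ($k - $i));
    for my $s (@starts) {
        my @ends = $i == $k - 1 ? ($len) : ($s + 1 .. $len - ($k - 1 - $i));
        $seen{ substr($word, $s, $_ - $s) } = 1 for @ends;
    }
    return keys %seen;
}

# Count, per feature, how many words offer each candidate morpheme.
my (%count, %words_with);
for my $word (keys %features_for) {
    my @features = @{ $features_for{$word} };
    for my $i (0 .. $#features) {
        $words_with{ $features[$i] }++;
        $count{ $features[$i] }{$_}++ for candidates($word, scalar @features, $i);
    }
}

# Keep only the candidates offered by every word that shows the feature.
my %shared;
for my $feature (sort keys %count) {
    my @keep = grep { $count{$feature}{$_} == $words_with{$feature} }
               keys %{ $count{$feature} };
    $shared{$feature} = [ sort @keep ];
    print "$feature: @{ $shared{$feature} }\n";
}
```

With only these three words the candidate sets stay wide (the root narrows to b, ba or baS, for instance), so the output still needs interpreting; a larger word list should prune them further.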
Re: Perl and Morphology
by ronald (Beadle) on Mar 14, 2002 at 22:39 UTC
You don't say whether your goal is to learn about parsing or simply to be able to parse Turkish words. If the latter, you can save yourself a lot of time and effort by using the parser available at: http://www.nlp.cs.bilkent.edu.tr/cgi-bin/tmanew You get results like the following for 'baSlar':
You can use Perl to submit words for parsing and then map the results onto English. You'll also need to preprocess the words to apply some phonological rules, like vowel harmony. For example, you won't get any results from 'baSimda' and have to submit it as 'baSImda' instead. You could apply the harmony rules with s///, though if you apply harmony to all words, it will apply incorrectly to disharmonic roots and non-harmonizing suffixes. It's pretty hard to avoid that problem without first having the morphological parse!

ronald
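A minimal sketch of the s/// approach to the 'baSimda' case, assuming a transliteration in which 'I' is the back counterpart of 'i', the letters a, I, o, u are back vowels, and e, i, O, U are front vowels. The sub name `harmonize_i` is hypothetical, and the rule is applied blindly, so it has exactly the weakness with disharmonic roots and non-harmonizing suffixes described above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn 'i' into 'I' whenever the nearest preceding
# vowel is a back vowel (a, I, o, u in the assumed transliteration).
sub harmonize_i {
    my ($word) = @_;
    # Repeat so that each newly written 'I' can trigger the next change.
    1 while $word =~ s/([aIou][^aeiIoOuU]*)i/$1I/;
    return $word;
}

print harmonize_i('baSimda'), "\n";   # baSImda
```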
Re: Perl and Morphology
by Maestro_007 (Hermit) on Mar 14, 2002 at 20:49 UTC
As you can tell from my home node, I've taken an interest in this kind of thing. Most of my work is in phonology, so you'll have to excuse me if things point in that direction. I'm sure you've already thought of most of this, but I wanted to lay it out.

Some considerations:

The main thing that occurs to me is the danger of assuming 1) that the root is the longest element of a word, and 2) that there is only one root and one affix.

There are plenty of languages (e.g. Basque, Russian, probably even English, though I can't think of examples) that have morphemes with more sounds than the root. I'll find some examples later when I have all my dictionaries around me :)

There are plenty of languages (every one that I can think of) that allow compounding of roots, and much prepending/appending of affixes. Basque in particular allows many morphemes to be attached to a given word (I think it can get up to 6).

I hope you don't have to deal with this, but you may have to consider circumfixes (one single morpheme that has parts before and after the root, like German past participles, e.g. 'ge-mach-t') and infixes (morphemes inserted into the root, the only example I can think of being the old Fish Called Wanda 'unbe-f**n-lievable').

This leads me to encourage supplying the engine with a many-to-many set of words. Use the same root with different affixes, but also use the same affixes with different roots.

Of course, your problem set is probably reduced to a single family of languages, so maybe you won't have to take all this into consideration, but these are the sorts of questions I had immediately. This is definitely a very studied problem, and though I think it can be solved for small situations and small data sets with relative ease, I'd encourage research into what Carnegie-Mellon, the University of Edinburgh, and the University of Texas have done in this direction.

Finally, if you're going to work with English, you'll need to write everything phonetically. For example, the silent 'e' that gets deleted when adding a suffix that begins with a vowel may become a problem ('believ-able'). Once again, if you've done linguistics for ten minutes you know what a chore anything in English is.

Hope this isn't too much. Good luck with this. I'd like to hear more about it if you get some good stuff working.

MM

update: Turkish! Cool! IPA is definitely the way to go, but the problem is: which IPA? Can you get the stuff to work in Unicode? If you can, you can do all sorts of normal pattern matching (regex) using Perl 5.6. If you only use Sil, I'm sure there's still a way to do it, but it may be more difficult. That's one of the principal things I'm working on (a bridge between Sil and Unicode), but I haven't quite done it yet. If I ever end up getting all my stuff done, we may be able to correspond on some of this stuff. Hope I wasn't overly cautionary there.
by Anonymous Monk on Sep 12, 2007 at 12:41 UTC
Re: Perl and Morphology
by justinNEE (Monk) on Mar 14, 2002 at 20:37 UTC
update: Justin still doesn't know how to use perlmonks.org... anyway, how am I supposed to pass around this file? I don't suppose this is the right way: encoded text file