As you can tell from my home node, I've taken an interest in this kind of thing. Most of my work is in phonology, so you'll have to excuse me if things point in that direction. I'm sure you've already thought of most of this, but I wanted to lay it out.

Some considerations:

The main danger I see is in assuming 1) that the root is the longest element of a word, and 2) that there is only one root and one affix.

There are plenty of languages (e.g. Basque, Russian, probably even English, though I can't think of examples) that have affixes with more sounds than the root they attach to. I'll find some examples later when I have all my dictionaries around me :)
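To make the first danger concrete, here is a minimal sketch of a "longest piece is the root" heuristic run over three hand-segmented English words. It is in Python just to keep it short, and nothing in it is language-specific. The segmentations are my own and are in spelling rather than sounds, and the names `SEGMENTED`, `guess_root_index` and `heuristic_failures` are made up for the illustration.

```python
# Toy data: (morphemes, index of the real root).
# Segmentations are illustrative and orthographic, not phonetic.
SEGMENTED = [
    (["un", "believ", "able"], 1),
    (["do", "able"], 0),
    (["go", "ing"], 0),
]

def guess_root_index(morphemes):
    """Naive heuristic: assume the longest morpheme is the root."""
    return max(range(len(morphemes)), key=lambda i: len(morphemes[i]))

def heuristic_failures(data):
    """Return the words for which the heuristic picks something other than the root."""
    return ["".join(parts) for parts, root in data
            if guess_root_index(parts) != root]

print(heuristic_failures(SEGMENTED))  # ['doable', 'going']
```

Short roots with ordinary suffixes are enough to break the assumption, even before getting to Basque or Russian.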

There are plenty of languages (every one that I can think of) that allow compounding of roots, and much prepending/appending of affixes. Basque in particular allows many morphemes to be attached to a given word (I think it can get up to 6).

I hope you don't have to deal with this, but you may have to consider circumfixes (one single morpheme that has parts before and after the root, like German past participles, e.g. 'ge-mach-t') and infixes (morphemes inserted into the root; the only example I can think of is 'unbe-f**n-lievable' from A Fish Called Wanda).
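A rough sketch of what handling these two cases might look like with plain string operations (Python here, but a regex in Perl would do the same job). The helper names `strip_circumfix` and `strip_infix` are hypothetical, and the infix example uses the milder 'abso-bloody-lutely' in place of the one above.

```python
def strip_circumfix(word, before, after):
    """Return the stem if word == before + stem + after (stem non-empty), else None."""
    if (word.startswith(before) and word.endswith(after)
            and len(word) > len(before) + len(after)):
        return word[len(before):len(word) - len(after)]
    return None

def strip_infix(word, infix):
    """Return the word with one word-internal occurrence of infix removed, else None."""
    pos = word.find(infix, 1)  # start at 1: an infix cannot be word-initial
    if pos == -1 or pos + len(infix) >= len(word):
        return None            # not found, or it sits at the very end
    return word[:pos] + word[pos + len(infix):]

print(strip_circumfix("gemacht", "ge", "t"))      # mach
print(strip_circumfix("machen", "ge", "t"))       # None
print(strip_infix("absobloodylutely", "bloody"))  # absolutely

# Caveat: a pure string match overgenerates. 'geht' is 'geh' + 't',
# not a circumfixed form, yet it still matches the pattern.
print(strip_circumfix("geht", "ge", "t"))         # h
```

The last line is the reason I'd treat the two halves of a circumfix as one unit in the data rather than hoping a prefix pass and a suffix pass will find it on their own.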

This leads me to encourage supplying the engine with a many-to-many set of words. Use the same root with different affixes, but also use the same affixes with different roots.
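Here is a small sketch of why the many-to-many set matters. The "support" score is a toy heuristic of my own, not an established algorithm: a split into stem + suffix is scored by how many different suffixes the stem is seen with and how many different stems the suffix is seen with, taking the smaller of the two. The word lists and function names are invented for the illustration, and it only looks at suffixes.

```python
from collections import defaultdict

def build_index(words):
    """Record every stem/suffix split of every word (the suffix may be empty)."""
    stems_of, suffixes_of = defaultdict(set), defaultdict(set)
    for word in words:
        for cut in range(1, len(word) + 1):
            stem, suffix = word[:cut], word[cut:]
            stems_of[suffix].add(stem)
            suffixes_of[stem].add(suffix)
    return stems_of, suffixes_of

def split_support(word, words):
    """Score each split of word by how widely its stem and suffix recur."""
    stems_of, suffixes_of = build_index(words)
    scores = {}
    for cut in range(1, len(word) + 1):
        stem, suffix = word[:cut], word[cut:]
        scores[(stem, suffix)] = min(len(stems_of[suffix]), len(suffixes_of[stem]))
    return scores

def best_split(word, words):
    """Return the highest-scoring split (first one wins on ties)."""
    scores = split_support(word, words)
    return max(scores, key=scores.get)

ROOTS = ["walk", "talk", "jump"]
SUFFIXES = ["", "s", "ed", "ing"]
GRID = {root + suffix for root in ROOTS for suffix in SUFFIXES}  # many-to-many
ONE_ROOT = {"walk" + suffix for suffix in SUFFIXES}              # one root only

print(best_split("walking", GRID))  # ('walk', 'ing')
print(best_split("jumped", GRID))   # ('jump', 'ed')
print(set(split_support("walking", ONE_ROOT).values()))  # {1}
```

With the full grid the right split stands out. With a single root every split of 'walking' scores the same, so the engine has nothing to prefer 'walk' + 'ing' over 'wal' + 'king'.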

Of course, your problem set is probably reduced to a single family of languages, so maybe you won't have to take all this into consideration, but these are the sorts of questions I had immediately.

This is definitely a well-studied problem, and though I think it can be solved for small situations and small data sets with relative ease, I'd encourage research into what Carnegie Mellon, the University of Edinburgh, and the University of Texas have done in this direction.

Finally, if you're going to work with English, you'll need to write everything phonetically. For example, the silent 'e' that gets deleted when adding a suffix that begins with a vowel may become a problem ('believ-able'): in spelling, the root no longer appears intact inside the word. Once again, if you've done linguistics for ten minutes you know what a chore anything in English is.
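A quick sketch of the spelling problem, again in Python for brevity. The `attach` function is a deliberately crude e-deletion rule that I made up for the illustration; it is not a real description of English spelling, and the last line shows where it goes wrong.

```python
VOWELS = "aeiou"

def spelled_root_is_prefix(word, root):
    """Naive check: does the spelled root appear intact at the start of the word?"""
    return word.startswith(root)

def attach(root, suffix):
    """Crude rule: drop a final 'e' before a vowel-initial suffix."""
    if root.endswith("e") and suffix[:1] in VOWELS:
        root = root[:-1]
    return root + suffix

print(spelled_root_is_prefix("believes", "believe"))    # True
print(spelled_root_is_prefix("believable", "believe"))  # False: the 'e' is gone

print(attach("believe", "able"))  # believable
print(attach("hope", "ful"))      # hopeful
print(attach("change", "able"))   # changable (wrong: the real spelling keeps the 'e')
```

Patching the rule for 'changeable' and friends is exactly the kind of chore I mean, and a phonetic transcription sidesteps it.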

Hope this isn't too much. Good luck with this. I'd like to hear more about it if you get some good stuff working.

MM

update: Turkish! Cool! IPA is definitely the way to go, but the problem is: which IPA? Can you get the stuff to work in Unicode? If you can, you can do all sorts of normal pattern matching (regex) using Perl 5.6. If you only use SIL, I'm sure there's still a way to do it, but it may be more difficult. That's one of the principal things I'm working on (a bridge between SIL and Unicode), but I haven't quite finished it yet.
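To show what ordinary pattern matching over Unicode IPA buys you, here is a small sketch in Python (its `re` module works on Unicode strings; I'm not making any claim here about what a particular Perl version supports). The vowel inventory and the test strings are made up for the illustration and are not real transcriptions. The one real wrinkle it shows is that a diacritic can be stored either precomposed or as a separate combining character, so it pays to normalize before matching.

```python
import re
import unicodedata

IPA_VOWELS = "aeiouy\u026f\u00f8"   # a e i o u y ɯ ø (hypothetical inventory)
NASAL = "\u0303"                    # combining tilde

# A vowel, optionally followed by a combining tilde.
VOWEL = re.compile("[" + IPA_VOWELS + "]" + NASAL + "?")

def vowels(transcription):
    """Return the vowels (with any nasal mark) found in a transcription."""
    decomposed = unicodedata.normalize("NFD", transcription)
    return VOWEL.findall(decomposed)

print(vowels("t\u026fk\u00f8l"))      # ['ɯ', 'ø']
print(vowels("p\u00e3n"))             # precomposed ã, found after NFD
print(VOWEL.findall("p\u00e3n"))      # [] : missed without normalization
```

Both spellings of the nasal vowel come out the same after `unicodedata.normalize("NFD", ...)`, which is the property you want before running any affix-hunting regex over the data.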

If I ever get all my own work done, we may be able to correspond on some of this. I hope I wasn't overly cautionary there.


In reply to Re: Perl and Morphology by Maestro_007
in thread Perl and Morphology by justinNEE
