PerlMonks  


Update: I realised this looks long. It is quite long. The poster above has given you a very fine answer, but it's intended for quite a limited set of cases, all from the same root. I've tried to go a little deeper and identify smaller morphological elements in the words, in the abstract, hence the wordiness.

++ interesting question. I've been playing around with it on the sly for an hour or two, but a project manager keeps coming over and asking me why his website's feature boxes are still broken, so I haven't got a complete answer for you, just some suggestions, which might be helpful or otherwise.

To start with, I think you might need to give your programme a few more hints to help it analyse your data. At the moment, you're giving it a bare English translation, and expecting it to identify grammatical elements that might correspond to morphological elements in the original. It would be a lot easier if you pre-analysed each instance into the grammatical parts that make it up. What I'm thinking you might end up with is a data structure that looks like this:

my %analysed = (
    baSlar     => ['plural'],
    baSlarimiz => ['plural', 'possessed by us'],
    baSimda    => ['possessed by me', 'locative'],
);
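As a rough sketch of why this shape is handy (the name %words_with is just something I've made up for illustration), you can turn the structure around to see which words share a feature, which is the grouping you'd want for any later comparison:

```perl
use strict;
use warnings;

my %analysed = (
    baSlar     => ['plural'],
    baSlarimiz => ['plural', 'possessed by us'],
    baSimda    => ['possessed by me', 'locative'],
);

# Invert the structure: feature => list of words carrying it.
# %words_with is a hypothetical name, not part of the original post.
my %words_with;
for my $word (sort keys %analysed) {
    push @{ $words_with{$_} }, $word for @{ $analysed{$word} };
}

print "$_: @{ $words_with{$_} }\n" for sort keys %words_with;
```

Any feature that turns up in more than one word (here only 'plural') is one where comparing the words can tell you something.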

You could identify any number of syntactic features in a given word this way: verbal moods or aspects, nominal cases, numbers or genders. I'm not sure whether each data set you're working on will come from a single root or not, but I'll assume it does (it's not a huge problem if it doesn't, though**). Then, extract the root using something like the subroutine supplied by the previous poster, or whatever:

sub findroot {
    my @words = @_;
    my %stems;
    foreach my $word (@words) {
        # count every prefix of the word, from '' up to the whole word
        $stems{ substr($word, 0, $_) }++ for 0 .. length($word);
    }
    # dump all the possible stems that don't match every word
    delete @stems{ grep { $stems{$_} < @words } keys %stems };
    # return the stem - i.e. the longest common element
    return ( sort { length $b <=> length $a } keys %stems )[0];
}

Stripping the root from each word will give you a set of strings that are groups of morphemes (the words without the roots). For each of these strings, you know it has to contain a set number of individual morphemes representing the grammatical features. E.g.

imda: 'possessed by me', 'locative'
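A minimal sketch of that stripping step, assuming all the words really do share one root at the front. common_prefix is a made-up helper standing in for whatever root-finder you end up using:

```perl
use strict;
use warnings;

# Longest prefix shared by every word. Hypothetical helper: it assumes
# the root is always a prefix, which need not hold for your data.
sub common_prefix {
    my @words  = @_;
    my $prefix = shift @words;
    return '' unless defined $prefix;
    for my $word (@words) {
        # shorten the candidate until this word starts with it
        chop $prefix while length($prefix) && index($word, $prefix) != 0;
    }
    return $prefix;
}

my @words     = qw( baSlar baSlarimiz baSimda );
my $root      = common_prefix(@words);
my %suffix_of = map { $_ => substr($_, length($root)) } @words;

print "root: $root\n";
print "$_ => $suffix_of{$_}\n" for @words;
```

With these three words the root comes out as 'baS', leaving 'lar', 'larimiz' and 'imda' to be analysed.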

Assuming each of the grammatical elements is represented by a non-null string morpheme, there's a limited number of ways that 'imda' can indicate 'in my head'. You could generate all these permutations. (This is where I got hassled and had to stop coding. The nextlength() routine in my first draft didn't work, so the version here enumerates the possible piece lengths recursively instead; treat it as a sketch.)

use strict;
use warnings;
use Algorithm::Permute qw( permute );  # not so fast as other modules, but it compiled OK on cygwin

my @permutations = possibles('imda', 'possessed by me', 'locative');

# For every ordering of the features and every way of cutting the string
# into that many non-empty pieces, record one feature => morpheme guess.
sub possibles {
    my ($string, @items) = @_;
    my @permutations;
    my @splits = all_lengths(length($string), scalar @items);
    permute {
        foreach my $lengths (@splits) {
            my %perm;
            @perm{@items} = getsplit($string, @$lengths);
            push @permutations, \%perm;
        }
    } @items;
    return @permutations;
}

# Cut $string into consecutive pieces of the given lengths.
sub getsplit {
    my ($string, @lengths) = @_;
    my @splits;
    my $offset = 0;
    foreach (@lengths) {
        push @splits, substr($string, $offset, $_);
        $offset += $_;
    }
    return @splits;
}

# Every list of $parts positive lengths that adds up to $total
# (this takes the place of the broken nextlength()).
sub all_lengths {
    my ($total, $parts) = @_;
    return () if $parts < 1 or $total < $parts;
    return [$total] if $parts == 1;
    my @out;
    for my $first (1 .. $total - $parts + 1) {
        push @out, [ $first, @$_ ] for all_lengths($total - $first, $parts - 1);
    }
    return @out;
}

And then you'd have a set of guesses at the ways in which the suffix could be representing the grammatical form:

my @possibles = (
    { 'locative' => 'i',  'possessed by me' => 'mda' },
    { 'locative' => 'im', 'possessed by me' => 'da'  },
    # ........
);

You should then be able to cross-reference all the different cases that you have for 'locative' or 'plural' or 'possessed by me/us', and see which permutations hold true for all of them. Of course, this is a slightly 'brute force' method of approaching the problem, and the results are still likely to need some interpretation; however, it could save a lot of manual guessing. Having some knowledge of the phonemics of the language, or knowing one or two of the morphemes in advance, is likely to make it A LOT easier.
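Here's one rough way the cross-referencing might go, as a sketch only. It's a simplification of the idea above: rather than comparing whole permutations, it keeps, for each feature, just the pieces that turn up as a possible morpheme in every word marked with that feature. The names splits and candidates are my own inventions, and it uses only core Perl so it doesn't depend on Algorithm::Permute:

```perl
use strict;
use warnings;

# Every way of cutting $string into $parts non-empty consecutive pieces.
sub splits {
    my ($string, $parts) = @_;
    return () if $parts < 1 or length($string) < $parts;
    return [$string] if $parts == 1;
    my @out;
    for my $len (1 .. length($string) - $parts + 1) {
        my $head = substr($string, 0, $len);
        push @out, [ $head, @$_ ] for splits(substr($string, $len), $parts - 1);
    }
    return @out;
}

# For each feature, keep only the pieces that are possible in every
# word carrying that feature. Order of morphemes is treated as unknown,
# so any piece of a word may stand for any of that word's features.
sub candidates {
    my (%analysed) = @_;
    my (%seen, %count);
    for my $word (keys %analysed) {
        my @features = @{ $analysed{$word} };
        my %pieces;
        $pieces{$_} = 1 for map { @$_ } splits($word, scalar @features);
        for my $feature (@features) {
            $count{$feature}++;
            $seen{$feature}{$_}++ for keys %pieces;
        }
    }
    my %result;
    for my $feature (keys %seen) {
        my @keep = grep { $seen{$feature}{$_} == $count{$feature} }
                   keys %{ $seen{$feature} };
        $result{$feature} = [ sort @keep ];
    }
    return %result;
}

# The suffixes left once the root 'baS' has been stripped.
my %suffixes = (
    lar     => ['plural'],
    larimiz => ['plural', 'possessed by us'],
    imda    => ['possessed by me', 'locative'],
);

my %guess = candidates(%suffixes);
print "$_: @{ $guess{$_} }\n" for sort keys %guess;
```

On this tiny data set 'plural' narrows down to the single candidate 'lar', because it's the only piece common to both words that carry it. The other three features appear in one word each, so they keep every possible piece, which is the part that still needs interpretation or more data.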

Of course, all this assumes that your morphemes are all suffixes, not prefixes, and that there isn't anything tricksy like sandhi taking place between suffixes. But it might be a start for you.

Have fun

/=\

**Update: it occurs to me that you don't really need to pre-strip the root at all. You could skip that step altogether, and just identify the root as another grammatical element, e.g.:

my %analysed = (
    baSlar     => ['root:head', 'plural'],
    baSlarimiz => ['root:head', 'plural', 'possessed by us'],
    baSimda    => ['root:head', 'possessed by me', 'locative'],
);

In reply to Re: Perl and Morphology by ViceRaid
in thread Perl and Morphology by justinNEE
