... extending to more general linguistic modelling, ideally from a non-language-specific basis that can be adapted to different languages.

That's ambitious... but worth pursuing. The first thing that comes to mind is (hidden) Markov modelling, which has been shown to do a decent job of drawing plausible "morphological" boundaries in a stream of text in any given language. There are Markov modules on CPAN, but whether they are suitable for language analysis is more than I know at present.
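
To make the Markov idea a bit more concrete, here is a rough sketch. It is in Python rather than Perl, and it uses a plain first-order Markov chain over characters, not a hidden Markov model. The function names and the 0.2 threshold are placeholders of my own, not taken from any CPAN module. It simply proposes a boundary wherever the estimated probability of the next character, given the previous one, drops below the threshold. Treat it as an illustration of the idea, not as a tested segmentation method.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count character-to-character transitions in the training text."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts


def transition_prob(counts, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev); 0.0 if prev was never seen."""
    followers = counts.get(prev)
    if not followers:
        return 0.0
    return followers[nxt] / sum(followers.values())


def segment(word, counts, threshold=0.2):
    """Split word wherever the transition into the next character is improbable.

    threshold is an arbitrary illustrative cutoff (an assumption); a real
    system would tune it, or use a proper HMM, on data for the language.
    """
    pieces, start = [], 0
    for i in range(1, len(word)):
        if transition_prob(counts, word[i - 1], word[i]) < threshold:
            pieces.append(word[start:i])
            start = i
    pieces.append(word[start:])
    return pieces
```

Usage would be something like counts = train_bigrams(corpus_text) followed by segment(some_word, counts), where corpus_text is whatever raw text you have for the language in question. Nothing in the sketch is language-specific; only the training text changes.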

I do know that Perl is quite useful for a lot of the "infrastructure" work involved in managing language data; e.g. building and searching a lexicon, locating and displaying/highlighting tokens in a text stream, mapping across character encodings, etc. Of course, a lot of useful tools have already been developed (some in Perl, some in C or C++) -- check the archives at (and/or join) the CORPORA mailing list:

I'm sorry I can't give you any more detailed pointers or advice, but I hope this helps a little.

