
What are the monks doing with Perl and Linguistics?

by allolex (Curate)
on May 05, 2003 at 09:14 UTC ( #255595=perlmeditation )

As many of you know, Perl was invented by someone who was a linguist (among other things). A lot of English-speakers associate the word "linguist" with 'someone who knows a lot of languages', but I'm talking about people involved in the science of language. I am told there are a whole lot of linguists who use Perl, but I've only caught a few supersearch-assisted glimpses of what people out there are doing with Perl and linguistics.

There have been threads on the use of Perl with Morphology, Natural language sentence construction, Term weight, and Artificial Intelligence, for example. There is also Mike Hammond's new book Programming for Linguists: Perl for Language Researchers (PDF sample chapter). The various Lingua modules on CPAN are immensely helpful, and the WordNet stuff comes in handy as well. Of course, Larry Wall has at least one Usenet post on the subject and googling will turn up many more related posts. Some researchers have done quite a lot with Perl. Fiammetta Namer's lemmatizer/morphological tagging tool for French, "FLEMM" comes to mind. And I know there are many more caches of information elsewhere.

But what I'm really wondering is what people at the Monastery are doing with Perl and linguistics. I'm guessing that there are quite a few people out there doing corpus research or working on Natural Language Processing projects, or maybe you've been involved with one in the past. Or maybe you've just heard of one where people are using Perl.

So, what are you all up to? TIA


Re: What are the monks doing with Perl and Linguistics?
by Abigail-II (Bishop) on May 05, 2003 at 09:57 UTC
    You might want to look at what Sean Burke (Torgox on this site) has done over the years.


      Thanks for the tip. =) Sean Burke's homepage is chock full o' stuff.


Re: What are the monks doing with Perl and Linguistics?
by crenz (Priest) on May 05, 2003 at 10:22 UTC

    This is not really science... but your post brought up fond memories: one of the first things I did with Perl was to generate statistics on the distribution of letters, letter pairs, letter triplets, etc. in a given text, and then use those statistics to create random texts that resembled the original. The results were quite good, and rather funny, especially when mixing several source texts (e.g. the KJV Bible and Edgar Allan Poe).

    After a few years, I learned that what I had implemented is a very common algorithm in a lot of different fields: Markov Chains :).
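A minimal core-Perl sketch of the approach described above (my own toy reconstruction, not the original code): count which character follows each two-character context, then walk the table, so frequent sequences in the source come up more often.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy Markov-chain text generator over character pairs. The source
# text here is just a placeholder for whatever corpus you feed in.
my $text = "the quick brown fox jumps over the lazy dog " x 20;

# Build the transition table: two-character context => list of observed
# successor characters (duplicates kept, so picking at random from the
# list reproduces the source's frequencies).
my %next;
for my $i (0 .. length($text) - 3) {
    my $pair = substr($text, $i, 2);
    push @{ $next{$pair} }, substr($text, $i + 2, 1);
}

# Generate: start from a random context and repeatedly append a random
# successor, sliding the context window forward one character.
my @contexts = keys %next;
my $state    = $contexts[ rand @contexts ];
my $out      = $state;
for (1 .. 200) {
    my $succ = $next{$state} or last;    # dead end: no observed successor
    my $c = $succ->[ rand @$succ ];
    $out .= $c;
    $state = substr($state, 1) . $c;
}
print "$out\n";
```

Using letter triplets instead of pairs (a longer context) makes the output look more like real words at the cost of more verbatim copying from the source.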

      LOL. I can imagine the results. Actually, you really don't have to mix texts if the original (even one author) is sufficiently cryptic. Noam Chomsky has been able to maximize ambiguity in his work and the Chomskybot (written in Perl as well) is even more fun than Chomsky himself. If they don't understand what you're saying, you must be a genius. ;-) Let's hear it for Markov chains!


      After a few years, I learned that what I had implemented is a very common algorithm in a lot of different fields: Markov Chains :).

      yay, markov chains! i use markov chains to generate random posts for my weblog. it uses the database of all other posts to build the transition matrix.

      anders pearson

Re: What are the monks doing with Perl and Linguistics?
by matsmats (Monk) on May 05, 2003 at 17:25 UTC

    Funny that you should ask this today. I was just about to look around the monastery for the same.

    I'm just now finalizing a system to automatically determine whether general news articles in a large newsfeed are 'bad news or good news', i.e. to mark negative events, whether it's negative stock market reports, a local sports team losing, negative criticism, crime, etc.

    Perl has been excellent for implementing such a system and I find the tools at CPAN invaluable, but as I only work with Scandinavian languages, a lot of the modules are not available to me, as they are too specific to English. I would guess that stemmers are a basic tool for everyone doing computational linguistics, though, and they are available for all the languages I need them for.

    Mats Stafseng Einarsen

      Sounds like interesting stuff.

      Jon Kleinberg at Cornell is doing some research into topic "burst" identification in real-time text data streams. You can have a look at some of his preliminary results here. It might be interesting to compare whether your results match his. His stuff is all for English (of course).

      I'd be curious to hear how you're doing your identification---I'm assuming a kind of basic separation of lexemes into categories (positive, negative, neutral), or do you have a level of abstraction there? If you'd like to share, you can reach me via e-mail from my homepage or /msg me.

      Where the language modules are concerned, I really feel for you. Due to time constraints, we have kludged together a sentencizer for French because there was not one readily available to suit our needs, but due to everything else being prioritized above it, it is utter crap (does the job mostly right, but with unpredictable results).


        I've kludged together a sentencizer for Norwegian (based on Text::Sentence) myself, so I know what you mean. It's not that bad, but the fact that some people are working on doing the same for audio streams is rather humbling.

        Kleinberg's work seems interesting. I vaguely remember reading about it in the news, so I look forward to looking into it.

        The negative/positive identifier is company work, so I'll probably be better off not elaborating on it for now.

Re: What are the monks doing with Perl and Linguistics?
by elusion (Curate) on May 05, 2003 at 23:03 UTC
    In my spare time (what little of it there is), I'm working on a Machine Translation engine in Perl. What is that, you ask? Think Babelfish. When I originally approached the project a year and a half ago, I decided to use Ruby. But over time, I realized it didn't have the power I needed.

    Six months ago, I started over, in Perl. I've implemented most of the parsing, though I haven't written much of the grammar, and I'm working on the basics of the generator.

    As far as others using Perl in this area, there are two places to check. The first is a /.-type site (minus the trolls) for linguistics. Do a search for perl and you'll come up with a dozen articles, two of which are on the front page right now.

    The second is A search there turns up some articles. You might also check

    elusion :

      Cool. Thanks a lot for the links. =) So what sort of linguistic model are you using for your engine? Babelfish appears to be a mix of ideas, but a project with a similar name, Babel, uses HPSG (Head-Driven Phrase Structure Grammar).

      Speaking of Babelfish, I'm about to become famous, at least locally. The German radio show Wortlaut will at some point have a contribution from Max Schönherr, who did some interesting, funny stuff with Babelfish and Systran. Anyway, my colleagues and I are going to be on the show (reading stuff---don't want to ruin the surprise).

      I dabble in Ruby myself, so I think you might not be giving it a fair shake, but really anything you can do in Ruby, you can do in Perl---just in a different way. Anyway, there is no Ruby Monks.


        I'm a bit of an amateur at linguistics, but... I'm taking a top-down interlingual approach. That's about all there is to know, at least right now.

        I think part of my beef with Ruby may have been that I didn't have enough experience with it. I've found Perl much easier, though... especially when you get to difficult problems. Also, Perl is quite a bit faster than Ruby.

        elusion :

Re: What are the monks doing with Perl and Linguistics?
by mcarthur (Initiate) on May 06, 2003 at 06:28 UTC
    We're working with an associational framework on English text. The psychology-based framework called HAL (Hyperspace Analogue to Language) creates associations between words (or concepts depending on who you talk to) in text. You can then do some fun dimension reduction techniques like LSA (Latent Semantic Analysis) or Concept Indexing or random projection. All of it is done in perl. We're not using PDL at the moment, but may do so in the future. If you're interested, our publications are here - look at the top for the ECSCW paper for the most recent one.
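For readers unfamiliar with HAL, its core data structure is a word-by-word co-occurrence matrix in which words appearing closer together inside a sliding window get higher weights. A toy core-Perl sketch of that idea (my own reading of the published descriptions, not the poster's actual code):

```perl
use strict;
use warnings;

# Toy HAL-style co-occurrence matrix: within a sliding window of the
# preceding words, each (word, context-word) pair accumulates a weight
# of (window_size - distance + 1), so nearer words count for more.
my $window = 5;
my @words  = split ' ',
    'the cat sat on the mat the dog sat on the log';

my %cooc;    # $cooc{$word}{$context_word} = accumulated weight
for my $i (0 .. $#words) {
    for my $d (1 .. $window) {
        last if $i + $d > $#words;
        $cooc{ $words[$i + $d] }{ $words[$i] } += $window - $d + 1;
    }
}

# Each word's row is its context vector; words used in similar
# contexts end up with similar rows, which is what dimension-reduction
# techniques like LSA then exploit.
for my $w (sort keys %cooc) {
    my $row = $cooc{$w};
    print "$w: ",
        join(', ', map { "$_=$row->{$_}" } sort keys %$row), "\n";
}
```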

      I myself am interested in semantic/knowledge extraction, association, and representation. I really like the idea of concept indexing, and even though there is a practical side to all of this, I was thinking of the value of such research to large-scale socio-psychological research, where accurate generalizations of individual behavior within a group take center stage.

      We're working on collocation extraction for a French dictionary we are building. I plan on using part of our corpus for categorizing lexemes according to an ontology I plan to extract from a broader range of corpora--basically using pre-existing encyclopedic knowledge to build an ontology instead of creating the ontology beforehand. I plan to use XML topic maps to do this. (I'm not even vaguely close to an implementation.)


Re: What are the monks doing with Perl and Linguistics?
by Mur (Pilgrim) on May 09, 2003 at 19:21 UTC
    Well, I dunno if this qualifies: we're using Lingua::* modules to analyze words for indexing on web pages. Specifically, if a user searches for "advertising", we check words for common stems and so find --
    • ... advert
    • ... advertise
    • ... advertised
    • ... advertiser
    • ... advertisers
    • ... advertises
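Lingua::Stem on CPAN does this properly; purely as a toy illustration of how a common stem lets these forms index together, here is a deliberately crude suffix-stripper in core Perl (not the module's actual algorithm):

```perl
use strict;
use warnings;

# A deliberately crude stemmer: strip the longest matching suffix from
# a small fixed list. Lingua::Stem on CPAN implements proper
# (Porter-style) stemming; this toy only shows why stemming makes
# related words fall into the same index bucket.
my @suffixes = sort { length($b) <=> length($a) }
               qw(isers ising ises ised iser ise ers ing ed er es e s);

sub crude_stem {
    my $word = lc shift;
    for my $suf (@suffixes) {
        return substr($word, 0, -length($suf))
            if length($word) > length($suf) + 2
            && substr($word, -length($suf)) eq $suf;
    }
    return $word;
}

# Words sharing a stem land in the same bucket.
my %bucket;
for my $w (qw(advertising advertise advertised advertiser advertisers advertises)) {
    push @{ $bucket{ crude_stem($w) } }, $w;
}
for my $stem (sort keys %bucket) {
    print "$stem => @{ $bucket{$stem} }\n";
}
```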
    Jeff Boes
    Database Engineer
    Nexcerpt, Inc.
    vox 269.226.9550 ext 24
    fax 269.349.9076
    ...Nexcerpt...Connecting People With Expertise

      Very interesting stuff. I had a look at the "nexcerpts" on your site. Yes, the Lingua derivational morphology modules (looks like Stem, Infinitive, Inflect) have provided some good results. It made me think about how I might go about doing something similar.

      One thing that might make your searches better is some way to account for morphology that is not just stem + ending, like pronounce/pronunciation/pronouncement. Also, grouping (near-)synonyms like "brotherly" and "fraternal" may improve your results. Of course my examples are a bit textbookish, but I'm sure that you can refine things using your expert knowledge about what sort of information your clients might want to look up.


Re: What are the monks doing with Perl and Linguistics?
by MrYoya (Monk) on May 09, 2003 at 21:16 UTC
    My job is purely perl and linguistics, and although I'm not a linguist by formal education I do have an interest in computational linguistics. I do a lot of work with WordNet and the perl modules out there, like Lingua::* and making my own modules. I'd say about a third to half of my work is just research such as reading papers, books, etc. However, I am not an expert in either linguistics or perl (yet).

    I would tell you the kinds of things I'm working on, but that'll have to wait. ;)

      You're very lucky to be able to do so much background reading. =)

      I tend to think that for linguistics, a formal education helps a lot. But then again, I tend to disagree with the methods for problem solving used by many linguists out there. For me, doing modern science involves knowing a bit about something, forming a hypothesis, collecting data, analysing data, and drawing conclusions from it, not just coming up with an idea, thinking about it for a while, and then coming to a conclusion based on intuition. That said, people using computers to help them do linguistics tend to have their reasons for doing so, something that usually involves crunching lots of data. As long as that data is in there, I'm happy (pretty much).

      Another thing you reminded me of, and which may be of general interest, is the source of a lot of contention in computational linguistics projects. Often NLP (Natural Language Processing) is done by computer scientists who were not trained as linguists. Since CS is generally a very math-based discipline, people with CS backgrounds often search for mathematical solutions to the problems they encounter. You have a problem, analyse data, and come up with an algorithm to solve that problem. Often this analysis is problematic. Linguists tend to think in terms of what one might call "psychological reality", which simply means that the algorithms used to solve a particular linguistic problem should reflect human language processing as much as possible. There can definitely be multiple solutions to a particular problem, but whereas CS people tend to look for simple and efficient solutions, to linguists it is still important to model human cognition along the way. Such systems tend to be more robust, which is a Good Thing.


Re: What are the monks doing with Perl and Linguistics?
by PetaMem (Priest) on May 12, 2003 at 09:22 UTC
    Aaaah. Such an exquisite thread, and I come to read it only now (via a thread referencing it). Have a look at our site for some demos. The company's products run 99% on Perl.


      I played with the re-diacriticizer and had a bit of fun. It did correctly recognize the ambiguity of certain phrases/words like (den) Lastern (vices, trucks/lorries) vs. das Lästern (backbiting, trash talking), and so on. Definitely fun to play with. No code, though =(. At least none I could find. It would be interesting to see how it works.
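Pure speculation on my part, but the simplest way such a re-diacriticizer could work is a dictionary mapping each de-accented form to its known spellings, picking the most frequent one (a real system would need context to separate den Lastern from das Lästern). A toy sketch with made-up frequency counts:

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical frequency dictionary: de-accented form => candidate
# spellings with invented corpus counts. A real tool would derive
# these from a corpus and use surrounding words to break ties.
my %forms = (
    'lastern' => { 'Lastern' => 10, 'Lästern' => 7 },
    'fur'     => { 'für'     => 99, 'Fur'     => 1 },
);

# Restore diacritics by picking the most frequent known candidate;
# unknown words pass through unchanged.
sub restore {
    my $w    = lc shift;
    my $cand = $forms{$w} or return $w;
    my ($best) = sort { $cand->{$b} <=> $cand->{$a} } keys %$cand;
    return $best;
}

print restore('fur'), "\n";
```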


Re: What are the monks doing with Perl and Linguistics?
by Anonymous Monk on Jun 11, 2003 at 19:18 UTC

      Very nice. A little more interesting for Historical Linguistics than for Computational Linguistics, but quite a good link for me personally. It has the Appendix Probi, which is basically a list of "don't write that, write this", which any Perl learner can identify with. ;) Thank you.


Re: What are the monks doing with Perl and Linguistics?
by Willard B. Trophy (Hermit) on Oct 07, 2003 at 20:40 UTC
    Collins Dictionaries were doing a lot of corpus linguistics using Perl when I left, back in 2002. They look after the Collins/Birmingham University Bank of English, which is a great big huge corpus. There are also a variety of monitor corpora, which are used to gauge changes in usage over time.

    Corpus data collection got a whole lot easier with the web ... ☺ -- Sitescooper is particularly handy for large-scale text collection (with permission, of course).

    bowling trophy thieves, die!

      I'm currently researching cross-lingual digital libraries and I use Perl, although I am fairly new to the language. I have just finished writing a light stemmer, some n-gram code, and some n-gram comparison code, and basically I'm at that 'generating stats' stage. I'm looking for similarities between documents, and differences too, and then looking at language and context, and so on. The idea is to make documents searchable in many different languages. I did a masters where I used Java and made a system that could retrieve a similar English document in French, and it kinda worked ;) I'm always interested in hearing what others are up to in that area---maybe we can swap some tools and share some ideas!! Ceejay
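The n-gram comparison step mentioned above is commonly done with character-n-gram count profiles and a cosine measure; a minimal sketch of that standard technique (my own toy, not Ceejay's code):

```perl
use strict;
use warnings;

# Build a character-trigram count profile for a text.
sub ngrams {
    my ($text, $n) = @_;
    my %count;
    $text = lc $text;
    $text =~ s/\s+/ /g;    # collapse whitespace
    $count{ substr($text, $_, $n) }++ for 0 .. length($text) - $n;
    return \%count;
}

# Cosine similarity between two count profiles: 1.0 for identical
# profiles, near 0 for texts sharing few n-grams.
sub cosine {
    my ($p, $q) = @_;
    my ($dot, $np, $nq) = (0, 0, 0);
    $dot += $p->{$_} * ($q->{$_} || 0) for keys %$p;
    $np  += $_ ** 2 for values %$p;
    $nq  += $_ ** 2 for values %$q;
    return $np && $nq ? $dot / sqrt($np * $nq) : 0;
}

my $en  = ngrams('the cat sat on the mat', 3);
my $en2 = ngrams('the dog sat on the log', 3);
my $fr  = ngrams('le chat est assis sur le tapis', 3);

# Same-language texts share far more trigrams than cross-language ones.
printf "en vs en2: %.2f\n", cosine($en, $en2);
printf "en vs fr:  %.2f\n", cosine($en, $fr);
```

The same profiles double as a crude language identifier: compare a document's profile against per-language reference profiles and pick the nearest.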

Node Type: perlmeditation [id://255595]
Approved by integral
Front-paged by broquaint