
Re: Junk NOT words

by Molt (Chaplain)
on Oct 30, 2002 at 17:04 UTC

in reply to Junk NOT words

Not sure why you need this, but one possibility for an initial feel would be to look into building up a letter-by-letter Markov Chain from a large sample of text, and then use that to see how 'likely' the phrase is to be a word based on letter distribution. This will at least give you an idea of the Englishness of a phrase.
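A minimal sketch of that idea, assuming a letter-bigram model (the training text, the start marker, and the add-one smoothing are my own choices, not from the thread): train counts on a sample, then score a candidate string by the average log-probability of each letter given the previous one.

```perl
use strict;
use warnings;

# Build letter-bigram counts from a text sample. '^' marks a word start.
sub train {
    my ($text) = @_;
    my %count;
    for my $word (split /\W+/, lc $text) {
        my @letters = ('^', split(//, $word));
        for my $i (0 .. $#letters - 1) {
            $count{ $letters[$i] }{ $letters[$i + 1] }++;
        }
    }
    return \%count;
}

# Average log-probability per letter; higher means "more English-like".
sub score {
    my ($model, $string) = @_;
    my @letters = ('^', split(//, lc $string));
    my ($log_prob, $n) = (0, 0);
    for my $i (0 .. $#letters - 1) {
        my $row   = $model->{ $letters[$i] } || {};
        my $total = 0;
        $total += $_ for values %$row;
        # Add-one smoothing over 26 letters so unseen pairs aren't fatal
        my $p = ( ($row->{ $letters[$i + 1] } || 0) + 1 ) / ($total + 26);
        $log_prob += log($p);
        $n++;
    }
    return $n ? $log_prob / $n : 0;
}

my $model = train("the quick brown fox jumps over the lazy dog and then the fox ran");
printf "%s: %.2f\n", $_, score($model, $_) for qw(then xzqj);
```

With a realistic amount of training text, an English-like string such as "then" scores noticeably higher than junk like "xzqj"; the threshold between the two is the scoring metric you would need to tune.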

This would need playing with to see if it works reasonably, and to tune the scoring metric. It will happily recognise Lewis Carroll-style nonsense as real words ('bewarethejabberwockmyson'), though, which is bad if you literally want to check for real words, but good when you know your dictionary is potentially more limited than the words you're identifying.

One final warning: beware of the three/four-letter thing. It'll break on word boundaries unless you add all possible letter combinations where one word meets another. Better to treat these as penalties.

Re: Re: Junk NOT words
by broquaint (Abbot) on Oct 30, 2002 at 17:44 UTC
    Further to Molt's suggestion of using Markov Chains, there is a module on CPAN which provides their functionality in the form of Algorithm::MarkovChain.
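A minimal sketch of how the module might be wired up for letters (this usage is my own assumption, not from the thread). Note that Algorithm::MarkovChain is geared toward generating sequences via spew() rather than scoring them, so for the likelihood test you would still need to walk its transition counts yourself:

```perl
use strict;
use warnings;
use Algorithm::MarkovChain;

my $chain = Algorithm::MarkovChain->new();

# Seed the chain one word at a time, as sequences of letters
for my $word (qw(beware the jabberwock my son)) {
    $chain->seed(symbols => [ split //, $word ], longest => 2);
}

# Generate a plausible-looking letter sequence from the learned chain
my @letters = $chain->spew(length => 8);
print join('', @letters), "\n";
```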


Re: Re: Junk NOT words
by kshay (Beadle) on Oct 31, 2002 at 16:25 UTC
    I wonder if you'd get better results by building a Markov chain based not on the frequency with which a letter follows another individual letter, but on the frequency with which a letter follows a given pair of letters.

    I recently did something like this for a web application that goes the other direction: instead of trying to recognize letter combinations as "words," it generates "words" that "make sense" phonetically based on a given data set—in this case, lists of names. You can play with it here. Try a couple of different categories and you'll see that the resulting made-up names are quite phonetically distinctive; they really do seem French or Shakespearean (and within that, male or female) depending on what category you select.

    I got the basic approach from this page by Chris Pound. I can't post the code because I wrote it for my employer, but basically I took each list of names, calculated the frequencies, and generated a Perl library containing a long series of assignments like this:

    @{$pairs{'ri'}}{qw(a c e n s u)} = (3, 1, 1, 3, 1, 1);

    That means that within this data set, "ri" is followed by "a" three times, "c" one time, etc.
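    Since I can't post the original, here is a hypothetical sketch of the same idea (the name list, the "^^"/'$' boundary markers, and the length cap are my own assumptions): tally which letter follows each pair of letters in a name list, then random-walk the table to invent a new name.

```perl
use strict;
use warnings;

my @names = qw(marie adrienne juliette henriette antoinette margot);

# %pairs{$pair}{$letter} = count, the same shape as the generated library
my %pairs;
for my $name (@names) {
    my $padded = "^^" . $name . '$';          # markers for start and end
    for my $i (0 .. length($padded) - 3) {
        my $pair = substr($padded, $i, 2);
        my $next = substr($padded, $i + 2, 1);
        $pairs{$pair}{$next}++;
    }
}

# Weighted random choice of the letter that follows a given pair
sub next_letter {
    my ($pair) = @_;
    my $row = $pairs{$pair} or return '$';
    my $total = 0;
    $total += $_ for values %$row;
    my $roll = rand($total);
    for my $letter (keys %$row) {
        return $letter if ($roll -= $row->{$letter}) < 0;
    }
}

sub make_name {
    my $name = "^^";
    while (1) {
        my $next = next_letter(substr($name, -2));
        last if $next eq '$' or length($name) > 14;
        $name .= $next;
    }
    return substr($name, 2);   # strip the start markers
}

print make_name(), "\n" for 1 .. 3;
```

    Because every step conditions on two letters of context, the output tends to be pronounceable and to echo the flavor of the training list, which is exactly the phonetic distinctiveness described above.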

    Anyway, I'd be curious how this approach might work for word recognition, and how the results for "pair-letter" frequencies might differ from those obtained with "letter-letter" frequencies.