
Junk NOT words

by Anonymous Monk
on Oct 30, 2002 at 16:49 UTC
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to think my way through the problem of knowing when an arbitrary string of characters is just junk, as opposed to some arbitrary combination of words with all punctuation removed.

For instance -- I want a script to recognize:

hudjheuhfnnvgxvbchnfhfujryyfgbch is not a phrase
but: whererangesarealltherage is a nonsensical phrase

My one thought is to try to break it down into bits, using verbs to help in the process. One could then state that a string of consonants longer than n characters cannot be a word. Having done that, one could do a dictionary search on the bits containing vowels to see if they appear in words -- or better still, create a dictionary of three- or four-character bits which could be parts of actual words.
The only three- or four-character bit I can quickly see in the first arbitrary string that is part of a word is "hud".

Perhaps there are better suggestions for solving this problem.

Replies are listed 'Best First'.
Re: Junk NOT words
by Molt (Chaplain) on Oct 30, 2002 at 17:04 UTC

    Not sure why you need this, but one possibility for an initial feel would be to look into building up a letter-by-letter Markov Chain from a large sample of text, and then use that to see how 'likely' the phrase is to be a word based on letter distribution. This will at least give you an idea of the Englishness of a phrase.

    This would need playing with to see if it works reasonably, and to tune the scoring metric. It will happily recognise Lewis Carroll-style nonsense as real words ('bewarethejabberwockmyson'), though, which is bad if you literally want to check that they're real words, but good when you know your dictionary is potentially more limited than the words you're identifying.

    One final warning: beware of the three/four-letter thing. It'll break on word boundaries unless you add all the possible letter combinations where one word meets another. Better to treat these as penalties.
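    The Markov-chain scoring Molt describes could be sketched roughly as below. Everything here is illustrative: the training sample is a toy stand-in for a large corpus, and the add-one smoothing is just one way to keep unseen bigrams from being fatal.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Train a letter-bigram model from sample text, then score candidate
# strings by average log-probability per bigram. Higher = more
# English-like. Replace $sample with a large corpus for real use.
my $sample = lc 'the quick brown fox jumps over the lazy dog where ranges are all the rage';
$sample =~ s/[^a-z]//g;    # letters only, matching the OP's input

my (%bigram, %unigram);
for my $i (0 .. length($sample) - 2) {
    my $pair = substr $sample, $i, 2;
    $bigram{$pair}++;
    $unigram{ substr $pair, 0, 1 }++;
}

sub score {
    my $str = lc shift;
    $str =~ s/[^a-z]//g;
    my ($logp, $n) = (0, 0);
    for my $i (0 .. length($str) - 2) {
        my $pair  = substr $str, $i, 2;
        my $first = substr $pair, 0, 1;
        # add-one smoothing over the 26 possible following letters
        my $p = (($bigram{$pair} // 0) + 1) / (($unigram{$first} // 0) + 26);
        $logp += log $p;
        $n++;
    }
    return $n ? $logp / $n : 0;
}

printf "%-35s %.3f\n", $_, score($_)
    for 'whererangesarealltherage', 'hudjheuhfnnvgxvbchnfhfujryyfgbch';
```

    With any reasonable training text, the run-together English phrase scores noticeably higher than the junk string; the cut-off threshold is the part that needs tuning.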

      Further to Molt's suggestion of using Markov Chains, there is a module on CPAN which provides their functionality in the form of Algorithm::MarkovChain.


      I wonder if you'd get better results by building a Markov chain based not on the frequency with which a letter follows another individual letter, but on the frequency with which a letter follows a given pair of letters.

      I recently did something like this for a web application that goes the other direction: instead of trying to recognize letter combinations as "words," it generates "words" that "make sense" phonetically based on a given data set—in this case, lists of names. You can play with it here. Try a couple of different categories and you'll see that the resulting made-up names are quite phonetically distinctive; they really do seem French or Shakespearean (and within that, male or female) depending on what category you select.

      I got the basic approach from this page by Chris Pound. I can't post the code because I wrote it for my employer, but basically I took each list of names, calculated the frequencies, and generated a Perl library containing a long series of assignments like this:

      @{$pairs{'ri'}}{qw(a c e n s u)} = (3, 1, 1, 3, 1, 1);

      That means that within this data set, "ri" is followed by "a" three times, "c" one time, etc.

      Anyway, I'd be curious how this approach might work for word recognition, and how the results for "pair-letter" frequencies might differ from those obtained with "letter-letter" frequencies.
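      The table-building step described above might look something like this. The name list is made up purely for illustration, and the output format just mimics the generated assignments shown earlier:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the "letter pair -> next letter" frequency table that the
# generated library encodes as assignments like
#   @{$pairs{'ri'}}{qw(a c e n s u)} = (3, 1, 1, 3, 1, 1);
# The name list here is illustrative; feed it a real list in practice.
my @names = qw(maria carina rianne rick risa erin);

my %pairs;
for my $name (@names) {
    my $n = lc $name;
    for my $i (0 .. length($n) - 3) {
        my $pair = substr $n, $i, 2;
        my $next = substr $n, $i + 2, 1;
        $pairs{$pair}{$next}++;
    }
}

# Emit the table in the same form as the generated Perl library.
for my $pair (sort keys %pairs) {
    my @letters = sort keys %{ $pairs{$pair} };
    printf "\@{\$pairs{'%s'}}{qw(%s)} = (%s);\n",
        $pair, join(' ', @letters),
        join(', ', map { $pairs{$pair}{$_} } @letters);
}
```

      For word recognition rather than generation, the same table would be read instead of written: look up each pair/next-letter transition in the candidate string and accumulate a likelihood.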

Re: Junk NOT words
by jarich (Curate) on Oct 31, 2002 at 01:49 UTC
    Some while ago there was an excellent article in The Perl Journal about simulating typos on both qwerty and dvorak keyboards.

    Sean M. Burke's premise was that typos on dvorak keyboards tended to look like weird words rather than random mistypings, unlike their counterparts on qwerty keyboards. In the course of that discussion he wrote a plausibility function I think you may find very useful.

    My suggestion is that you take a very large piece of text (grab something off Gutenberg), remove all the spaces, and look at the occurring 3-letter groupings (trigraphs). Then examine the text you're facing. With some experimentation you should be able to determine an approximate cut-off level whereby your text is probably a nonsensical phrase vs. a non-phrase.

    Studies of language for the purposes of cryptography have shown that of all possible trigraphs only some small percentage are actually common in typical language. You should have no problem distinguishing rubbish from language this way. :)

    Hope it helps.
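    A minimal sketch of this trigraph test, with a toy training string standing in for the large Gutenberg text: count what fraction of the candidate's trigraphs were ever seen in training.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect trigraphs from a training text with the spaces removed,
# then measure what fraction of a test string's trigraphs were seen
# in training. The training text here is a toy; use a big corpus.
my $training = lc 'where ranges are all the rage the quick brown fox jumps over the lazy dog';
$training =~ s/[^a-z]//g;

my %seen;
$seen{ substr $training, $_, 3 }++ for 0 .. length($training) - 3;

sub trigraph_coverage {
    my $str = lc shift;
    $str =~ s/[^a-z]//g;
    my ($hits, $total) = (0, 0);
    for my $i (0 .. length($str) - 3) {
        $hits++ if $seen{ substr $str, $i, 3 };
        $total++;
    }
    return $total ? $hits / $total : 0;
}

printf "%-35s %.2f\n", $_, trigraph_coverage($_)
    for 'whererangesarealltherage', 'hudjheuhfnnvgxvbchnfhfujryyfgbch';
```

    The cut-off between "mostly seen" and "mostly unseen" is the level that needs experimental tuning, as described above.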


Re: Junk NOT words
by davis (Vicar) on Oct 30, 2002 at 17:27 UTC
    Quick, ill-considered node before I dash off home:

    Why not measure the letter distribution of the sample being tested? I.e., count the number of 'a's, 'b's, and so on. You could then compare the calculated distribution with one appropriate for your language (tables of which can almost certainly be found by googling), and if the actual result is within acceptable bounds, assume it's OK.

    A little further explanation: if the input is random(ish), then the letter distribution should be fairly close to uniform (ignoring keyboard patterns). Real written English (at least) has a much higher proportion of 'e's, 't's, and 's's.
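    One rough way to sketch this comparison; the frequency table below holds the oft-quoted approximate percentages for the most common English letters, so treat the numbers as ballpark, and the squared-difference metric is just one simple choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Compare a string's letter distribution against approximate English
# letter frequencies (percent). Lower distance = more English-like.
my %english = (e => 12.7, t => 9.1, a => 8.2, o => 7.5, i => 7.0,
               n => 6.7, s => 6.3, h => 6.1, r => 6.0);

sub distance {
    my $str = lc shift;
    $str =~ s/[^a-z]//g;
    my %count;
    $count{$_}++ for split //, $str;
    my $len = length $str;
    # sum of squared differences between observed and expected percent
    my $d = 0;
    for my $letter (keys %english) {
        my $observed = 100 * ($count{$letter} // 0) / $len;
        $d += ($observed - $english{$letter}) ** 2;
    }
    return $d;
}

printf "%-35s %.1f\n", $_, distance($_)
    for 'whererangesarealltherage', 'hudjheuhfnnvgxvbchnfhfujryyfgbch';
```

    On short strings the margin is small, which is why the pair- and trigraph-based methods in the other replies tend to discriminate better.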

    Is this going out live?
    No, Homer, very few cartoons are broadcast live - it's a terrible strain on the animator's wrist
      I have had some success using letter pair frequencies as a language identifier for OCR'd text -- letter pair distribution is a much better metric than just letter distribution. Pick out all of the two-letter pairs in your string, and compare the ten most frequent pairs against some pairs you got from analyzing a training text ( just remember to take out the whitespace from the training text first ). If it exceeds some similarity threshold, it is 'real' text, otherwise gibberish.

      For example, my own values for most common letter pairs in English are these:

      English => ['he','th','in','er','an','ou'],

      I have found that a better than 50% match is a pretty reliable indicator.
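      A simplified variant of this test, for illustration: rather than ranking the ten most frequent pairs, it just checks what fraction of the reference pairs occur in the candidate string at all. The reference list is the one given above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Score a string by what fraction of common English letter pairs it
# contains. This is a simplification of the "top ten pairs" method;
# the reference list is the English set from the node above.
my @english_pairs = qw(he th in er an ou);

sub pair_score {
    my $str = lc shift;
    $str =~ s/[^a-z]//g;
    my %present;
    $present{ substr $str, $_, 2 }++ for 0 .. length($str) - 2;
    my $hits = grep { $present{$_} } @english_pairs;
    return $hits / @english_pairs;
}

printf "%-35s %.2f\n", $_, pair_score($_)
    for 'whererangesarealltherage', 'hudjheuhfnnvgxvbchnfhfujryyfgbch';
```

      The full method compares ranked frequency lists per language, which is what makes it usable as a language identifier rather than just a junk detector.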

      Be aware that the path you are going down quickly leads to AI-complete problems in natural language processing. This is another one of those programming tasks that seems very easy until you try to do it on real data.

Re: Junk NOT words
by BrowserUk (Pope) on Oct 30, 2002 at 19:24 UTC

    This is a little crude, but given a less eclectic words file than the one I have, it might do better. In any case, it shouldn't be too hard to come up with a heuristic for making the decision, possibly based on the number or proportion of 1 and 2 character words or the average word length for example.

    This code


    C:\test>209141
    hudjheuhfnnvgxvbchnfhfujryyfgbch may consist of the words hu dj he uhf n nv g xv bch n f h fu j ry y f g bch
    whererangesarealltherage may consist of the words where range sare all the rage
    thequickbrownfoxjumpsoverthelazydog may consist of the words the quick brown fox jumps over the lazy dog
    hsdkjfhuewfklJLKSDossfdlkjaslduasl may consist of the words hs dk j f hu ew f kl J L KS Doss f d l k jas l du as l
    PeterPiperPickedAPeckOfPickledPepper may consist of the words Peter Piper Pick edA Peck Of Pickled Pepper
    lazinessimpatiencehubris may consist of the words laziness impatience hubris

    C:\test>

    Nah! You're thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy
Re: Junk NOT words
by Dr. Mu (Hermit) on Oct 31, 2002 at 06:41 UTC
    If you go for the digraph and trigraph methods, be sure that the text you analyze for letter-pair and -triple frequencies is similar to what you'll be looking at. You mentioned wanting to recognize an "arbitrary combination of words". This is much different than recognizing words combined to form an English sentence. If you analyze plain text from, say, a novel, you'll get a higher preponderance of "th", "he", and "the" than you would from a jumble of words picked at random from the dictionary.

    Also, remember that digraph and trigraph analysis is probabilistic. This means that the larger the string of characters you're trying to analyze (i.e. the bigger your sample size), the better your chances of making the correct inference. Bear in mind that a string of 50 characters contains only 48 trigraphs. This doesn't scratch the surface of even the most frequent of the 17576 possible three-letter combinations, so simple frequency analysis may not be meaningful. Better would be to categorize all 17576 trigraphs into just a few groups by rank, according to the sample text you've pre-analyzed. Partitioning them into ten 10-percentile groupings would be one example. Then each trigraph in your subject string would simply be a member of one of the ten groups. The more evenly distributed they are among these ten groups, the better the chances that you have a string of words.

    This is a very interesting problem (to me anyway), and I hope you'll keep the monks posted on your progress!
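    The percentile-grouping idea could be sketched as follows; the toy training string stands in for the pre-analyzed sample text, and unseen trigraphs are shunted into an extra bucket (11) so they remain visible in the histogram.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rank the trigraphs of a training text by frequency and assign each
# to one of ten equal-sized groups (1 = most frequent decile). A
# candidate string is then described by how its trigraphs spread
# across the groups, rather than by raw counts.
my $training = lc 'the quick brown fox jumps over the lazy dog where ranges are all the rage';
$training =~ s/[^a-z]//g;

my %freq;
$freq{ substr $training, $_, 3 }++ for 0 .. length($training) - 3;

my @ranked = sort { $freq{$b} <=> $freq{$a} } keys %freq;
my %group;
for my $i (0 .. $#ranked) {
    $group{ $ranked[$i] } = 1 + int( 10 * $i / @ranked );  # deciles 1..10
}

# Histogram of group membership for a candidate string; trigraphs
# never seen in training land in an extra bucket, 11.
sub group_histogram {
    my $str = lc shift;
    $str =~ s/[^a-z]//g;
    my %hist;
    for my $i (0 .. length($str) - 3) {
        $hist{ $group{ substr $str, $i, 3 } // 11 }++;
    }
    return \%hist;
}

my $h = group_histogram('whererangesarealltherage');
print "group $_: $h->{$_}\n" for sort { $a <=> $b } keys %$h;
```

    As Dr. Mu notes, the evenness of the spread across the ten groups, not the raw frequencies, is what signals "real words" on small samples.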

Re: Junk NOT words
by fruiture (Curate) on Oct 30, 2002 at 17:51 UTC
    sub resolve {
        my $strref = shift;
        my $strpos = shift;
        my $dictfh = shift;

        return [] if length($$strref) == $strpos;

        my $words;
        my $reset_fh = tell($dictfh);
        seek $dictfh, 0, 0;
        while (defined(local $_ = <$dictfh>)) {
            chomp;
            if (substr($$strref, $strpos, length) eq $_) {
                print "$_ at $strpos\n";
                if (my $w = resolve($strref, $strpos + length, $dictfh)) {
                    $words = [$_, @$w];
                    last;
                }
                print "<- $strpos\n";
            }
        }
        seek $dictfh, $reset_fh, 0;
        return $words;
    }

    open my $dfh, '<', '/usr/dict/words' or die $!;
    my $string = 'whererangesarealltherage';
    my $words = resolve(\$string, 0, $dfh);
    print $words ? "@$words" : 'no words', "\n";
    close $dfh;

    I've included some output that shows how the program "backtracks".

Re: Junk NOT words
by bart (Canon) on Oct 31, 2002 at 10:39 UTC
    Focus on the vowels. There are always fewer vowels than consonants, especially in junk, and all (English) words need at least one. So pick the next vowel, and see if you can make at least one word of it if you take a few letters in front and after it as well. Or: is this snippet of a few letters a substring of an existing word?

    For many letters, that will be the case. For example, in your phrase I can recognize "a", "real", "he", "her", "age". So don't throw away too many candidates upfront: none of these words is actually one of the words the phrase consists of. Just because you found a word doesn't mean it is the word. Note that you can hardly find any words at all in the junk string, so that one is quickly dismissed.

    Well, eventually, you'll want them all to fit the whole string, so you have only words and no excess letters between them. (Er, what do you do with spelling errors?) So if you do find a likely substring around a vowel, concentrate on the next vowel that isn't part of this word. You should find two words that fit both sequences with no letters between them. If you don't, for any group around each vowel, the string must be junk.

    Well, HTH.

        There are always fewer vowels than consonants, especially in junk, and all (English) words need at least one.

      'Rhythms' doesn't have a vowel. ;)

        a e i o u and sometimes y.
        I remember hearing once that the only word in the English dictionary that doesn't have a vowel is 'nth'.
      Always? You bait us.

      I recall the word "strength" which has a lone vowel and count 'em, 7 consonants!

      I have a word file I made by stripping the entry tags from (and massaging a bit) the dictionary on Gutenberg. I suppose I could scan it to see a histogram of the actual proportion, if I were so inclined.

      all (English) words need at least one (vowel)

      The vast majority, but not all. Two counterexamples that come to mind offhand are "cwm" (meaning cirque) and "nth" (first, second, ..., nth). Those are both uncommon enough that they might never occur in the data in question, but we don't really know enough about the problem to make that assumption.

      Update: Sorry, I didn't notice at first that bizzach had already mentioned "nth" above. No plagiarism intended ;-)

              $perlmonks{seattlejohn} = 'John Clyman';

      There are also a whole host of places in English stripped of punctuation where non-vowel-containing "words" would crop up. E.g. Mr Mrs Dr Jnr etc.

      Nah! You're thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy
        OK, OK, I concur! So there are a few sequences of consonants with no vowel that can be considered words. However, there aren't many. The whole mechanism can remain the same, except that being left with a string of consonants between words doesn't immediately mean failure. Now you'll have to do an additional check to see whether such a string consists of a sequence of these exceptions. I think they are such a small minority that, for speed, it must be worth extracting them all from a dictionary and storing them separately in a data file before you even start.

        Am I alone in feeling that the whole search system I proposed is very similar to how a regex may try to match a pattern, in a "penny machine"? Pick a candidate, try every possibility with it in turn, backtrack...

Re: Junk NOT words
by gjb (Vicar) on Oct 30, 2002 at 19:41 UTC

    I'm rather surprised to see this problem turn up. Could you (perhaps in private) tell me why it is of interest to you?

    The reason is that I spent some time on it for a commercial project I'm unfortunately not at liberty to discuss. There's a very nice and elegant solution to it.

    Regards, -gjb-

Re: Junk NOT words
by jotti (Scribe) on Nov 01, 2002 at 11:55 UTC
    I once made a program that analysed text and built a frequency table for letter triplets. Using this table I then created nonsense text. A sample text file of 1 kB was enough to produce a table that could create nonsense text that was unmistakably English. Or German, Swedish, or whatever language I sampled.

    If you have sampled a few megabytes of plain English text, you will have a pretty accurate table of triplet frequencies. Then you could sample another megabyte of text and pick sentences from it. For each sentence, take each triplet and sum its frequencies, then divide the sum by the number of triplets in the sentence, which gives you an average. Compute the average of all the averages, and the standard deviation of the averages. This gives a picture of what is expected in real English, and the standard deviation tells how reliable the method is. This method would, OTOH, recognise my nonsense English as real English.

Re: Junk NOT words
by Anonymous Monk on Oct 31, 2002 at 15:56 UTC
    Before you do a dictionary search you could (especially if your text is large enough) calculate the distribution frequency of the text (i.e. the percentage of each letter). For large amounts of text the distribution frequency of a language is stable, and tables exist for the distribution frequency of each letter in the English language.
Re: Junk NOT words
by John M. Dlugosz (Monsignor) on Oct 31, 2002 at 21:46 UTC
    I think it would be easy enough to check a dictionary for a prefix string, adding another character if it isn't one; then recurse on the rest of the string. Backtrack, trying a longer prefix, if it doesn't pan out down the road.

    In fact, a Perl6 pattern could easily do this.

    In Perl5 regexes, I would think the "code" assertion would do, but the docs are inconsistent and I've never used it, since the docs scared me off with "experimental". The issues are (1) can it "fail", and (2) how to get the most recent pattern's contents.

Re: Junk NOT words
by pizza_milkshake (Monk) on Nov 01, 2002 at 20:37 UTC
    This seems to do the trick:
    #!/usr/bin/perl -l
    # file storing all legitimate words, one per line
    my $DICT = qq(/usr/share/dict/words);
    # terms to test for validity
    my @TEST = qw(
        pizzamilkshake perlmonks pearlmonks hellothere
        heytherebaby hey123 timexyz whereangelsare
    );
    # build dict hash
    my %WORDS;
    open DICT, "< $DICT" or die $!;
    while (<DICT>){
        chomp;              # chop off whitespace if it's there
        $WORDS{uc $_}++;    # force UC, add key to %WORDS
    }
    close DICT;
    my (@subs, $giveup, $p1, $p2, $sub);
    for my $test (@TEST){
        $test = uc $test;          # force word to UC at the beginning
        @subs = ();                # reset sub-match array
        $giveup = $p1 = $p2 = 0;   # not giving up, starting at the start
        while (
            !$giveup &&            # else we've thrown in the towel
            $p1 < length($test)    # else found match
        ){
            $p1++;
            $sub = substr $test, $p2, $p1 - $p2;   # grab next substring
            #print STDERR "sub: $sub";
            if ($WORDS{$sub}){     # if it matches a legal word...
                #print STDERR "MATCH ($sub) in $test";
                push @subs, [ $p1, $p2 ];  # successful path, save it
                $p2 = $p1;         # advance p2 to the end of the current match
            } elsif ($p1 >= length($test)){  # at the end of the string with no match
                # if the entire string doesn't match a word or we have nowhere
                # to backtrack...
                if ($p2 == 0 || @subs == 0){
                    #print STDERR "giving up on $test";
                    $giveup++;     # nowhere to go
                } else {
                    #print STDERR "backtracking...";
                    # reset p1 and p2 to last state and try for a longer match
                    ($p1, $p2) = @{$subs[$#subs]};
                    pop @subs;     # delete last item... its path is a dead end
                }
            }
        }
        print "$test: " . ($giveup ? "NO" : "YES");
    }
    perl -MLWP::Simple -e'getprint ""' |less
      I've put an updated copy of this script at:

      I've added some simple optimizations to noticeably speed up performance on longer examples.
      perl -e'$_=q#: 13_2: 12/"{>: 8_4) (_4: 6/2"-2; 3;-2"\2: 5/7\_/\7: 12m m::#;s#:#\n#g;s#(\D)(\d+)#$1x$2#ge;print'

Re: Junk NOT words
by Anonymous Monk on Nov 01, 2002 at 04:58 UTC
    How about you index the dictionary file and then work your way through the string character by character, matching against the index. When the next letter results in no further branches in the index, take that as a word; if the next word results in a dead end, try the previous word again minus one character.
    (excuse me for not actually checking against a dictionary for obscure words)
    r=dead end
    "where" removed
    a=dead end
    "angels" removed
    l = dead end
    "area" removed
    not making sense, backup
    stop at "a"
    "are" removed
    t=dead end
    "all" removed
    ugh... hope you get the idea.
      The problem with this algorithm is that it has a VERY bad worst-case performance: it's in O(2^n), where n is the length of the string. Meaning that as the strings get larger, the problem becomes impractical to solve by exhaustive search. Some sort of heuristic is needed.
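      One standard way to tame that exponential blow-up, not spelled out in the thread, is memoization: remember for each starting position whether the rest of the string can be segmented, so each position is solved at most once and the worst case drops to roughly O(n^2) dictionary lookups. A minimal sketch, with a tiny illustrative word list standing in for /usr/share/dict/words:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Memoized recursive word-splitting. %memo maps a starting position to
# either an arrayref of words covering the rest of the string, or 0
# for "no valid split from here". Reset %memo between strings.
my %WORDS = map { $_ => 1 } qw(where ranges are all the rage a an he her);

my %memo;
sub split_words {
    my ($str, $pos) = @_;
    $pos //= 0;
    return [] if $pos == length $str;       # consumed the whole string
    return $memo{$pos} if exists $memo{$pos};
    for my $len (1 .. length($str) - $pos) {
        my $word = substr $str, $pos, $len;
        next unless $WORDS{$word};
        if (my $rest = split_words($str, $pos + $len)) {
            return $memo{$pos} = [$word, @$rest];
        }
    }
    return $memo{$pos} = 0;                 # dead end, cached too
}

my $split = split_words('whererangesarealltherage');
print $split ? "@$split\n" : "no split\n";
```

      Caching the dead ends is what matters: the naive version re-explores the same failing suffixes over and over, which is exactly where the 2^n comes from.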

Node Type: perlquestion [id://209141]
Approved by Tanalis
Front-paged by jarich