http://www.perlmonks.org?node_id=336331

ehdonhon has asked for the wisdom of the Perl Monks concerning the following question:

I have a lot of strings which are potentially nonsense, but might contain a few English words. For example:

owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejf
contains the words "apple", "cherry" and "blossom".

My job is to score each string so that strings that are composed entirely of words get a better score than strings that are not. Strings with the fewest number of words will also get a higher score than strings with many words.

Is this a job for recursive descent parsing, or is there a faster way of doing this? For this task, I am willing to trade a little accuracy for speed. I'm trying to think back to my cryptography class to remember whether there was any kind of algorithm to decide if a string was likely to be an English word.

My dictionary is going to consist of about 250,000 words. I can store them on the disk in any fashion I want. Is there any kind of format (perhaps some db file format) that would lend itself well to the kind of lookups that I am going to need to be doing?

I see there is a Text::ExtractWords module on CPAN, but it doesn't seem to be actually implemented. There is also Games::TrackWord, but it wants me to loop through every single dictionary word to try to find it. That would take too long. Is there a CPAN module that would be a good fit for me?

Thanks

Replies are listed 'Best First'.
Re: Finding dictionary words in a string.
by Caron (Friar) on Mar 13, 2004 at 06:58 UTC
    contains the words "apple", "cherry" and "blossom".

    Actually, there is much more ...

    #!/usr/bin/perl -w
    use strict;

    my $search_word = shift or die "search word required\n";
    my $count = 0;

    open WORDS, "/usr/share/dict/words" or die "can't open words file\n";
    while (<WORDS>) {
        chomp;
        printf "%3d %s\n", ++$count, $_ if $search_word =~ /$_/i;
    }
    close WORDS;

    And here is the result:

    $ perl wsearch.pl owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejf
      1 Al
      2 apple
      3 as
      4 ask
      5 blossom
      6 cherry
      7 err
      8 he
      9 her
     10 Herr
     11 Io
     12 Los
     13 loss
     14 mow
     15 of
     16 so
     17 we
    

    I know it's brute force, but it takes less than half a second to give this result.

    Consider that the Perl regex engine uses a Boyer-Moore-style search for fixed substrings, which is probably among the fastest approaches available for text matching.

      This is a great little program!

      I couldn't resist speeding it up a bit. A commonly used trick in search engines is to shift everything to upper case.

      Avoiding the i flag in the regular expression roughly doubles the speed, on my machine at least.

      I also used Storable to create a slightly faster-loading dictionary.

      #!/usr/bin/perl -w
      use strict;
      use Storable;

      my $search_word = uc shift or die "search word required\n";
      my @dict;
      my $rdict;
      my $count = 0;

      if (not -e 'words.sto') {
          open WORDS, '/usr/share/dict/words' or die "can't open words file\n";
          while (<WORDS>) {
              chomp;
              push @dict, uc $_;
          }
          close WORDS;
          $rdict = \@dict;
          store $rdict, 'words.sto';
      }
      else {
          $rdict = retrieve('words.sto');
      }

      for (@$rdict) {
          printf "%3d %s\n", ++$count, $_ if $search_word =~ /$_/;
      }
      It should work perfectly the first time! - toma

        I can't figure out how you got that speed improvement. The time you save by removing the /i flag is lost when you build a large array in memory (45,000 words on my machine).

        Besides, do you really think an all-uppercase output is better?

        $ time perl wsearch.pl owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejf
          1 Al
          2 apple
          3 as
          4 ask
          5 blossom
          6 cherry
          7 err
          8 he
          9 her
         10 Herr
         11 Io
         12 Los
         13 loss
         14 mow
         15 of
         16 so
         17 we
        0.30user 0.00system 0:00.30elapsed 99%CPU 
        
        $ time perl wsearch2.pl owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejf
          1 AL
          2 APPLE
          3 AS
          4 ASK
          5 BLOSSOM
          6 CHERRY
          7 ERR
          8 HE
          9 HER
         10 HERR
         11 IO
         12 LOS
         13 LOSS
         14 MOW
         15 OF
         16 SO
         17 WE
        0.48user 0.00system 0:00.48elapsed 99%CPU 
        

        Of course, I ran both scripts at least 4 times, to make sure that the storable file was created and that both scripts were reading their input from the disk cache.

        Here are the timings for my machine. With this type of program, the results will depend on the relative performance of the hard disk, the CPU, the memory, and the caches.

        Even on a single machine, the results depend on the operating system kernel and, most importantly, on the build of perl. My vendor perl is an unusual build, and my own build is more tuned to my machine.

        The kernel I have been running, 2.4.22-6.ll.rh80, is an experimental low-latency kernel used for real-time applications. The file system is ext3.

        Timing, seconds                 Kernel 2.4.22-6.ll.rh80   Kernel 2.4.18-14
        Vendor perl, loop                        2.938                 2.569
        My perl, loop                            1.957                 1.677
        Vendor perl, Storable, no /i             1.267                 0.966
        My perl, Storable, no /i                 0.687                 0.567

        If you prefer lower case to make the output look better, just use the lc function instead of the uc function.

        I think the lesson is that performance improvements are often difficult to generalize, so you have to try it on your own system and see what works.

        UPDATE: The Code Smarter suggestion is very good. For the Storable code on kernel 2.4.22-6.ll.rh80, timings with the vendor perl are down to about 0.21 seconds, and with my perl to about 0.13 seconds, too fast for my primitive benchmark.

        It should work perfectly the first time! - toma
Re: Finding dictionary words in a string.
by Corion (Patriarch) on Mar 13, 2004 at 09:06 UTC

    To expand on the idea of the Boyer-Moore search, I think the fastest way to find if your words are contained in a string is a trade off in startup time versus runtime, as the startup cost can be cached.

    The idea is to create a (huge) state machine from your dictionary that outputs a word after it's been found, while inching through your string. I'm not sure the idea is feasible, as the amount of memory for storing the machine isn't negligible for large dictionaries. I currently think of using hashes for the word tree, but possibly arrays are faster. The fastest would be to build a huge string out of the arrays and then use a tight C loop to run over it; demerphq did something similar (longest prefix matching, I think). An example tree could look like this:

    # a negative number indicates how many chars you just matched
    a -> p -> p -> l -> e -> -5
    b -> a -> r -> -3
      -> l -> o -> s -> s -> o -> m -> -7
           -> u -> s -> e -> -6
    u -> s -> e -> -3

    The idea now is to inch through the string, keeping track of the "running" words, advancing them all down through the tree, and adding a new "running word" for the current character, starting at the top. If you don't find the current char in the tree at the current position, there was no word. If you end up at a negative number, your word started that many characters back.

    Building that tree might be time consuming, but you can store it after creation. Walking that tree should be fairly fast - you might consider doing that in C if speed becomes the central issue.

    Update: After thinking a short bit, and writing code a short bit, here is my try at it. I didn't optimize the code, so the matching loop might still bear potential for optimization, for example the cleaning out of non-matching words, and the calculation of the matched string.

    Update 2: After thinking about it a bit more, I realized this only finds the longest word starting at each character, and could miss bee if there are bee and beetle in the dictionary but only beetl in the string. I've updated the code to catch such instances as well; it now outputs all matches, even submatches, such as bee in beetle.

    #!/usr/bin/perl -w
    use strict;
    use Data::Dumper;

    my $dict = {};
    while (<DATA>) {
        chomp;
        insert($_);
    };

    sub insert {
        my @chars = split '', shift;
        my $len   = - scalar @chars;
        my $curr  = $dict;
        while (@chars) {
            my $char = shift @chars;
            if (! exists $curr->{$char}) { $curr->{$char} = {} };
            $curr = $curr->{$char};
        };
        $curr->{''} = $len;
    };

    # print Dumper $dict;

    my $str = 'owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejfbeetl';

    my $pos = 0;
    my @running;
    while ($pos < length $str) {
        my $char = substr( $str, $pos, 1 );
        push @running, $dict;
        for my $word (@running) {
            if (exists $word->{''}) {
                print "Found ", substr( $str, $pos + $word->{''}, - $word->{''} ), "\n";
            };
            if (exists $word->{$char}) {
                $word = $word->{$char};
            } else {
                undef $word;
            };
        };
        @running = grep { defined } @running;
        $pos++;
    };
    __DATA__
    apple
    blossom
    cherry
    blouse
    bar
    bee
    beetle
Re: Finding dictionary words in a string.
by kvale (Monsignor) on Mar 13, 2004 at 04:35 UTC
    Instead of testing the string against every dictionary word, it would be better to search along the string for words that fit the current pattern. Suppose you have some minimum word length n among your dictionary words. Then at each position in the string, take the substring of length n and binary-search your sorted word list for that pattern; for N dictionary words, that is an O(log N) operation. Then check each of the possible words against the rest of the string. For a reasonable n, this will cut down search time drastically.
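    Here is a minimal sketch of that prefix idea. The dictionary, the helper name `find_words_at`, and the choice of n = 3 are all illustrative; a tiny inline word list stands in for the real 250,000-word file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tiny inline dictionary standing in for the real word list.
# It must be sorted for the binary search to work.
my @dict = sort qw(apple apricot blossom cheese cherry);
my $n = 3;    # minimum word length

# Return all dictionary words that start at $pos in $str.
sub find_words_at {
    my ($str, $pos) = @_;
    my $prefix = substr $str, $pos, $n;

    # Binary search for the first word that sorts >= the prefix.
    my ($lo, $hi) = (0, scalar @dict);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($dict[$mid] lt $prefix) { $lo = $mid + 1 } else { $hi = $mid }
    }

    # Only words sharing the n-char prefix need a full comparison.
    my @found;
    for (my $i = $lo; $i < @dict && substr($dict[$i], 0, $n) eq $prefix; $i++) {
        push @found, $dict[$i]
            if substr($str, $pos, length $dict[$i]) eq $dict[$i];
    }
    return @found;
}

my $str = 'owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejf';
for my $pos (0 .. length($str) - $n) {
    print "$_ found at $pos\n" for find_words_at($str, $pos);
}
```

    With the sample string this finds apple, cherry, and blossom while comparing each position against only the handful of words that share its three-letter prefix.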

    This basic idea, carried to its logical conclusion, forms the basis for using a trie data structure to find words fast. There is a module Tree::Trie that implements this:

    use Tree::Trie;
    use strict;

    my($trie) = new Tree::Trie;
    $trie->add(qw[aeode calliope clio erato euterpe melete melpomene
                  mneme polymnia terpsichore thalia urania]);
    my(@all) = $trie->lookup("");
    my(@ms)  = $trie->lookup("m");
    $" = "--";
    print "All muses: @all\nMuses beginning with 'm': @ms\n";
    my(@deleted) = $trie->remove(qw[calliope thalia doc]);
    print "Deleted muses: @deleted\n";
    A trie consisting of 250,000 words will take up a good deal of space, but even a truncated trie will speed things up.

    -Mark

Re: Finding dictionary words in a string.
by tachyon (Chancellor) on Mar 13, 2004 at 14:05 UTC

    Text::ExtractWords is implemented. I presume you made this statement because the .pm is virtually empty and only contains stuff about Dynaloader and bootstrap. The code itself is written in C, in ExtractWords.xs. You will need a C compiler to install it.

    cheers

    tachyon

      I made the statement because, according to http://search.cpan.org, the .pm file is the only file in the distro, and the whole package is at version 0.08.

        That is simply incorrect. You really need to learn how to navigate CPAN better before making such erroneous statements.

        http://search.cpan.org/~hdias/Text-ExtractWords-0.08/ - have a close look.

        The version is 0.08, with an update time of 13 Oct 2003. Look at the list of special files: as usual you will see Changes, MANIFEST, Makefile.PL, and README. You then see a list of .pm files, in this case just one. The .xs and C files are never listed on this page, AFAIK.

        If you click the Browse link near top right next to the date you will see all the files in the distro including the /t dir and C code files like .h .c .xs and anything else included in the distro. Browse simply takes you to a dir where the distro you get from download was unpacked for your viewing pleasure.

        If you want to see the module's history, read the Changes file. Version numbers per se depend entirely on the author's whim. I start at 0.01 and just keep moving up; I don't think anything I have written has gotten to 1.x yet. Other authors start at 1.0 for the initial release, so there is little value in the numbering. In the case of this module, it was released in May last year and has had a series of updates.

        cheers

        tachyon

        I think this is the third time in the last couple weeks I've seen a negative reference about a module with a version < 1.

        It is very common for modules to start with version 0.01. If you are looking for an indicator of stability/maturity, the version number is not the place to look. Read the pod, the changes, the code; ignore the version.

        Caveat: a beta version is indicated by an underscore in the version string, e.g. Foo-Bar-0.01_01.

Re: Finding dictionary words in a string.
by TilRMan (Friar) on Mar 13, 2004 at 05:58 UTC

    How fast is a naive solution for you?

    #!/usr/bin/perl -w
    use strict;

    open WORDS, "/usr/share/dict/words" or die;
    my @words = <WORDS>;
    chop @words;
    close WORDS;
    printf "Loaded %d words.\n", scalar @words;

    # Favor long words
    @words = sort { length $b <=> length $a } @words;
    my $re = join '|', map quotemeta, @words;

    while (<>) {
        study;    # ??
        1 while s/$re/[*]/io;
        print;
    }

    -- 
    LP^>

      Okay, then, how fast is a not-so-naive solution for you? It's still not a real algorithm for finding likely English words, though. (Seems like there should be one out there, but I've never seen it.) So I improvise. This one generates a list of letter triples from the word list and finds those first. That speeds things up considerably.

      One thing that is bothering me is that you said you need to score each string by how well the words fit together. That sounds like a hairy problem if there are lots of ways to divide a string into words, each with different leftover garbage letters, different word lengths, etc. Whatever method you use to find the words will probably need to know something about how you score the strings so that it can look for the best fit.

      my @words = #...;
      my %triples;
      foreach (@words) {
          for (my $i = 0; $i < length($_) - 2; $i++) {
              push @{$triples{substr $_, $i, 3} ||= []}, $_;
          }
      }

      # Keep most popular triples
      my @triples = grep { @{$triples{$_}} > 10 } keys %triples;

      my %lensum;
      foreach my $t (@triples) {
          foreach my $w (@{$triples{$t}}) {
              $lensum{$t} += length $w;
          }
      }
      @triples = sort { $lensum{$b} <=> $lensum{$a} } @triples;
      my $re = join '|', map quotemeta, @triples;

      while (<>) {
          print;
          my @hopes;
          while (/$re/gio) {
              push @hopes, @{$triples{$&}};
          }
          my $word = join '|', map quotemeta,
                     sort { length($b) <=> length($a) } @hopes;
          1 while s/$word/[*]/i;
          print;
      }

      -- 
      LP^>

Re: Finding dictionary words in a string.
by Vautrin (Hermit) on Mar 13, 2004 at 05:57 UTC

    If you want speed, perhaps you should search for syllables or groups of several letters. For instance, the string "tccvssqqqlllxxccvv" is unlikely to contain any English words because it's all consonants. You could have a list of exceptions to search for (for instance, "cvs"), which would be much smaller (and thus much quicker) than using the entire dictionary. You could also extend the concept to vowels (i.e. "aeuaeey" can contain no words longer than one letter because it is all vowels).

    It's definitely possible to process your dictionary by running it through a program that chops it up into syllables. (And by syllables I mean combinations of n letters: the more letters you use, the more accurate, and the slower, the algorithm will be.) Add to that a few common-sense rules, like the one about a long run of consonants probably not containing a word, and it should work.
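    A rough sketch of that consonant-run filter might look like the following. The run-length threshold (5), the exception list, and the helper name `looks_wordless` are illustrative guesses, not tuned values.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Quick rejection filter: a long run of consonants almost never hides
# an English word.  'y' is deliberately left out of the consonant
# class and treated as a vowel.  The exception list is hypothetical.
my @exceptions = qw(ngths);    # whole runs to tolerate anyway

sub looks_wordless {
    my $str = lc shift;
    while ($str =~ /([bcdfghjklmnpqrstvwxz]{5,})/g) {
        my $run = $1;
        return 1 unless grep { $run eq $_ } @exceptions;
    }
    return 0;
}

print looks_wordless("tccvssqqqlllxxccvv") ? "reject\n" : "keep\n";
print looks_wordless("owijfwapple")        ? "reject\n" : "keep\n";
```

    A filter like this costs one regex pass per string, so it can throw out hopeless candidates before the expensive dictionary search ever runs.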


    Want to support the EFF and FSF by buying cool stuff? Click here.
Re: Finding dictionary words in a string.
by Anonymous Monk on Mar 13, 2004 at 04:32 UTC

    My first stab:

    #!c:/perl/bin/perl -w
    use strict;

    my %index;
    open my $fh, '<', 'dict.dat' or die "open failed: $!";
    while (<$fh>) {
        chomp;
        next if length($_) < 3;
        push @{$index{ lc(substr($_, 0, 3)) }}, $_;
    }
    close $fh;
    #print join(', ', map { @$_ } $index{'rad'}), $/;

    my $str = 'owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejf';
    for my $i (0 .. length($str)-3) {
        for my $j (3 .. length($str)-$i) {
            my $substr = substr($str, $i, $j);
            my $index  = substr($substr, 0, 3);
            next unless exists($index{$index});
            for (map { @$_ } $index{$index}) {
                print $_, $/ if $_ eq $substr;
            }
        }
    }
Re: Finding dictionary words in a string.
by graff (Chancellor) on Mar 13, 2004 at 18:40 UTC
    The scoring you describe seems like the more interesting part, and could guide the choice of a search method, but there's not enough detail in the OP to figure it out.

    In particular, there needs to be some sort of trade-off between the two scoring criteria: "strings that are composed entirely of words get a better score" is somewhat at odds with "strings with the fewest number of words will get a higher score". Are you talking about solutions that involve just non-overlapping words? If so, then you would need to compare various possible parses of the input string. How would you score a string like "labsolve"? Does "lab, solve" score better or worse than "absolve, one letter left unused"?

    In either case, it might be helpful to have your dictionary stored as an array of hashes; the index into the array is the length of words in that hash. (The hashes themselves could be sub-indexed by first letter, perhaps, or first two or three letters, as suggested by the AM in the first reply.) This way, you know before starting on the input string what the longest substring is that you need to match against, and for each substring that you test, you only need to compare it to a limited number of dictionary entries.
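    A minimal sketch of that layout (with a hypothetical toy dictionary and a helper name, `words_in`, invented for illustration): an array indexed by word length, each element a hash of the words of that length.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# $by_len[$n] holds a hash of all dictionary words of length $n, so a
# candidate substring costs one hash lookup, and $max_len bounds the
# longest substring we ever need to test.  Toy dictionary standing in
# for the real 250,000-word file.
my @dict = qw(lab solve absolve apple cherry blossom);
my (@by_len, $max_len);
$max_len = 0;
for my $w (@dict) {
    $by_len[length $w]{$w} = 1;
    $max_len = length $w if length $w > $max_len;
}

# Return [word, position] pairs for every dictionary word in $str.
sub words_in {
    my $str = shift;
    my @hits;
    for my $i (0 .. length($str) - 1) {
        my $limit = length($str) - $i;
        $limit = $max_len if $limit > $max_len;
        for my $len (1 .. $limit) {
            my $cand = substr $str, $i, $len;
            push @hits, [$cand, $i] if $by_len[$len] && $by_len[$len]{$cand};
        }
    }
    return @hits;
}

print "$_->[0] at $_->[1]\n" for words_in('labsolve');
```

    On "labsolve" this reports the overlapping parses lab, absolve, and solve, which is exactly the raw material the scoring step would need to compare.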

      What I meant was that the entire string "Hello" gets a better score than "HelloHowAreYou" because there are fewer words. But either of those gets a better score than "oijwfHellowoifef", for example.

      There needs to be some tradeoff, which I'm not quite sure how to judge at the moment. For example, "ThisIsAStringThatGoesOnAndOnAndOnForever" and "perlxmonks" would probably have nearly equivalent scores, because the first has no junk characters but a high word count, while the second has only a few junk characters but a low word count.

        There is not really a trade-off required. Given the task, which is to look for 'good' URLs that match as closely as possible to 1 or more words, you want to do something like this (pseudocode):

        # get the domain part, dropping www. and passing back the
        # domain and tld (ie .com .net) or sld.tld (ie co.uk, com.au)
        my ($domain, $tld) = get_domain( $url );

        # chop domain into all possible substrings, say 3-16 chars long; return ary ref
        # there are very few valid well known words > 16 chars, virtually none > 24 chars
        my $tokens = tokenize( $domain );

        # get the possible words ordered by length(word) DESC, ie longest first
        # use a hash lookup or an RDBMS with a dynamically generated SQL IN clause
        my $words = get_real_words_from_dict( $tokens );

        # substitute out the words, longest first, so we avoid
        # finding substrings like 'be' in 'beetle'
        my $residual = $domain;
        my @words = ();
        for my $word ( @$words ) {
            # we may have duplicates of the same word
            push @words, $word while $residual =~ s/\Q$word\E//;
        }

        # remove '-' from residual so 'come-here' will be two words, no residual
        $residual =~ s/-//g;

        # work out % residual
        $residual = 100 * length($residual) / length($domain);

        # So now we have our data:
        #   @words    - the 0..N words found
        #   $residual - the % residual on that domain name
        #   $tld      - the top level domain

        # say we inserted into a DB table like:
        CREATE TABLE urls (
            url      CHAR(75),
            words    INT,
            residual FLOAT,
            tld      CHAR(10)
        );

        "INSERT INTO urls (url, words, residual, tld) VALUES (?,?,?,?)",
            $url, scalar @words, $residual, $tld;

        You can now generate reports. Essentially you want something like:

        SELECT url FROM urls
        WHERE words >= 1
        GROUP BY tld
        ORDER BY words ASC, residual ASC

        This does not apply limits or add a bias for, say, a preference for .com domain names. It will output URLs with a single word, lowest to highest residual, first, then two words, and so on. Given what you want, if the residual is > 10-20% you can probably just ignore those URLs and not insert them.

        cheers

        tachyon

Re: Finding dictionary words in a string.
by bart (Canon) on Mar 16, 2004 at 20:37 UTC
    I have some serious doubts about the scale of the operation, but perhaps you could try applying Regex::PreSuf to the dictionary words and matching the string using the constructed regex. It looks fine in theory, except probably for memory constraints, and perhaps more stringent limits on the size of the compiled regex.