http://www.perlmonks.org?node_id=336331

ehdonhon has asked for the wisdom of the Perl Monks concerning the following question:

I have a lot of strings which are potentially nonsense, but might contain a few English words. For example:

owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejf
contains the words "apple", "cherry" and "blossom".

My job is to score each string so that strings that are composed entirely of words get a better score than strings that are not. Strings with the fewest number of words will also get a higher score than strings with many words.

Is this a job for recursive descent parsing, or is there a faster way of doing this? For this task, I am willing to trade a little accuracy for speed. I'm trying to think back to my cryptography class to remember whether there was any kind of algorithm to decide if a string was likely to be an English word.

My dictionary is going to consist of about 250,000 words. I can store them on the disk in any fashion I want. Is there any kind of format (perhaps some db file format) that would lend itself well to the kind of lookups that I am going to need to be doing?

I see there is a Text::ExtractWords module on CPAN, but it doesn't seem to be actually implemented. There is also Games::TrackWord, but it wants me to loop through every single dictionary word to try to find it. That would take too long. Is there a CPAN module that would be a good fit for me?

Thanks

Replies are listed 'Best First'.
Re: Finding dictionary words in a string.
by Caron (Friar) on Mar 13, 2004 at 06:58 UTC
    contains the words "apple", "cherry" and "blossom".

    Actually, there is much more ...

    #!/usr/bin/perl -w
    use strict;

    my $search_word = shift or die "search word required\n";
    my $count = 0;

    open WORDS, "/usr/share/dict/words" or die "can't open words file\n";
    while (<WORDS>) {
        chomp;
        printf "%3d %s\n", ++$count, $_ if $search_word =~ /$_/i;
    }
    close WORDS;

    And here is the result:

    $ perl wsearch.pl owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejf
      1 Al
      2 apple
      3 as
      4 ask
      5 blossom
      6 cherry
      7 err
      8 he
      9 her
     10 Herr
     11 Io
     12 Los
     13 loss
     14 mow
     15 of
     16 so
     17 we
    

    I know it's brute force, but it takes less than half a second to give this result.

    Consider that the Perl regex engine uses a Boyer-Moore-style search for fixed substrings, which is probably among the fastest approaches available for text matching.

      This is a great little program!

      I couldn't resist speeding it up a bit. A commonly used trick in search engines is to shift everything to upper case.

      Avoiding the i flag in the regular expression roughly doubles the speed, on my machine at least.

      I also used Storable to create a slightly faster-loading dictionary.

      #!/usr/bin/perl -w
      use strict;
      use Storable;

      my $search_word = uc shift or die "search word required\n";
      my @dict;
      my $rdict;
      my $count = 0;

      if (not -e 'words.sto') {
          open WORDS, '/usr/share/dict/words' or die "can't open words file\n";
          while (<WORDS>) {
              chomp;
              push @dict, uc $_;
          }
          close WORDS;
          $rdict = \@dict;
          store $rdict, 'words.sto';
      }
      else {
          $rdict = retrieve('words.sto');
      }

      for (@$rdict) {
          printf "%3d %s\n", ++$count, $_ if $search_word =~ /$_/;
      }
      It should work perfectly the first time! - toma

        I can't figure out how you got that speed improvement. The time you save by removing the /i flag is lost when you build a large array in memory (45,000 words on my machine).

        Besides, do you really think an all-uppercase output is better?

        $ time perl wsearch.pl owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejf
          1 Al
          2 apple
          3 as
          4 ask
          5 blossom
          6 cherry
          7 err
          8 he
          9 her
         10 Herr
         11 Io
         12 Los
         13 loss
         14 mow
         15 of
         16 so
         17 we
        0.30user 0.00system 0:00.30elapsed 99%CPU 
        
        $ time perl wsearch2.pl owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejf
          1 AL
          2 APPLE
          3 AS
          4 ASK
          5 BLOSSOM
          6 CHERRY
          7 ERR
          8 HE
          9 HER
         10 HERR
         11 IO
         12 LOS
         13 LOSS
         14 MOW
         15 OF
         16 SO
         17 WE
        0.48user 0.00system 0:00.48elapsed 99%CPU 
        

        Of course, I ran both scripts at least 4 times, to make sure that the storable file was created and that both scripts were reading their input from the disk cache.

        Here are the timings for my machine. With this type of program, the results will depend on the relative performance of the hard disk, the CPU, the memory, and the caches.

        Even on a single machine, the results depend on the operating system kernel and, most importantly, on the build of perl. My vendor perl is an unusual build, and my own build is more tuned to my machine.

        The kernel I have been running, 2.4.22-6.ll.rh80, is an experimental low-latency kernel used for real-time applications. The file system is ext3.

        Timing, seconds                 Kernel 2.4.22-6.ll.rh80   Kernel 2.4.18-14
        Vendor perl, loop                        2.938                 2.569
        My perl, loop                            1.957                 1.677
        Vendor perl, Storable, no /i             1.267                 0.966
        My perl, Storable, no /i                 0.687                 0.567

        If you prefer lower case to make the output look better, just use the lc function instead of the uc function.

        I think the lesson is that performance improvements are often difficult to generalize, so you have to try it on your own system and see what works.

        UPDATE: The Code Smarter suggestion is very good. For the Storable code on kernel 2.4.22-6.ll.rh80, timings with the vendor perl are down to about 0.21 seconds, and with my perl to about 0.13 seconds, too fast for my primitive benchmark.

        It should work perfectly the first time! - toma
Re: Finding dictionary words in a string.
by Corion (Patriarch) on Mar 13, 2004 at 09:06 UTC

    To expand on the idea of the Boyer-Moore search, I think the fastest way to find if your words are contained in a string is a trade off in startup time versus runtime, as the startup cost can be cached.

    The idea is to create a (huge) state machine from your dictionary that outputs a word after it's been found, while inching through your string. I'm not sure the idea is feasible, as the amount of memory for storing the machine isn't negligible for large dictionaries. I currently think of using hashes for the word tree, but possibly arrays are faster. The fastest would be to build a huge string out of the arrays and then use a tight C loop to run over it; demerphq did something similar (longest prefix matching, I think). An example tree could look like this:

    # a negative number indicates how many chars you just matched
    a -> p -> p -> l -> e -> -5
    b -> a -> r -> -3
      -> l -> o -> s -> s -> o -> m -> -7
           -> u -> s -> e -> -6
    u -> s -> e -> -3

    The idea now is to inch through the string, keeping track of the "running" words, advancing them all down through the tree, and adding a new "running word" for the current character, starting at the top. If you don't find the current char in the tree at the current position, there was no word. If you end up at a negative number, your word started that many characters back.

    Building that tree might be time consuming, but you can store it after creation. Walking that tree should be fairly fast - you might consider doing that in C if speed becomes the central issue.

    Update: After thinking a short bit, and writing code a short bit, here is my try at it. I didn't optimize the code, so the matching loop might still bear potential for optimization, for example the cleaning out of non-matching words, and the calculation of the matched string.

    Update 2: After thinking about it a bit more, I realized this only finds the longest word starting at each character, and could miss bee if there are bee and beetle in the dictionary but only beetl in the string. I've updated the code to catch such instances as well; it now outputs all matches, even submatches, such as bee in beetle.

    #!/usr/bin/perl -w
    use strict;
    use Data::Dumper;

    my $dict = {};
    while (<DATA>) {
        chomp;
        insert($_);
    };

    sub insert {
        my @chars = split '', shift;
        my $len   = - scalar @chars;
        my $curr  = $dict;
        while (@chars) {
            my $char = shift @chars;
            if (! exists $curr->{$char}) { $curr->{$char} = {} };
            $curr = $curr->{$char};
        };
        $curr->{''} = $len;
    };

    # print Dumper $dict;

    my $str = 'owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejfbeetl';

    my $pos = 0;
    my @running;
    while ($pos < length $str) {
        my $char = substr( $str, $pos, 1 );
        push @running, $dict;
        for my $word (@running) {
            if (exists $word->{''}) {
                print "Found ", substr( $str, $pos + $word->{''}, - $word->{''} ), "\n";
            };
            if (exists $word->{$char}) {
                $word = $word->{$char};
            } else {
                undef $word;
            };
        };
        @running = grep { defined } @running;
        $pos++;
    };
    __DATA__
    apple
    blossom
    cherry
    blouse
    bar
    bee
    beetle
Re: Finding dictionary words in a string.
by kvale (Monsignor) on Mar 13, 2004 at 04:35 UTC
    Instead of testing the string against every dictionary word, it would be better to search along the string for words that fit the current pattern. Suppose you have some minimum word length n among your dictionary words. Then at each position in the string, take the substring of length n and binary-search your sorted word list for that pattern; for N dictionary words, that is an O(log N) operation. Then check each of the possible words against the rest of the string. For a reasonable n, this will cut down search time drastically.
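    Here is a minimal sketch of that prefix idea. The dictionary, the helper name `find_words_at`, and the choice of n = 3 are all illustrative; a tiny inline word list stands in for the real 250,000-word file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tiny inline dictionary standing in for the real word list.
# It must be sorted for the binary search to work.
my @dict = sort qw(apple apricot blossom cheese cherry);
my $n = 3;    # minimum word length

# Return all dictionary words that start at $pos in $str.
sub find_words_at {
    my ($str, $pos) = @_;
    my $prefix = substr $str, $pos, $n;

    # Binary search for the first word that sorts >= the prefix.
    my ($lo, $hi) = (0, scalar @dict);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($dict[$mid] lt $prefix) { $lo = $mid + 1 } else { $hi = $mid }
    }

    # Only words sharing the n-char prefix need a full comparison.
    my @found;
    for (my $i = $lo; $i < @dict && substr($dict[$i], 0, $n) eq $prefix; $i++) {
        push @found, $dict[$i]
            if substr($str, $pos, length $dict[$i]) eq $dict[$i];
    }
    return @found;
}

my $str = 'owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejf';
for my $pos (0 .. length($str) - $n) {
    print "$_ found at $pos\n" for find_words_at($str, $pos);
}
```

    With the sample string this finds apple, cherry, and blossom while comparing each position against only the handful of words that share its three-letter prefix.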

    This basic idea, carried to its logical conclusion, forms the basis for using a trie data structure to find words fast. There is a module Tree::Trie that implements this:

    use Tree::Trie;
    use strict;

    my($trie) = new Tree::Trie;
    $trie->add(qw[aeode calliope clio erato euterpe melete melpomene
                  mneme polymnia terpsichore thalia urania]);
    my(@all) = $trie->lookup("");
    my(@ms)  = $trie->lookup("m");
    $" = "--";
    print "All muses: @all\nMuses beginning with 'm': @ms\n";
    my(@deleted) = $trie->remove(qw[calliope thalia doc]);
    print "Deleted muses: @deleted\n";
    A trie consisting of 250,000 words will take up a good deal of space, but even a truncated trie will speed things up.

    -Mark

Re: Finding dictionary words in a string.
by tachyon (Chancellor) on Mar 13, 2004 at 14:05 UTC

    Text::ExtractWords is implemented. I presume you made this statement because the .pm is virtually empty and only contains stuff about Dynaloader and bootstrap. The code itself is written in C, in ExtractWords.xs. You will need a C compiler to install it.

    cheers

    tachyon

      I made the statement because, according to http://search.cpan.org, the .pm file is the only file in the distro, and the whole package is at version 0.08.

        That is simply incorrect. You really need to learn how to navigate CPAN better before making such erroneous statements.

        http://search.cpan.org/~hdias/Text-ExtractWords-0.08/ - have a close look.

        The version is 0.08, with an update time of 13 Oct 2003. Look at the list of special files: as usual you will see Changes, MANIFEST, Makefile.PL, and README. You then see a list of .pm files, in this case just one. The .xs and C files are never listed on this page, AFAIK.

        If you click the Browse link near top right next to the date you will see all the files in the distro including the /t dir and C code files like .h .c .xs and anything else included in the distro. Browse simply takes you to a dir where the distro you get from download was unpacked for your viewing pleasure.

        If you want to see the module's history, read the Changes file. Version numbers per se depend entirely on the author's whim. I start at 0.01 and just keep moving up; I don't think anything I have written has gotten to 1.x yet. Other authors start at 1.0 for the initial release, so there is little value in the numbering. In the case of this module, it was released in May last year and has had a series of updates.

        cheers

        tachyon

        I think this is the third time in the last couple weeks I've seen a negative reference about a module with a version < 1.

        It is very common for modules to start with version 0.01. If you are looking for an indicator of stability/maturity, the version number is not the place to look. Read the pod, the changes, the code; ignore the version.

        Caveat: a beta version is indicated by an underscore in the version string, e.g. Foo-Bar-0.01_01.

Re: Finding dictionary words in a string.
by TilRMan (Friar) on Mar 13, 2004 at 05:58 UTC

    How fast is a naive solution for you?

    #!/usr/bin/perl -w
    use strict;

    open WORDS, "/usr/share/dict/words" or die;
    my @words = <WORDS>;
    chop @words;
    close WORDS;
    printf "Loaded %d words.\n", scalar @words;

    # Favor long words
    @words = sort { length $b <=> length $a } @words;
    my $re = join '|', map quotemeta, @words;

    while (<>) {
        study;    # ??
        1 while s/$re/[*]/io;
        print;
    }

    -- 
    LP^>

      Okay, then, how fast is a not-so-naive solution for you? It's still not a real algorithm for finding likely English words, though. (Seems like there should be one out there, but I've never seen it.) So I improvise. This one generates a list of letter triples from the word list and finds those first. That speeds things up considerably.

      One thing that is bothering me is that you said you need to score each string by how well the words fit together. That sounds like a hairy problem if there are lots of ways to divide a string into words, each with different leftover garbage letters, different word lengths, etc. Whatever method you use to find the words will probably need to know something about how you score the strings so that it can look for the best fit.

      my @words = #...;
      my %triples;
      foreach (@words) {
          for (my $i = 0; $i < length($_) - 2; $i++) {
              push @{$triples{substr $_, $i, 3} ||= []}, $_;
          }
      }

      # Keep most popular triples
      my @triples = grep { @{$triples{$_}} > 10 } keys %triples;

      my %lensum;
      foreach my $t (@triples) {
          foreach my $w (@{$triples{$t}}) {
              $lensum{$t} += length $w;
          }
      }
      @triples = sort { $lensum{$b} <=> $lensum{$a} } @triples;
      my $re = join '|', map quotemeta, @triples;

      while (<>) {
          print;
          my @hopes;
          while (/$re/gio) {
              push @hopes, @{$triples{$&}};
          }
          my $word = join '|', map quotemeta,
                     sort { length($b) <=> length($a) } @hopes;
          1 while s/$word/[*]/i;
          print;
      }

      -- 
      LP^>

Re: Finding dictionary words in a string.
by Vautrin (Hermit) on Mar 13, 2004 at 05:57 UTC

    If you want speed, perhaps you should search for syllables or groups of several letters. For instance, the string "tccvssqqqlllxxccvv" is unlikely to contain any English words because it's all consonants. You could have a list of exceptions to search for (for instance, "cvs"), which would be much smaller (and thus much quicker) than using the entire dictionary. You could also extend the concept to vowels (i.e. "aeuaeey" can contain no words longer than one letter because it is all vowels).

    It's definitely possible to process your dictionary by running it through a program that chops it up into syllables. (And by syllables I mean combinations of n letters: the more letters you use, the more accurate, and the slower, the algorithm will be.) Add to that a few common-sense rules, like the one about a long run of consonants probably not containing a word, and it should work.
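    A rough sketch of that consonant-run filter might look like the following. The run-length threshold (5), the exception list, and the helper name `looks_wordless` are illustrative guesses, not tuned values.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Quick rejection filter: a long run of consonants almost never hides
# an English word.  'y' is deliberately left out of the consonant
# class and treated as a vowel.  The exception list is hypothetical.
my @exceptions = qw(ngths);    # whole runs to tolerate anyway

sub looks_wordless {
    my $str = lc shift;
    while ($str =~ /([bcdfghjklmnpqrstvwxz]{5,})/g) {
        my $run = $1;
        return 1 unless grep { $run eq $_ } @exceptions;
    }
    return 0;
}

print looks_wordless("tccvssqqqlllxxccvv") ? "reject\n" : "keep\n";
print looks_wordless("owijfwapple")        ? "reject\n" : "keep\n";
```

    A filter like this costs one regex pass per string, so it can throw out hopeless candidates before the expensive dictionary search ever runs.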


    Want to support the EFF and FSF by buying cool stuff? Click here.
Re: Finding dictionary words in a string.
by Anonymous Monk on Mar 13, 2004 at 04:32 UTC

    My first stab:

    #!c:/perl/bin/perl -w
    use strict;

    my %index;
    open my $fh, '<', 'dict.dat' or die "open failed: $!";
    while (<$fh>) {
        chomp;
        next if length($_) < 3;
        push @{$index{ lc(substr($_, 0, 3)) }}, $_;
    }
    close $fh;
    #print join(', ', map { @$_ } $index{'rad'}), $/;

    my $str = 'owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejf';
    for my $i (0 .. length($str)-3) {
        for my $j (3 .. length($str)-$i) {
            my $substr = substr($str, $i, $j);
            my $index  = substr($substr, 0, 3);
            next unless exists($index{$index});
            for (map { @$_ } $index{$index}) {
                print $_, $/ if $_ eq $substr;
            }
        }
    }
Re: Finding dictionary words in a string.
by graff (Chancellor) on Mar 13, 2004 at 18:40 UTC
    The scoring you describe seems like the more interesting part, and could guide the choice of a search method, but there's not enough detail in the OP to figure it out.

    In particular, there needs to be some sort of trade-off between the two scoring criteria: "strings that are composed entirely of words get a better score" is somewhat at odds with "strings with the fewest number of words will get a higher score". Are you talking about solutions that involve just non-overlapping words? If so, then you would need to compare various possible parses of the input string. How would you score a string like "labsolve"? Does "lab, solve" score better or worse than "absolve, one letter left unused"?

    In either case, it might be helpful to have your dictionary stored as an array of hashes; the index into the array is the length of words in that hash. (The hashes themselves could be sub-indexed by first letter, perhaps, or first two or three letters, as suggested by the AM in the first reply.) This way, you know before starting on the input string what the longest substring is that you need to match against, and for each substring that you test, you only need to compare it to a limited number of dictionary entries.
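    A minimal sketch of that layout (with a hypothetical toy dictionary and a helper name, `words_in`, invented for illustration): an array indexed by word length, each element a hash of the words of that length.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# $by_len[$n] holds a hash of all dictionary words of length $n, so a
# candidate substring costs one hash lookup, and $max_len bounds the
# longest substring we ever need to test.  Toy dictionary standing in
# for the real 250,000-word file.
my @dict = qw(lab solve absolve apple cherry blossom);
my (@by_len, $max_len);
$max_len = 0;
for my $w (@dict) {
    $by_len[length $w]{$w} = 1;
    $max_len = length $w if length $w > $max_len;
}

# Return [word, position] pairs for every dictionary word in $str.
sub words_in {
    my $str = shift;
    my @hits;
    for my $i (0 .. length($str) - 1) {
        my $limit = length($str) - $i;
        $limit = $max_len if $limit > $max_len;
        for my $len (1 .. $limit) {
            my $cand = substr $str, $i, $len;
            push @hits, [$cand, $i] if $by_len[$len] && $by_len[$len]{$cand};
        }
    }
    return @hits;
}

print "$_->[0] at $_->[1]\n" for words_in('labsolve');
```

    On "labsolve" this reports the overlapping parses lab, absolve, and solve, which is exactly the raw material the scoring step would need to compare.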

      What I meant was that the entire string "Hello" gets a better score than "HelloHowAreYou" because there are fewer words. But either of those gets a better score than "oijwfHellowoifef", for example.

      There needs to be some tradeoff, which I'm not quite sure how to judge at the moment. For example, "ThisIsAStringThatGoesOnAndOnAndOnForever" and "perlxmonks" would probably have nearly equivalent scores, because the first has no junk characters but a high word count, while the second has only a few junk characters but a low word count.

        There is not really a trade-off required. Given the task, which is to look for 'good' URLs that match as closely as possible to 1 or more words, you want to do something like this (pseudocode):

        # get the domain part, dropping www. and passing back the
        # domain and tld (ie .com .net) or sld.tld (ie co.uk, com.au)
        my ($domain, $tld) = get_domain( $url );

        # chop domain into all possible substrings, say 3-16 chars long; return ary ref
        # there are very few valid well known words > 16 chars, virtually none > 24 chars
        my $tokens = tokenize( $domain );

        # get the possible words ordered by length(word) DESC, ie longest first
        # use a hash lookup or an RDBMS with a dynamically generated SQL IN clause
        my $words = get_real_words_from_dict( $tokens );

        # substitute out the words, longest first, so we avoid
        # finding substrings like 'be' in 'beetle'
        my $residual = $domain;
        my @words = ();
        for my $word ( @$words ) {
            # we may have duplicates of the same word
            push @words, $word while $residual =~ s/\Q$word\E//;
        }

        # remove '-' from residual so 'come-here' will be two words, no residual
        $residual =~ s/-//g;

        # work out % residual
        $residual = 100 * length($residual) / length($domain);

        # So now we have our data:
        #   @words    - the 0..N words found
        #   $residual - the % residual on that domain name
        #   $tld      - the top level domain

        # say we inserted into a DB table like:
        CREATE TABLE urls (
            url      CHAR(75),
            words    INT,
            residual FLOAT,
            tld      CHAR(10)
        );

        "INSERT INTO urls (url, words, residual, tld) VALUES (?,?,?,?)",
            $url, scalar @words, $residual, $tld;

        You can now generate reports. Essentially you want something like:

        SELECT url FROM urls
        WHERE words >= 1
        GROUP BY tld
        ORDER BY words ASC, residual ASC

        This does not apply limits or add a bias for, say, a preference for .com domain names. It will output URLs with a single word, lowest to highest residual, first, then two words, and so on. Given what you want, if the residual is > 10-20% you can probably just ignore those URLs and not insert them.

        cheers

        tachyon

Re: Finding dictionary words in a string.
by bart (Canon) on Mar 16, 2004 at 20:37 UTC
    I have some serious doubts about the scale of the operation, but perhaps you could try applying Regex::PreSuf to the dictionary words and matching the string using the constructed regex. It looks fine in theory, except probably for memory constraints, and perhaps more stringent limits on the size of the compiled regex.