Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Re: Finding dictionary words in a string.

by Corion (Patriarch)
on Mar 13, 2004 at 09:06 UTC ( [id://336346]=note: print w/replies, xml ) Need Help??


in reply to Finding dictionary words in a string.

To expand on the idea of the Boyer-Moore search, I think the fastest way to find if your words are contained in a string is a trade off in startup time versus runtime, as the startup cost can be cached.

The idea is to create a (huge) state machine from your dictionary that outputs a word after it's been found, while inching through your string. I'm not sure if the idea is feasible, as the amount of memory for storing the machine isn't negligible for large dictionaries. I currently think of using hashes for the word-tree, but possibly arrays are faster. The fastest would be to build a huge string out of the arrays and then use a tight C loop to run over it - demperhq did something similar (longest prefix matching I think). An example tree could be like this:

# a negative number indicates how many chars you just matched a -> p -> p -> l -> e -> -5 b -> a -> r -> -3 -> l -> o -> s -> s -> o -> m -> -7 -> u -> s -> e -> -6 u -> s -> e -> -3

The idea now is to inch through the string, keeping track of the "running" words, advancing them all down trough the tree, and adding a new "running word" for the current character, starting at the top. If you don't find the current char in the tree at the current position, there was no word. If you end up at a negative number, your word started that many characters away.

Building that tree might be time consuming, but you can store it after creation. Walking that tree should be fairly fast - you might consider doing that in C if speed becomes the central issue.

Update: After thinking a short bit, and writing code a short bit, here is my try at it. I didn't optimize the code, so the matching loop might still bear potential for optimization, for example the cleaning out of non-matching words, and the calculation of the matched string.

Update 2: After thinking about it shortly, this only finds the longest word starting at each character, and could miss out on bee, if there are bee, beetle in the dictionary and only beetl in the string. I've updated the code to catch such instances as well - now it outputs all matches, even submatches, such as bee in beetle.

#!/usr/bin/perl -w use strict; use Data::Dumper; my $dict = {}; while (<DATA>) { chomp; insert($_); }; sub insert { my @chars = split '', shift; my $len = - scalar @chars; my $curr = $dict; while (@chars) { my $char = shift @chars; if (! exists $curr->{$char}) { $curr->{$char} = {} }; $curr = $curr->{$char}; }; $curr->{''} = $len; }; # print Dumper $dict; my $str = 'owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejfbeetl'; my $pos = 0; my @running; while ($pos < length $str) { my $char = substr( $str, $pos, 1 ); push @running, $dict; for my $word (@running) { if (exists $word->{''}) { print "Found ", substr( $str, $pos + $word->{''}, - $word->{''} +), "\n"; }; if (exists $word->{$char}) { $word = $word->{$char}; } else { undef $word; }; }; @running = grep { defined } @running; $pos++; }; __DATA__ apple blossom cherry blouse bar bee beetle

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://336346]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (6)
As of 2024-03-29 00:09 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found