Perl-Sensitive Sunglasses | |
PerlMonks |
Re: Finding dictionary words in a string.by Corion (Patriarch) |
on Mar 13, 2004 at 09:06 UTC ( [id://336346]=note: print w/replies, xml ) | Need Help?? |
To expand on the idea of the Boyer-Moore search, I think the fastest way to find if your words are contained in a string is a trade off in startup time versus runtime, as the startup cost can be cached. The idea is to create a (huge) state machine from your dictionary that outputs a word after it's been found, while inching through your string. I'm not sure if the idea is feasible, as the amount of memory for storing the machine isn't negligible for large dictionaries. I currently think of using hashes for the word-tree, but possibly arrays are faster. The fastest would be to build a huge string out of the arrays and then use a tight C loop to run over it - demperhq did something similar (longest prefix matching I think). An example tree could be like this:
The idea now is to inch through the string, keeping track of the "running" words, advancing them all down trough the tree, and adding a new "running word" for the current character, starting at the top. If you don't find the current char in the tree at the current position, there was no word. If you end up at a negative number, your word started that many characters away. Building that tree might be time consuming, but you can store it after creation. Walking that tree should be fairly fast - you might consider doing that in C if speed becomes the central issue. Update: After thinking a short bit, and writing code a short bit, here is my try at it. I didn't optimize the code, so the matching loop might still bear potential for optimization, for example the cleaning out of non-matching words, and the calculation of the matched string. Update 2: After thinking about it shortly, this only finds the longest word starting at each character, and could miss out on bee, if there are bee, beetle in the dictionary and only beetl in the string. I've updated the code to catch such instances as well - now it outputs all matches, even submatches, such as bee in beetle.
In Section
Seekers of Perl Wisdom
|
|