Re: Efficient matching with accompanying data
by LanX (Canon) on Jul 11, 2013 at 00:46 UTC
Perl versions >5.9.2 have a trie optimization within the regex engine.
That is, /(aaa|aab|aca)/ is internally optimized to something like (a(a(a|b)|ca)).
So if you organize your $lexicon so that the supplementary dictionary data is listed after each target word and delimited with something like "\0", you can search quite efficiently.
I successfully wrote a module parsing DB-dumps very efficiently like this.
Unfortunately the rights belong to my last employer, so you need to reinvent the wheel...:(
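For anyone reinventing that wheel, here is a minimal sketch of the idea. It is not the module mentioned above. The record layout (one "word\0data" record per line), the sample entries, and the name lookup_words are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical lexicon layout: one record per line, "word\0data\n".
my @records = ( [ happy => '+2' ], [ sad => '-2' ], [ angry => '-3' ] );
my $lexicon = '';
$lexicon .= "$_->[0]\0$_->[1]\n" for @records;

# Look up several words in one pass over the lexicon string.
# Takes a reference so a large lexicon is not copied on each call.
sub lookup_words {
    my ($lex_ref, @words) = @_;
    return unless @words;    # an empty alternation would match everywhere

    # Alternation of literal words; the regex engine can turn this into a trie.
    my $alt = join '|', map { quotemeta($_) } @words;

    my %found;
    # ^ anchors at the start of a record, \0 ends the word,
    # and everything up to the end of the line is the accompanying data.
    while ( $$lex_ref =~ /^($alt)\0([^\n]*)$/mg ) {
        $found{$1} = $2;
    }
    return %found;
}

my %data = lookup_words( \$lexicon, qw(sad happy missing) );
print "$_ => $data{$_}\n" for sort keys %data;
```

Because every word in the lexicon is followed directly by "\0", a short word cannot accidentally match the front of a longer one here.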
After rereading your post, I have the impression that it's your lexicon which is static while the "tweets" always change.
In this case you have to swap the logic: produce one long regex out of the phrases in your lexicon, just once, and match it against all tweets.
Take care to sort the phrases by length, longest first, because the first alternative that matches wins. This way you don't need to embed the dictionary data; just do a hash lookup with the matched word groups.
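A rough sketch of that swapped logic, under some assumptions: the lexicon is a hash from phrase to data, the tweets are already lowercased, and every phrase starts and ends with a word character so that \b works. The sample phrases and the name annotate are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical lexicon: phrase => accompanying data.
my %lexicon = (
    'new york'      => 'city',
    'new york city' => 'city (full name)',
    'york'          => 'city (UK)',
);

# Longest phrases first, so "new york city" is tried before "new york".
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) or $a cmp $b }
    keys %lexicon;

# Compile once and reuse for every tweet; \b keeps matches on word boundaries.
my $lexicon_re = qr/\b($alternation)\b/;

# Return [phrase, data] pairs for every lexicon phrase found in a tweet.
sub annotate {
    my ($tweet) = @_;
    my @hits;
    while ( $tweet =~ /$lexicon_re/g ) {
        push @hits, [ $1, $lexicon{$1} ];    # plain hash lookup for the data
    }
    return @hits;
}

print "$_->[0] => $_->[1]\n" for annotate('i love new york city and old york');
```

Without the length sort, "new york" could be tried first and would win on a tweet containing "new york city", since it is followed by a word boundary.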
( addicted to the Perl Programming Language)