Re^2: Spam filtering and regular expressions by fokat (Deacon)
on Jul 30, 2005 at 19:30 UTC
I agree with jhourcle's words:
(...) distinctions are context sensitive (...)
This is entirely true: spammers know it and exploit it to get around spam filters built this way. One approach we're looking at uses a _capped_ number of replacement sets (i.e., it performs just the 1 (one) to l (ell) translation at a time) and evaluates each result against the regular expressions.
The results we're getting with this are better than with regular expressions alone, but not spectacular. There are more knobs to turn (how many replacements to perform and evaluate, what value each match should add to the score, and where to set the threshold, for instance) in addition to the set of regexes used to detect spam-flag phrases.
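To make the knobs concrete, here is a minimal sketch of the idea, written in Python only for illustration. Everything named in it is an assumption of mine, not our actual setup: the `REPLACEMENT_SETS` table, the two `SPAM_PATTERNS`, and the values of `MAX_SETS`, `MATCH_SCORE` and `THRESHOLD` are all hypothetical.

```python
import itertools
import re

# Hypothetical replacement sets: (obfuscated character, plain letter).
# "1" appears twice because it may stand for "l" or "i" depending on context.
REPLACEMENT_SETS = [("1", "l"), ("1", "i"), ("0", "o"),
                    ("3", "e"), ("@", "a"), ("$", "s")]

# Hypothetical spam-flag patterns and tuning knobs.
SPAM_PATTERNS = [re.compile(r"\bfree\s+pills\b"), re.compile(r"\bloan\b")]
MAX_SETS = 2        # cap on replacement sets applied to one candidate
MATCH_SCORE = 1.0   # value each matching pattern adds to the score
THRESHOLD = 1.0     # score at which a message is flagged


def candidates(text, max_sets=MAX_SETS):
    """Yield the text with every combination of at most max_sets sets applied."""
    text = text.lower()
    for k in range(max_sets + 1):
        for combo in itertools.combinations(REPLACEMENT_SETS, k):
            sources = [src for src, _ in combo]
            if len(set(sources)) < len(sources):
                continue  # skip combos that translate one character two ways
            yield text.translate(str.maketrans(dict(combo)))


def spam_score(text, max_sets=MAX_SETS):
    """Add MATCH_SCORE once for each pattern that matches some candidate."""
    matched = set()
    for cand in candidates(text, max_sets):
        for idx, pattern in enumerate(SPAM_PATTERNS):
            if pattern.search(cand):
                matched.add(idx)
    return MATCH_SCORE * len(matched)


def is_spam(text, max_sets=MAX_SETS):
    return spam_score(text, max_sets) >= THRESHOLD
```

With these made-up settings, "fr33 pills" is caught with a single replacement set, while "fr33 p1ll$" needs three and slips under a cap of two. That is the trade-off the cap introduces.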
A similar approach could be implemented using (hairy, IMHO) regexes. Those regexes are likely much harder to maintain and I guess they might be more expensive than the described approach. However, no testing has been done because we do not have a satisfactory solution to benchmark against yet.
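For comparison, a regex-only variant might expand each letter of a spam phrase into a character class of its look-alikes. This is only a sketch under my own assumptions: the `VARIANTS` table and the `obfuscation_regex` helper are hypothetical, and Python is used purely for illustration.

```python
import re

# Hypothetical table: plain letter -> characters that may stand in for it.
VARIANTS = {"l": "l1|", "i": "i1!", "o": "o0",
            "e": "e3", "a": "a@4", "s": "s$5"}


def obfuscation_regex(phrase):
    """Build a regex that matches phrase with any listed look-alike per letter."""
    parts = []
    for ch in phrase.lower():
        if ch in VARIANTS:
            # Escape the alternatives so "|" and "$" are literal in the class.
            parts.append("[" + re.escape(VARIANTS[ch]) + "]")
        elif ch.isspace():
            parts.append(r"\s+")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


pattern = obfuscation_regex("free pills")
print(pattern.pattern)
print(bool(pattern.search("Get FR33 P1LL$ now")))
```

Note that a pattern built this way has no cap at all: it accepts any number of substitutions at once, so it behaves differently from the capped approach, and the generated patterns get hard to read quickly.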
Oh... and UTF is going to make for a very, very large set of glyphs.
Indeed. This is why you must cap the number of replacements performed when using this method.
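A quick back-of-the-envelope count shows why the cap matters. Assuming each candidate is the text with some combination of replacement sets applied (my simplification, and `candidate_count` is a hypothetical helper), the number of candidates to evaluate per message is:

```python
from math import comb


def candidate_count(n_sets, cap):
    """Number of ways to pick at most `cap` of `n_sets` replacement sets."""
    return sum(comb(n_sets, k) for k in range(min(cap, n_sets) + 1))


print(candidate_count(6, 2))    # 22
print(candidate_count(6, 6))    # 64, i.e. 2**6 with no effective cap
print(candidate_count(200, 2))  # 20101
```

Without a cap the count doubles with every set added, so a large glyph table is only workable if the cap stays small.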
-lem, but some call me fokat