
Re^2: Spam filtering and regular expressions

by fokat (Deacon)
on Jul 30, 2005 at 19:30 UTC (#479638)

in reply to Re: Spam filtering and regular expressions
in thread Spam filtering and regular expressions

I agree with jhourcle's words:

(...) distinctions are context sensitive (...)

This is totally true: spammers know this and use it to get around spam filters built this way. One approach we're looking at uses a _capped_ number of replacement sets (e.g., perform just the 1 (one) to l (ell) translation at a time) and evaluates each resulting variant against the regular expressions.
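To make the idea concrete, here is a minimal sketch (in Python for brevity, not our actual code). The substitution table, the pattern and the names `variants` and `matches_any` are all made up for illustration. It applies one translation at a time and tests each variant against the regexes.

```python
import re

# Hypothetical look-alike table: (obfuscated character, intended letter).
SUBSTITUTIONS = [("1", "l"), ("0", "o"), ("@", "a")]

# Hypothetical spam-flag pattern, for illustration only.
PATTERNS = [re.compile(r"\bpills\b", re.IGNORECASE)]


def variants(text, substitutions=SUBSTITUTIONS):
    """Yield the original text, then one variant per single translation."""
    yield text
    for src, dst in substitutions:
        if src in text:
            # Only ONE translation is applied per variant (the cap).
            yield text.replace(src, dst)


def matches_any(text, patterns=PATTERNS):
    """True if the text or any single-translation variant matches a pattern."""
    return any(p.search(v) for v in variants(text) for p in patterns)


if __name__ == "__main__":
    print(list(variants("cheap pi11s")))
    print(matches_any("cheap pi11s"))
```

Because each variant carries a single translation, a digit that really is a digit elsewhere in the message is only "translated" in the variants built for that specific pair, which is one way of coping with the context sensitivity mentioned above.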

The results we're getting with this are better than with just regular expressions, but not spectacular. There are more knobs to turn (how many replacements to perform and evaluate, what value should every match add to the score and what is the threshold, for instance) in addition to the set of regexes that are used to detect spam-flag phrases.
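The knobs can be pictured with a small sketch like the one below. It is Python, and purely illustrative: the rules, weights, threshold and the names `candidate_texts`, `spam_score` and `is_spam` are assumptions, not what we run.

```python
import re
from itertools import combinations

# Hypothetical look-alike table: (obfuscated character, intended letter).
SUBSTITUTIONS = [("1", "l"), ("0", "o"), ("@", "a")]

# Hypothetical rules: (spam-flag regex, score added when it matches).
RULES = [
    (re.compile(r"\bpills\b", re.IGNORECASE), 2.0),
    (re.compile(r"\bloan\b", re.IGNORECASE), 1.5),
]


def candidate_texts(text, max_replacements):
    """Yield the text plus every variant using at most max_replacements pairs."""
    yield text
    for k in range(1, max_replacements + 1):
        for combo in combinations(SUBSTITUTIONS, k):
            variant = text
            for src, dst in combo:
                variant = variant.replace(src, dst)
            if variant != text:
                yield variant


def spam_score(text, max_replacements=1):
    """Sum the weight of each rule that matches at least one candidate."""
    matched = set()
    for candidate in candidate_texts(text, max_replacements):
        for index, (pattern, _weight) in enumerate(RULES):
            if pattern.search(candidate):
                matched.add(index)
    return sum(RULES[index][1] for index in matched)


def is_spam(text, threshold=3.0, max_replacements=1):
    """Knobs: the threshold and how many replacements to combine."""
    return spam_score(text, max_replacements) >= threshold


if __name__ == "__main__":
    print(spam_score("10an pi11s", max_replacements=1))
    print(spam_score("10an pi11s", max_replacements=2))
```

In this sketch each rule counts once no matter how many variants trigger it. Counting every hit instead would be another knob, and it would inflate scores as the replacement cap grows.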

A similar approach could be implemented using (hairy, IMHO) regexes. Those regexes are likely much harder to maintain and I guess they might be more expensive than the described approach. However, no testing has been done because we do not have a satisfactory solution to benchmark against yet.

Oh... and UTF is going to make for a very, very large set of glyphs.

Indeed. This is why you must cap the number of replacements performed when using this method.
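A rough count shows why. Assume (my simplification) that a replacement set is any non-empty combination of look-alike pairs, each applied to the whole text. The number of variants to evaluate is then the sum of the binomial coefficients up to the cap. The function name below is made up, and the sketch needs Python 3.8 or later for `math.comb`.

```python
from math import comb


def replacement_sets(num_pairs, cap):
    """Count the non-empty sets of at most `cap` pairs out of `num_pairs`."""
    return sum(comb(num_pairs, k) for k in range(1, cap + 1))


if __name__ == "__main__":
    for pairs in (10, 100, 1000):
        print(pairs, replacement_sets(pairs, 1), replacement_sets(pairs, 2))
```

With a cap of 1 the work grows linearly with the size of the glyph table, and with a cap of 2 it already grows quadratically. Uncapped, 10 pairs alone give 1023 sets.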

Best regards

-lem, but some call me fokat
