in reply to Re: Re: Spam filtering regexp - keyword countermeasure countermeasure
Sorry, I guess I wasn't specific enough.
What I meant was to run Bayesian analysis on it so that it is totally independent of the language the text is written in.
You don't want it to be aware of the words it is looking at at all. Instead you want to look at the statistical frequency of the subsections that make up the text. (Technically you could also use sections larger than words, as long as they include whitespace and other characters, but you don't want to use only words.)
For instance, trigraphs (three-character sequences, also called trigrams) usually perform well in that respect. You could even break it down to single characters if you want, but that will slow it down considerably.
To get the real benefit of Bayesian analysis you don't want it to be aware of any words at all. That defeats the purpose, or at least doesn't play to its strengths.
I would try playing with it at different levels. Bigraphs and trigraphs are going to perform well but will be slower; looking at five characters at a time might prove to work well. You would have to test it all out.
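To make "different levels" concrete, here is a rough sketch of cutting a message into overlapping character runs of a chosen size. The function name char_ngrams and the sample text are made up for illustration; nothing here comes from an existing filter.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return every overlapping run of n characters, whitespace included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Hypothetical sample message; note the line break and the double space.
sample = "buy now\nbuy  now"

# Try a few sizes and see how many distinct subsections each one yields.
for n in (2, 3, 5):
    counts = Counter(char_ngrams(sample, n))
    print(n, len(counts), counts.most_common(3))
```

Running the same loop over a real pile of mail would be one way to compare sizes before settling on one.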
So you would break a phrase up into subsections, dump those into your structure (usually a Markov matrix in the end), and then calculate the weights on it.
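One possible reading of "Markov matrix" here, and it is only my guess at a layout, is a table that says how likely each character is to follow a given short context. The name transition_table and the order parameter are hypothetical.

```python
from collections import Counter, defaultdict

def transition_table(text, order=2):
    """Map each run of `order` characters to the probabilities of the next character."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]
        counts[context][text[i + order]] += 1

    # Turn the raw follower counts into weights that sum to 1 per context.
    table = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        table[context] = {ch: c / total for ch, c in followers.items()}
    return table

table = transition_table("abcabd")
print(table["ab"])
```

A real filter would keep one such table per class of mail; this sketch builds only one, from a toy string.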
Then you train it on good and bad mail, and the structure learns how the weights work out for each.
Then, as new mail is compared against that structure, you see what weight it comes away with and sort the mail accordingly.
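The train-then-compare loop could look roughly like the sketch below. It is a plain naive Bayes scorer over character trigraphs with add-one smoothing and equal priors, which is my assumption rather than anything a particular filter does. The class name NgramFilter, the labels "spam" and "ham", and the training strings are all invented for the example.

```python
import math
from collections import Counter

class NgramFilter:
    def __init__(self, n=3):
        self.n = n
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def _ngrams(self, text):
        # Every character counts, including spaces and line breaks.
        return [text[i:i + self.n] for i in range(len(text) - self.n + 1)]

    def train(self, text, label):
        grams = self._ngrams(text)
        self.counts[label].update(grams)
        self.totals[label] += len(grams)

    def score(self, text):
        """Log-odds of spam versus ham; positive means more spam-like."""
        # Add-one smoothing so an unseen subsection never gives a zero probability.
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"])) + 1
        total = 0.0
        for g in self._ngrams(text):
            p_spam = (self.counts["spam"][g] + 1) / (self.totals["spam"] + vocab)
            p_ham = (self.counts["ham"][g] + 1) / (self.totals["ham"] + vocab)
            total += math.log(p_spam / p_ham)
        return total

f = NgramFilter()
f.train("cheap pills cheap pills buy now", "spam")
f.train("meeting notes attached for tomorrow", "ham")
print(f.score("cheap pills") > 0, f.score("meeting notes") < 0)
```

Where to put the cutoff between spam and good mail is something you would tune against your own mail; zero is just the natural starting point for log-odds.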
Do note that when you are doing the character analysis you count every character, including spaces (even several in a row) and line breaks.
In the end, I'm not sure why you would want to do it on your own instead of just using SpamAssassin.
I have used it and went from getting well over 100 spam a day to never seeing spam anymore. (Well, I still get them, but they get filtered out and I never see them.)
There are some odd things afoot now, in the Villa Straylight.