in reply to Re: Re: Spam filtering regexp - keyword countermeasure countermeasure
in thread Spam filtering regexp - keyword countermeasure countermeasure

Sorry, I guess I wasn't specific enough.
What I meant was to do Bayesian analysis on it, so that it is totally independent of the language the text is in.

You don't want it to be aware of the words it is looking at at all. Instead, you want to look at the statistical frequency of the subsections that make up the text. (Technically you could also use sections larger than words, as long as they include whitespace and the other characters - you just don't want to use only words.)

For instance, trigraphs (runs of three characters) usually perform well in that respect. You could even break it down to the character level if you want, but that will slow it down considerably.

To gain the real benefits of Bayesian analysis, you don't want it to be aware of any words at all - that defeats the purpose - or at least doesn't play to its strength.

I would try playing with it at different levels. Bi- and trigraphs are going to perform well, but will be slower. Looking at five characters at a time might prove to work well, but you would have to test it all out.
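
If you want to experiment with the window size, here is a minimal Python sketch of the splitting step. The names `char_ngrams` and `ngram_counts` are made up for illustration, and the sample text is invented. It keeps spaces and line breaks as ordinary characters, so runs of whitespace turn into their own windows.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return every overlapping window of n characters, whitespace included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_counts(text, n):
    """Count how often each n-character window occurs in the text."""
    return Counter(char_ngrams(text, n))


if __name__ == "__main__":
    # Invented sample with a double space and a line break in it.
    sample = "Buy  now!\nBuy now!"
    for n in (2, 3, 5):
        counts = ngram_counts(sample, n)
        # Print the window size, the number of distinct windows, and the top few.
        print(n, len(counts), counts.most_common(3))
```

Running it for a few values of `n` on your own mail gives a feel for how quickly the number of distinct windows grows, which is roughly what drives the speed difference.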

So you would break a phrase up into subsections, dump those into your structure (usually a Markov matrix in the end), and then calculate the weights on it.
Then you train it on good and bad mail, and the structure learns what the weights should be.
Then, as new mail is compared against that structure, you see what weight it comes away with, and the mail gets sorted accordingly.
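
To make the train-then-score loop concrete, here is a rough sketch in Python. It is a plain naive Bayes over character trigraphs with add-one smoothing, which is a simpler stand-in for the Markov-style structure mentioned above, not what SpamAssassin or any other tool actually does. The class name `NgramBayes`, its methods, and the training strings are all hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Overlapping n-character windows; spaces and line breaks count too."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramBayes:
    """Naive Bayes over character n-grams (illustrative sketch only)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}  # n-grams seen per class
        self.docs = {"spam": 0, "ham": 0}    # messages seen per class

    def learn(self, text, label):
        """Add one message to the 'spam' or 'ham' statistics."""
        grams = char_ngrams(text, self.n)
        self.counts[label].update(grams)
        self.totals[label] += len(grams)
        self.docs[label] += 1

    def spam_log_odds(self, text):
        """Positive means the text looks more like spam than ham."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        # Smoothed prior from the number of messages seen in each class.
        score = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for gram in char_ngrams(text, self.n):
            # Add-one smoothing so unseen n-grams never give a zero probability.
            p_spam = (self.counts["spam"][gram] + 1) / (self.totals["spam"] + vocab + 1)
            p_ham = (self.counts["ham"][gram] + 1) / (self.totals["ham"] + vocab + 1)
            score += math.log(p_spam / p_ham)
        return score

    def classify(self, text):
        return "spam" if self.spam_log_odds(text) > 0 else "ham"


if __name__ == "__main__":
    nb = NgramBayes()
    nb.learn("Buy cheap pills now!!!", "spam")
    nb.learn("Meeting notes attached for tomorrow", "ham")
    print(nb.classify("cheap pills"))
```

Nothing in there knows what a word is. It only sees which three-character windows show up more often in the spam pile than in the good pile, which is the whole point.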

Do note that when you are doing the character analysis, you count every character, including spaces (even multiples in a row) and line breaks.

In the end, I'm not sure why you would want to do it on your own instead of just using SpamAssassin.
I have used it and went from getting well over 100 spam a day down to never seeing spam anymore. (Well, I still get them, but they get filtered out and I never see them.)

-------------------------------------------------------------------
There are some odd things afoot now, in the Villa Straylight.

Re: Re: Re: Re: Spam filtering regexp - keyword countermeasure countermeasure
by John M. Dlugosz (Monsignor) on May 13, 2003 at 21:48 UTC
    So, SpamAssassin works on characters, not words?

    Maybe I'll try that as an alternative to POPFile, if it runs locally and on Windows.

    —John

      SpamAssassin has phrases that it looks for, which come from the development team running genetic algorithms to work out which sections of message text to score and how to score them. The ones that win out in the genetic process make it into the top phrases. (The Bayesian analysis will work on characters or phrases. You just don't want to restrict it to distinct words. You want it to be effectively statistics on the characters, spaces and all, so that it can learn and just use the statistics in your favor.)

      But that in itself isn't what makes SpamAssassin really good. If you sort your spam and non-spam into folders and set it to learn on those, it will train on your own mail (although that makes it slower).

      I'm a big fan of SpamAssassin and use the most recent code, although it doesn't seem to have changed much lately. I went from 500 spams a day down to 100, and then after tweaking SpamAssassin got down to one a day that would sneak through, then one a week. After a few months of it I now no longer see any of my spam (unless I go and look in the file I have it sorted out into).
      For months I checked to see if it was grabbing mail that it shouldn't be. It only did once, and that was because my mom wasn't on the whitelist and her dial-up Mindspring account was getting enough points to make it think her mail was spam.
