http://www.perlmonks.org?node_id=191096


in reply to Re: From a SpamAssassin developer
in thread Bayesian Filtering for Spam

Well, I've now written what I think is basically equivalent to Paul's Lisp code (including details like discarding all but the 15 most interesting features) and tested it.
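For anyone who hasn't read the essay, here is a rough Python sketch of the "15 most interesting features" step as I understand it. It is not the actual code from either implementation. The function name, the `keep` parameter and the 0.01/0.99 clamping bounds are my own choices for illustration, and "interesting" is taken to mean "furthest from the neutral value 0.5".

```python
from math import prod


def combine_most_interesting(token_probs, keep=15):
    """Combine per-token spam probabilities into one score for a message.

    token_probs: iterable of floats in [0, 1], one per token in the message.
    keep: how many of the most "interesting" tokens to use (assumed default 15).
    """
    # Clamp so a single 0.0 or 1.0 cannot zero out a whole product
    # (the exact bounds here are an assumption for this sketch).
    clamped = [min(0.99, max(0.01, p)) for p in token_probs]

    # Rank tokens by distance from the neutral value 0.5 and keep the top few.
    ranked = sorted(clamped, key=lambda p: abs(p - 0.5), reverse=True)
    top = ranked[:keep]

    # Naive Bayes style combination: prod(p) / (prod(p) + prod(1 - p)).
    spam = prod(top)
    ham = prod(1.0 - p for p in top)
    return spam / (spam + ham)
```

A message whose tokens all sit at 0.5 comes out at 0.5, and a couple of strongly spammy tokens push the score close to 1 even if everything else is neutral.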

The results are (unsurprisingly to me) not as accurate as Paul describes on mixed types of messages.

The most important thing to remember about doing anything with probabilities is to not mix up your training and validation data sets. I get the feeling that Paul isn't keeping them separate when calculating his statistics. I get zero false positives too when I validate against the training data set.
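To make the point concrete, here is a minimal Python sketch of a holdout split plus a false-positive measurement. The names `split_holdout` and `false_positive_rate`, the 20% validation fraction and the `(text, is_spam)` tuple layout are all assumptions for the example, not anything taken from SpamAssassin or AI::Categorize.

```python
import random


def split_holdout(messages, validation_fraction=0.2, seed=0):
    """Shuffle labelled messages and split them into (training, validation)."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    # Everything after the cut is held back and never shown to the trainer.
    cut = len(shuffled) - int(len(shuffled) * validation_fraction)
    return shuffled[:cut], shuffled[cut:]


def false_positive_rate(classify, validation):
    """Fraction of legitimate messages that classify() flags as spam.

    classify: callable taking message text and returning True for spam.
    validation: list of (text, is_spam) pairs held out from training.
    """
    ham = [text for text, is_spam in validation if not is_spam]
    if not ham:
        return 0.0
    flagged = sum(1 for text in ham if classify(text))
    return flagged / len(ham)
```

Train only on the first list and report numbers only from the second. Measuring on the training list instead is what produces the suspiciously perfect scores.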

However, on the plus side, the amount of data stored by his system compared to the pure Bayesian one used in AI::Categorize is significantly smaller. So I'll probably switch over to using this one instead.

I'll post some of the code to the SpamAssassin list later today probably, in case someone wants to play with it some more.

Re: Re: Re: From a SpamAssassin developer
by Elian (Parson) on Aug 19, 2002 at 07:11 UTC
    Don't be too surprised that Paul's solution's not a good general-purpose one. His data set's probably quite small, with good locality, and odds are he made sure to skew his results to his data. It's not that his methods are bad for his needs, just that his needs are rather different than most people's.
      I'm not surprised. Not even slightly - see my original post.

      The biggest thing about statistical analysis is that you simply cannot test it on the training data set. I get 100% accuracy when I do that, and it's not surprising. I'm speculating that's what PG did, but I could be wrong. There's also the fact that training often overfits. None of this is news to anyone versed in machine learning (which I'm starting to be ;-)

      Matt.

Re: Re: Re: From a SpamAssassin developer
by Anonymous Monk on Aug 20, 2002 at 01:23 UTC
    What I would wonder is not whether his method works as well in a general-purpose environment as it does for him. I didn't expect it to. It is rather whether it works well enough to be useful, and how it performs relative to the pure Bayesian one that you already had.