PerlMonks
Re: From a SpamAssassin developer
by Anonymous Monk on Aug 18, 2002 at 12:02 UTC [id://190975]
I grant you that Paul Graham's traffic is easier to tell apart from spam than a marketing person's. I also submit that the system works much better for a single person with focused interests than for multiple people with rather different interests. (Particularly when people disagree on what spam is. I consider chain mail spam. I know people who do not, and who occasionally send the junk to me...)

However, I would suggest looking very closely at his approach rather than just saying, "He is doing Bayesian filtering, we do Bayesian filtering, it worked better for him than for us, so it must just be his data set."

The fact is that he has tuned the numbers in his approach quite a bit, and some of that tuning is "wrong" from a strict Bayesian point of view, but is probably very "right" from a spam elimination point of view. In particular, if a word has only appeared in one or the other body of email, the probability that he assigns to it is .99 or .01 respectively. That means that if he repeatedly gets spams for the same products (which most people do), references to those products almost immediately become labelled as spam. Conversely, approving a single email from a person goes a long way towards labelling any email from that company or person, or about that topic (based on subject keywords), as non-spam.

A Bayesian approach to deciding how strong a piece of evidence a given word is that something is spam would involve assigning a prior distribution and then updating it upon observation. This would take several more observations to learn which words you do or do not like than Paul Graham's very rapid categorization process does. He then compounds this by artificially limiting his analysis to the 15 most distinctive words that he saw, which means that he is heavily biased towards making a decision based on rapid categorizations from a small section of the sample set.

In other words, Paul's algorithm likely works very well, but not necessarily for the theoretical reasons that he thinks apply.
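
To make the contrast concrete, here is a rough sketch in Python (not Paul Graham's actual code, and not SpamAssassin's). The function names, the neutral value of 0.5 for unseen words, and the `strength` pseudo-count are my own assumptions for illustration. `clamped_prob` mimics the hard .99/.01 assignment described above, `smoothed_prob` is one simple way of starting from a prior and letting observations move it, and `combine` looks only at the most distinctive words.

```python
from math import prod


def clamped_prob(spam_count, ham_count, n_spam, n_ham):
    """Per-word spam probability, hard-clamped to [0.01, 0.99].

    A word seen in only one corpus lands on 0.99 or 0.01 after a
    single sighting, which is the rapid categorization described above.
    """
    spam_rate = min(1.0, spam_count / n_spam)
    ham_rate = min(1.0, ham_count / n_ham)
    if spam_rate + ham_rate == 0:
        return 0.5  # assumption: a never-seen word is treated as neutral
    raw = spam_rate / (spam_rate + ham_rate)
    return max(0.01, min(0.99, raw))


def smoothed_prob(spam_count, ham_count, n_spam, n_ham, strength=5.0):
    """Same raw estimate, but shrunk toward a 0.5 prior.

    `strength` (hypothetical) is how many pseudo-observations the prior
    is worth, so a single sighting moves the estimate only slightly.
    """
    spam_rate = min(1.0, spam_count / n_spam)
    ham_rate = min(1.0, ham_count / n_ham)
    if spam_rate + ham_rate == 0:
        return 0.5
    raw = spam_rate / (spam_rate + ham_rate)
    seen = spam_count + ham_count
    return (strength * 0.5 + seen * raw) / (strength + seen)


def combine(probs, keep=15):
    """Combine only the `keep` probabilities farthest from 0.5."""
    top = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:keep]
    spam = prod(top)
    ham = prod(1.0 - p for p in top)
    return spam / (spam + ham)


# One sighting in spam, none in good mail, 100 messages in each corpus.
print(clamped_prob(1, 0, 100, 100))
print(smoothed_prob(1, 0, 100, 100))
```

With these assumed numbers the clamped version jumps straight to .99 after one sighting, while the smoothed version only reaches about .58 and needs several more sightings to get anywhere near certainty. That gap is the "several more observations" I mean.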