I was thinking more along the lines of tie'ing the hashes to a DBM and doing the parse via cron at some kind of reasonable interval. Here are my thoughts now:
- I don't want to keep an entire corpus of 'bad' and 'good' emails forever. The system should update the current word counts instead of redoing the counts from scratch every time. This means that the 'single word' vs. 'phrase' question should be hammered out during the design phase; if I dump the mails from the system, I can't go back and reparse them :)
- I'm looking at the client interaction: how a client can/should flag a message as spam vs. flagging it as 'good'. I can see that there will be times when messages come in that are neither spam nor directed email. What about mis-addressed emails?
- Should the client be able to set the probability threshold of the filter? Do I put messages above that threshold into a separate folder, delete them outright, or add them to the 'bad' email count automatically? What would be the most sane default behaviour?
- In Graham's article, he talks about fiddling with the 'weight' of the good mail to get better results. What if you could put your thumb on the scale of the 'negativeness' of certain spam? Some messages are simply annoying, some spam is out and out offensive. Maybe I should have the program take that into account.
- Which parts of this need to be in a module and which parts need to be in the client? Right now I'm thinking that the module should be able to: accept the full text of a message (including headers), parse it, and store it in the appropriate DBM (with the optional weights from the previous point); accept the full text of a message and return the parsed word list; and take a hashref of words and return a 'score' based on the words and their counts. Everything else should be on the client side.
- How many messages do I need in the 'good' and 'bad' corpus before I can start relying on the probabilities it is giving me? Up to now I've just been deleting spam; now I am storing it in a separate folder for use in my parse. This also ties into my client interaction question: maybe base the available probabilities on the number of messages in the corpus. (e.g. "You can only set the probability filter up to 50% with fewer than 200 spam messages.")
- I can see a group of super-users who can be relied on to make informed spam decisions, maybe by classifying the spam. New users wouldn't need to rebuild the spam corpus, but could import the word counts from super-users. ("I want to import the 'porn', 'medical hoaxes', and 'stock tips' word counts from other, trusted users.") Then each user could further refine those counts with their own spam.
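The update-in-place counting from the first two points can be sketched roughly like this. I'm using Python's dbm module as a stand-in for a Perl tied DBM hash, and the tokenizer, file names, and bookkeeping key are all my own assumptions, not a finished design:

```python
import dbm
import re

def tokenize(text):
    """Naive single-word tokenizer. The 'single word' vs. 'phrase'
    decision has to be locked in before messages are discarded,
    because the raw mail won't be around to reparse."""
    return re.findall(r"[a-z][a-z0-9'-]+", text.lower())

def learn(db_path, message_text):
    """Fold one message's word counts into the running totals,
    so the original corpus can then be thrown away."""
    with dbm.open(db_path, "c") as db:
        for word in tokenize(message_text):
            count = int(db.get(word, b"0"))
            db[word] = str(count + 1).encode()
        # Track how many messages fed this corpus -- useful later for
        # "you need N spam messages before trusting the filter" rules.
        msgs = int(db.get("__message_count__", b"0"))
        db["__message_count__"] = str(msgs + 1).encode()

# A cron job could walk the 'good' and 'spam' folders calling these:
learn("good.db", "Meeting moved to Tuesday, see agenda attached")
learn("bad.db", "Click here for cheap meds and stock tips")
```

Because only counts are stored, the per-message work is cheap enough to run from cron at whatever interval seems reasonable.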
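The module boundary described in the list (take word counts, return a score) might look something like the sketch below. The function names are invented, and the combining formula is my reading of Graham's article, including the good-mail weight he fiddles with; treat it as an illustration, not the implementation:

```python
from functools import reduce

def word_probs(good_counts, bad_counts, good_msgs, bad_msgs, good_weight=2):
    """Per-word spam probability. good_weight doubles the good counts,
    which is the 'weight of the good mail' knob from Graham's article."""
    probs = {}
    for word in set(good_counts) | set(bad_counts):
        g = good_weight * good_counts.get(word, 0)
        b = bad_counts.get(word, 0)
        if g + b < 5:          # too rare to say anything about
            continue
        p = (b / bad_msgs) / ((g / good_msgs) + (b / bad_msgs))
        probs[word] = min(0.99, max(0.01, p))  # never fully certain
    return probs

def score(words, probs):
    """Combine the most 'interesting' word probabilities (those
    furthest from 0.5) into one spam score, naive-Bayes style."""
    interesting = sorted(
        (probs[w] for w in words if w in probs),
        key=lambda p: abs(p - 0.5), reverse=True)[:15]
    if not interesting:
        return 0.5
    prod = reduce(lambda a, b: a * b, interesting)
    inv = reduce(lambda a, b: a * b, (1 - p for p in interesting))
    return prod / (prod + inv)
```

Everything else (folder handling, thresholds, the UI for flagging) stays in the client, as above.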
Two big points I can see here are that the system learns without the user saying anything more than "This is spam", and that, because the counts are atomic, they can be shared. I have been reluctant to go with a blacklist because I think there is the possibility of abuse. Most spam filters require continual updating (which means that you have to be a sysadmin, or you have to know what the hell you are doing). I know that they are effective, I just don't want to have to think about it all the time (as a user or as a sysadmin).
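Sharing really does fall out of the counts being atomic: importing a trusted user's classified counts is just addition, and refining them with your own spam is more addition. A tiny sketch (Counter standing in for the tied hashes; the category names come from the example above):

```python
from collections import Counter

# Hypothetical exported counts from trusted super-users, classified
# by spam type as suggested above.
porn = Counter({"xxx": 40, "hot": 25})
stock_tips = Counter({"stock": 30, "pump": 12})

# A new user imports both categories without rebuilding any corpus...
my_bad_counts = porn + stock_tips

# ...then refines the merged counts with their own spam.
my_bad_counts.update(Counter({"stock": 5, "cheap": 3}))
```

No raw mail ever changes hands, only word tallies, which also sidesteps the privacy problem of sharing an actual corpus.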
That's about all I have to say about that for now. If you see some questions that I'm not asking, let me know.
oakbox