Most of the email spam I get is:
- Character sets I can't read count as undecipherable,
right? 306MB and counting. 166MB of that is
GB2312 alone. (This is since August 27, 2002.)
The various ks_c_ charsets between them account
for another 60MB.
- Stuff that either doesn't specify what charset it's in
or is theoretically in a character set I
can potentially read (mainly UTF-8, which I
unfortunately can't filter because some people
in the open-source community write English messages
in it in preference to ASCII or Latin-1, for no
discernible reason), but whose subject line contains
either long strings of non-alphanumeric characters
or nothing but non-alphanumeric characters, probably
also counts as undecipherable. Another 141MB.
A handful of these have long strings of punctuation
in the subject, but most of them are Unicode
messages written in a non-Latin writing system.
That's since September 2003, when I wrote the rule.
- That virus from a while back, "See the attached
file for details", 235MB.
- Assorted miscellany my filters didn't catch,
166MB (between 2004 April 23 and December 6;
I start a new bin for this periodically so
I can calculate the impact per-day and see
how much it's increasing).
- I did get one CPAN bug report once... for
some reason I filed that under nnml:perl.*
rather than under nnml:spam.*, go figure.
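A rough sketch of a subject-line rule like the one described
in the second item above. This is not the actual rule, which
isn't shown here. It assumes the subject has already been decoded
to a Unicode string, that "alphanumeric" means ASCII letters and
digits, and that a "long string" is six or more symbols in a row.
The names looks_undecipherable and MAX_SYMBOL_RUN are made up
for illustration.

```python
import re

# Hypothetical threshold: how many non-alphanumeric, non-space
# characters in a row count as a "long string".
MAX_SYMBOL_RUN = 6

ASCII_ALNUM = re.compile(r"[A-Za-z0-9]")
SYMBOL_RUN = re.compile(r"[^A-Za-z0-9\s]{%d,}" % MAX_SYMBOL_RUN)


def looks_undecipherable(subject: str) -> bool:
    """Flag a decoded subject that has no ASCII letters or digits at
    all, or that contains a long run of other characters."""
    text = subject.strip()
    if not text:
        return False  # empty subjects are a different problem
    if not ASCII_ALNUM.search(text):
        return True   # nothing but non-alphanumeric characters
    return SYMBOL_RUN.search(text) is not None
```

A check like this is easy to dodge: a subject that mixes a few
ASCII letters or digits in between short runs of non-Latin text
passes both tests.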
The unfiltered stuff (which lands in my inbox
and gets shifted manually) is what annoys me most,
and I'm continually looking for ways to reduce it,
without getting false positives. (My experiments
with Bayesian filtering were a wash; after training
ifile on my entire very large corpus of mail, I found
that I had to continually go through the whole spam
bin for false positives. With the system I use now,
I don't go through the filtered ones, only the
unfiltered ones that land in my inbox.)
Some of the kinds of spam that land in my inbox
include the following:
- Messages with an enigmatic or vague subject
line (that looks like a Markov chain or
random dictionary words) and no content --
absolutely nothing in the body at all, no
HTML part, no attachment, no nothing.
I seem to get a fair amount of this, and I'm
confused as to what possible reason the
spammers could have for sending it.
- 419s. I haven't found a solid way to detect
them (without false positives) yet.
- Phony giveaways
- Adverts for warez
- Adverts for medical products that do not,
in fact, exist: ways to reverse the aging
process, cures for cancer, and the like
- Spam written in Latin characters, but in a
language I don't read. Spanish predominates
in this category, but I've seen German, French,
and I think Italian. If I get any Portuguese,
I probably mistake it for Spanish.
- Spam written using non-Latin characters (but
without specifying the charset as such, either
because it's not specified at all or because
it's Unicode) that slips past the filter rule
for non-alphanumeric subject lines by throwing
in alphanumeric characters in a few spots.
- Various prescription meds adverts that slip
past my filtering rules. Most of them seem
to slip past, even though I've tried to be
clever with my regular expressions. I write
stuff like "^Subject.*[Vv].?[Ii1l|].?[Aa@].?[Gg].?[Rr].?[Aa@]"
but they still find other ways to say it and
slip past. I think they use lookalike Unicode
characters. Did I mention that Unicode is a
plague and a nuisance? Yeah.
- Sundry other nonsense and junk.
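If lookalike characters really are how the meds adverts get
past the regular expression quoted above, one possible approach
is to fold the subject to plain ASCII-ish text before matching.
The sketch below is untested against real spam and makes several
assumptions: the header line is already decoded to Unicode,
accented and fullwidth letters are handled by NFKD normalization,
and cross-script lookalikes are handled by a tiny hand-made table
that is nowhere near complete. The names fold, matches_rule, and
LOOKALIKES are made up for illustration.

```python
import re
import unicodedata

# The pattern quoted in the list above, unchanged.
PATTERN = re.compile(r"^Subject.*[Vv].?[Ii1l|].?[Aa@].?[Gg].?[Rr].?[Aa@]")

# Hypothetical, deliberately tiny table of cross-script lookalikes.
LOOKALIKES = str.maketrans({
    "\u0430": "a",  # Cyrillic small a
    "\u0456": "i",  # Cyrillic small Byelorussian-Ukrainian i
    "\u03b9": "i",  # Greek small iota
    "\u03bd": "v",  # Greek small nu
})


def fold(line: str) -> str:
    """Decompose characters (NFKD), drop combining accents, then map
    the known lookalikes to their ASCII counterparts."""
    decomposed = unicodedata.normalize("NFKD", line)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return stripped.translate(LOOKALIKES)


def matches_rule(header_line: str) -> bool:
    """Apply the original pattern to the folded header line."""
    return PATTERN.search(fold(header_line)) is not None
```

The folding step leaves ordinary ASCII subjects untouched, so the
original pattern keeps matching whatever it matched before; it
only widens the net to accented, fullwidth, and the few listed
cross-script spellings.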
However, even the stuff that gets filtered is a
significant annoyance, because of the bandwidth it
uses. I'm on 33.6 dialup here, so retrieving my
mail takes a few minutes; when most of what I'm
retrieving is unsolicited bulk mail, it's
annoying to have to wait for that.