PerlMonks  

(OT) Fighting spam

by Aristotle (Chancellor)
on Nov 16, 2003 at 09:00 UTC

I've recently been catching up slowly but steadily with a lot of threads I'd put on the back burner, and just read Enough is Enough - Taking the fight back to the Internet scammers. Both the discussion on that thread and its subject in the broader sense reminded me of a number of observations, ideas, and concepts I've seen lately regarding that subject. While this node is not exactly on topic for the monastery, both automated sending and automated filtering of mail come up at PerlMonks frequently enough that I believe the following items are of general interest.

First off is something I first saw on Andy Lester's weblog (alias petdance), in an entry titled Content-based spam filtering is a dead-end path. It ties in with the observations in brother tachyon's post on the aforementioned thread. The fact is that spammers are starting to fill their mails with innocent and/or random text while avoiding any direct mention of their advertised goods(?), instead describing them in roundabout terms. Consequently, in the mid to long term, content-based filters will become useless until artificial intelligence makes a significant breakthrough (ahem).

The only way to effectively uproot the problem is to fix it at the protocol level once and for all. The most promising new concept in that direction, and a highly intriguing one at that, is described in an experimental IETF draft. It entails the introduction of a new DNS resource record called RMX, Reverse Mail Exchange, to aid the recognition of forged sender addresses. The idea is brilliantly simple: the RMX DNS RR lists legit sender's IPs for mail being sent from this domain. When a mail server receives a connection, it compares the originating IP with the list given by the RMX RR for the MAIL FROM domain of delivered mail. Mail that fails this check is discarded as illegitimate.
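As a rough illustration of the check described above, here is a minimal Python sketch. The `RMX_RECORDS` dictionary is a hypothetical stand-in for the DNS lookup (the draft's actual record format is not reproduced here), and the function name and example addresses are made up for this example.

```python
import ipaddress

# Hypothetical stand-in for an RMX lookup: domain -> networks allowed to
# send mail for it. A real implementation would query DNS, not a dict.
RMX_RECORDS = {
    "example.org": ["192.0.2.0/24", "198.51.100.25/32"],
}

def rmx_allows(mail_from_domain, client_ip, records=RMX_RECORDS):
    """Return True if client_ip may send for the domain, False if it may
    not, and None if the domain publishes no record at all."""
    networks = records.get(mail_from_domain.lower())
    if networks is None:
        return None  # no record: the check cannot decide either way
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(net) for net in networks)
```

The three-way result matters in practice: a domain without a record is a different case from a domain whose record excludes the connecting host, and a receiver has to decide separately what to do with each.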

The extent of this scheme's brilliance is hard to summarize. In one simple step, forged sender addresses become a thing of the past. It is much simpler than any cryptographic authentication scheme proposed to date and at least as robust as any of them. Unlike them, it retains much of the anonymity of mail as we know it. All the necessary infrastructure already exists (a huge bonus).

But in the meantime, we have to find ways to keep spam out of the inbox without any support on the technical level. Half a dozen years' worth of experience in this area suggests that the only viable approach is to win without fighting. The approach is to take the route known as the only sensible one in security: deny by default, permit explicitly. Obviously using a whitelist is no new idea. Innovation comes into play by adding flexible recognition for solicited bulk email such as mailing list traffic, and by using a spam filter's scoring mechanism to rank the rejects. They're put in a grey (as in, almost black) box sorted by ascending spam score so that legitimate mail sorts to the top. The whitelist is updated by bouncing legitimate rejected mail to a special address, or by using a special bind in your mailer. All you need to do then is skim the top of the greybox once in a while for legitimate mail and bounce out the keepers.
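The greybox idea can be sketched in a few lines of Python. The message layout (dicts with `sender`, `list_id` and `spam_score` keys) and the function name are assumptions made for this example; any spam scorer could supply the score.

```python
def sort_mail(messages, whitelist, list_ids=frozenset()):
    """Split messages into (inbox, greybox).

    Each message is a dict with the hypothetical keys 'sender',
    'list_id' (None for non-list mail) and 'spam_score'
    (higher means more spam-like).
    """
    inbox, greybox = [], []
    for msg in messages:
        # Deny by default: only whitelisted senders and recognized
        # mailing lists reach the inbox.
        if msg["sender"] in whitelist or msg.get("list_id") in list_ids:
            inbox.append(msg)
        else:
            greybox.append(msg)
    # Ascending score, so likely-legitimate rejects sort to the top.
    greybox.sort(key=lambda m: m["spam_score"])
    return inbox, greybox
```

Skimming then means reading only the first few greybox entries and adding their senders to `whitelist`.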

Makeshifts last the longest.

Replies are listed 'Best First'.
Re: (OT) Fighting spam (use a layered defense)
by grinder (Bishop) on Nov 16, 2003 at 13:58 UTC

    RMX, like DSP (Designated Sender Protocol), won't work. Some people smarter than I have already commented on the issue.

    Let me add my own observations. RMX and DSP require everyone to participate for them to work. If some people don't bother to implement them, and you wish to receive mail from them, then you have to special-case them. That takes effort. It doesn't work for me, now.

    Given the current state of affairs, with the pitiful levels of adherence to the current recommendations, it is illusory to believe that people would implement the new recommendations correctly.

    Today, I see people running SMTP servers with incorrect or absent reverse DNS records (PTRs). I see people with MX records that point to CNAMEs, or worse, numeric IP addresses.

    I see people connecting to my servers with my IP address, or my domain name, in their HELO string. I see hotmail servers connecting to me with "HELO hotmail.com", rather than giving the FQDN of the machine, which makes it harder to stop forged hotmail.com messages. If everyone respected the current RFCs (and read the recommendations as s/should/must/g) things would already be a whole lot better. Until then, there's not much point adding one more damned thing to go wrong into the picture.

    I also see people connecting to me with "HELO yahoo.com" or "HELO compuserve.com". And no legitimate SMTP server from these domains announces itself that way. So I can block them, and reject their e-mail, right up front, before I see their data.

    I block 90% of the incoming spew merely by running simple correlation checks against the envelope (the HELO, the MAIL FROM and the RCPT TO). I delete a bit more by examining the subject line. Send me a message with a subject of "Hi" and you'll get a bounce "only spammers say 'hi'". A message with 10 or more consecutive spaces is also grounds for rejection. I refuse connections from ADSL/cable dialups and similar residential addresses.
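A few of the checks listed above can be sketched as follows. This is only an illustration, not grinder's actual configuration: the names in `MY_NAMES` are hypothetical, the set of bare HELO domains is taken from the examples in the post, and applying the ten-spaces rule to the subject line is an assumption.

```python
import re

# Hypothetical identity of the receiving server.
MY_NAMES = {"mail.example.net", "example.net", "192.0.2.1"}
# Bare domains that, per the post, their real servers never announce.
FORGED_HELO = {"yahoo.com", "compuserve.com"}

def reject_reason(helo, subject):
    """Return a rejection reason, or None if these checks pass."""
    helo = helo.strip().lower()
    if helo in MY_NAMES:
        return "HELO claims to be this server"
    if helo in FORGED_HELO:
        return "HELO uses a bare domain its real servers never announce"
    if subject.strip().lower() == "hi":
        return "only spammers say 'hi'"
    if re.search(r" {10,}", subject):
        return "10 or more consecutive spaces"
    return None
```

The HELO checks can run before the DATA phase, which is what lets the server refuse the message without ever receiving its body.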

    With that in place, a trickle of spam still comes through. That can be caught with content-filtering. While the spam in Andy Lester's example fools Bayesian scoring, it won't fool Markov chain analysis. The odds of finding the word stream "fixed for rough pencil final happy" in a legitimate message are as close to zero as there is precision in current hardware floating point implementations. (And you are of course not subjecting the usual group of servers you exchange messages with to these rules, are you? If a friend wants to joke with me about how I should enlarge my penis, I want to hear about it.)
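To make the word-stream argument concrete, here is a deliberately crude Python sketch: it counts adjacent word pairs in known-legitimate text and reports how many pairs in a new message were never seen. It is a stand-in for real Markov chain analysis, not a description of any particular filter, and the function names are invented for this example.

```python
from collections import Counter

def train_bigrams(texts):
    """Count adjacent word pairs seen in known-legitimate text."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(zip(words, words[1:]))
    return counts

def unseen_ratio(text, bigrams):
    """Fraction of adjacent word pairs in text never seen in training."""
    words = text.lower().split()
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return 0.0
    return sum(1 for p in pairs if bigrams[p] == 0) / len(pairs)
```

Random filler scores near 1.0 because each word may be innocent on its own while almost no pair of neighbours has ever occurred together; a useful filter would need a far larger training corpus and smoothing for rare but legitimate pairs.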

    Adaptive blacklists, like Vipul's Razor, and greylisting are other techniques worth investigating. I don't really care to win the spam battle, I just want to make it not worth a spammer's time to try and send me their spew. If enough people do that, it will be enough.

      No one is interested in the battle with spam... we all just want a clean inbox. :)

      I agree with most of your points, and I know the weakness of requiring everyone to participate for RMX based defense to work. Still, if it was relied on strictly enough by a significant enough portion of the internet, the pressure to get your RMX RR right or perish would be significant. Even if only the large mail hubs (Hotmail, Yahoo and the many other freemailers) which are frequently used as forged senders implemented this (on both directions, their own RMX RR as well as requiring them from senders) that would be a step forward.

      A problem in general is that non-adherence to protocols is not currently punished (enough); which means neither spammers nor half the population of the internet make any effort to adhere. However, even if adherence were enforced, it still wouldn't be that hard to forge a sender address - which is where RMX comes in.

      Makeshifts last the longest.

        we all just want a clean inbox

        Ah, but we also want to retain the ability to receive legit mail from anyone, even people we've never got mail from before. (I have content on my personal website about puppetry, and about constructing puppet stages. I receive email from arbitrary people who found it in a web search, and wanted additional info about a particular facet of it, on a semi-regular basis. I don't want to make these people jump through extra hoops (web-based "mail" forms and similar) to contact me. Also I maintain a usenet FAQ, though I get fewer questions about that since it's an obscure one.) Also, it seems wrong to penalize legitimate people who want to contact me, because of the abuses of a few utter losers.

        Still, if it was relied on strictly enough by a significant enough portion of the internet, the pressure to get your RMX RR right or perish would be significant. Even if only the large mail hubs (Hotmail, Yahoo and the many other freemailers) which are frequently used as forged senders implemented this (on both directions, their own RMX RR as well as requiring them from senders) that would be a step forward.

        You're daydreaming. The chances of a major ISP of any kind agreeing to reject possibly legitimate incoming mail because it doesn't comply with some new standard are roughly the same as the chances of Microsoft releasing the complete source code for the current version of Office under the BSD license, or Macromedia producing a useful piece of software.


        $;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/
      I think proposals such as DSP are quite promising. (I'm not sure it's 'there' in its current proposal though). The only *real* problem with DSP is that some (relatively few, in reality) people want to be able to send mail from one domain through a non-related ISP's SMTP server. Now, a lot of ISPs are blocking this type of use in any case nowadays in an attempt to reduce spam being sent through their servers. Also, if you say that 'I must be able to send mail from any domain through any SMTP server I have access to', then you're essentially removing any possibility of a protocol based attack on spam - as this is ALL that spammers do, that is always detectable.

      You don't NEED to have everyone using something like DSP to let it help - you can feed it as another variable into a multi-layered spam filtering system. (e.g. if a message comes from a non-blocked DSP compatible domain, you could automatically white-list it, otherwise do your content filtering etc.) You could be cruel, and encourage people to implement it by modifying it so that only people who implement it themselves are allowed to use it :-)

      And, it's not hard to implement at the DNS level. Many mail servers have something similar built in for RBL lookups etc, so it wouldn't be hard for them to modify that. Once someone implements it, then, yes, they could send spam through their own servers, and a crude DSP check would allow it, but it would then be easier to block as you'd have easier checks to put in place as you'd know the email domain the message was coming from.
Re: (OT) Fighting spam
by liz (Monsignor) on Nov 16, 2003 at 11:24 UTC
    The idea is brilliantly simple: the RMX DNS RR lists legit sender's IPs for mail being sent from this domain. When a mail server receives a connection, it compares the originating IP with the list given by the RMX RR for the MAIL FROM domain of delivered mail. Mail that fails this check is discarded as illegitimate.

    I like the idea. But the paranoid side in me wonders whether this is a good thing in the long run as it will turn the attention of professional spammers towards DNS. DNS as it currently stands is relatively easy to fake in any upstream server. Heck, I've seen it faked inside a LAN (just a matter of getting your answer on the wire before the "real" nameserver replies). And with the right TTLs, the wrong information is going to stay there for a long time.

    Before anything like this is tried to fight spam, I think it would be more important to get secure DNS accepted worldwide. That at least should make it a lot harder for spammers to start messing with DNS to get their spam sent through if the RMX DNS RR scheme were to gain wider acceptance.

    Liz

Re: (OT) Fighting spam
by vacant (Pilgrim) on Nov 16, 2003 at 19:00 UTC
    I'd like to put in a couple of short comments FWIW, the first being sublime optimism, and the second being raving paranoia.

    First, the optimism: I have been astounded at the effectiveness of the "Naive Bayesian Filtering" I have observed both in the Mozilla filter, and from some fiddling around I have done. I can't believe this method has gone from "amazingly effective" to "dead end" in a few short months. It is called "naive" because it depends entirely upon the fundamental statistical method it uses. Suppose one were to add a dictionary (or even a heuristic) to recognize random words and nonsense words, or words with two letters transposed? Add, for instance, one or more weighted tokens to the statistical tables that are a function of this added analysis. It should add, I believe, about as much overhead as a spelling check. I think it is far too soon to give up on this simple, cheap(!), unobtrusive, uninvasive, and so far effective method of self defense. It might also incite senders of email to improve their speling skills.

    Now the paranoia: I have watched the Internet progress from a joyful, open, friendly world-wide community in the direction of a wholly-owned means of delivering commercial dumbth, just like television, but with a reverse channel for credit card payments. It is progressing distressingly quickly. Every time some scammer or opportunist pulls another fast one, the Internet gets more difficult, more paranoid, more regulated, more complicated, and more favorable to corporations with big budgets and less favorable to everyone else. The deterioration results less from the bad guys than from the reaction to the bad guys.

    Let us be very careful about adding complexity. Each new wrinkle makes it more difficult for the public, the amateurs, the open source contributors to compete with the enormously wealthy folks who want to take the Internet away from us. Worse, the more complex the system, no matter how well-intentioned, the more opportunities there are for the black-hats to exploit. Spammers, fraudsters, and panderers are going to continue to thrive on the 'net just as they do IRL. There will continue to be thousands of hijacked consumer appliances as long as crappy software is cheaper to produce than solid software, and there will always be those who respond to the junk email, because a certain portion of the population is going to continue to be credulous where they should be paranoid, like yours truly. Let us continue to oppose the exploiters, but very, very carefully.

      I have to think you haven't quite understood how Bayesian filtering works. The stuff you're talking about (random words, transpositions) already makes an impact in your statistics. In fact, it is better not to put them in the "correct" bucket, because as Paul Graham noted, where a spammer may try to subvert rule based filters with "vi.agra" instead of "viagra", the former will get marked as a 100% indicator for spam, where the latter might have been innocent. The same goes for random words.
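The effect can be seen in a per-token probability of the kind used in naive Bayesian filters. The sketch below loosely follows the shape of the formula in Paul Graham's "A Plan for Spam" (good counts doubled, result clamped, rare tokens ignored); the function name, parameter names and the default cut-off of five occurrences are written from memory and should be treated as assumptions.

```python
def token_spam_probability(good_count, bad_count, n_good, n_bad, min_seen=5):
    """Spam probability for one token, or None if it is too rare.

    good_count / bad_count: occurrences in legitimate / spam mail.
    n_good / n_bad: number of legitimate / spam messages in the corpus.
    """
    g = 2 * good_count  # double the good count to bias against false positives
    b = bad_count
    if g + b < min_seen:
        return None  # not seen often enough to be interesting
    good_freq = min(1.0, g / n_good)
    bad_freq = min(1.0, b / n_bad)
    p = bad_freq / (good_freq + bad_freq)
    return max(0.01, min(0.99, p))  # never fully certain either way
```

A mangled token that only ever appears in spam lands at the upper clamp, while the correctly spelled word may sit somewhere in the middle because it also occurs in legitimate mail. The `min_seen` cut-off is the threshold the next reply objects to: a mangling that is never repeated never gets a probability.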

      As for the added complexity, it is not much complexity to add here at all. That's what's so appealing about it to me. There is no fundamental change in the way mail works with this scheme, as opposed to many others proposed so far. And I have a hard time following the argument that complexity necessarily makes a system easier to exploit. Taint checks make a program more complex, too. Encryption adds complexity, but I'm sure no one uses telnet for remote shells over the internet anymore. Complexity is not evil by itself - that's much too simplistic a world view. Everything should be as simple as possible, but no simpler (to invoke a well known quotation).

      Makeshifts last the longest.

        In fact, it is better not to put them in the "correct" bucket, because as Paul Graham noted, where a spammer may try to subvert rule based filters with "vi.agra" instead of "viagra", the former will get marked as a 100% indicator for spam, where the latter might have been innocent.

        The problem with this is, there are too many ways to mangle a word such as "viagra". I've seen fifty or so variations already.

        This is basic arithmetic: if there are four ways to do v, four ways to do a, eight ways to do i, seven places to add extra character(s), and a large number of different combinations of extra characters that can be added (any combination of punctuation, for example; I've also seen "creme" on the end, and I'm sure there are other possibilities), that makes 4*4*8*7*n different ways to spell the word, where n is a large number. Repeat for other popular drugs (vicodin gets spelled even more creatively, for example). Add to this the threshold on how many times a word has to occur to be interesting, and just the order-prescription-drugs spammers alone will be sending you several *million* messages before your naive bayesian filters become effective.
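The count above can be worked through directly. In this small Python sketch the per-feature counts are the ones given in the post, and `n_fillers` is the unspecified "large number" n, left as a parameter.

```python
import math

# Per-feature counts taken from the post above.
CHOICES = {"v": 4, "a": 4, "i": 8, "insert_positions": 7}

def spelling_variants(n_fillers):
    """Number of spellings: 4*4*8*7*n, with n distinct filler strings."""
    return math.prod(CHOICES.values()) * n_fillers
```

With n = 1 that is already 896 spellings of a single word, and each one must cross the occurrence threshold separately before the filter counts it.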

        This is only true for the serious hardcore mutating spam, the stuff that's always sent from Asia so as to be utterly untraceable, the stuff that gets a whole new subnet every month or so, the stuff that mutates every single aspect of the headers with just about every single message. However, since that stuff is most of the spam I get...

        The only thing that's consistent about this stuff is that the IP address from which it's sent never EVER has a PTR record in in-addr.arpa space. If I ran my own mail server, the first thing I would want to implement is a ticket-verification scheme for messages sent from hosts without proper reverse DNS. 99% of the legit mail comes from a host with a proper PTR record, and that mail would be undelayed. The rest would go through one of those one-time verification systems wherein each sender would have to respond once to a verification probe and then would be whitelisted. (Of course, if everyone did this the scumbags would probably arrange to be a domain registrar so that it would cost them little or nothing to burn a domain for each batch of spam...)
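The policy described above amounts to a small routing decision. In this Python sketch the reverse-DNS test is passed in as a function (`has_ptr`, a hypothetical name) so that the logic stays independent of any real resolver; in practice it could wrap a lookup such as `socket.gethostbyaddr`.

```python
def triage(client_ip, sender, whitelist, has_ptr):
    """Decide what to do with a message at delivery time.

    has_ptr: callable taking an IP string and returning True if the
    host has a proper PTR record. Injected so it can be stubbed.
    """
    if has_ptr(client_ip) or sender in whitelist:
        return "deliver"    # the bulk of legitimate mail: no delay
    return "challenge"      # send a one-time verification probe

def confirm(sender, whitelist):
    """Called once a sender has answered the verification probe."""
    whitelist.add(sender)
```

Only senders on hosts without reverse DNS ever see a probe, and only once, which keeps the extra traffic far below that of challenging every unknown sender.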

        See, this is the problem with Paul Graham's approach: the spammers are busy thinking about circumvention, an issue that he ignores completely. If we want to stop spammers from getting through our filters, we're going to have to be more thorough about our approach, in terms of predicting and preventing simple attacks. Naive bayesian filtering eats flaming death when the spammers switch from plain language to euphemism and throw in some Markov chains (thirty-year-old technology). I predicted this within five minutes after I read Paul Graham's original article on the topic. Sure enough, when I tried out ifile (seeded with thousands of messages in each category), it was maybe 75% effective, making errors in both directions -- useless. It was admittedly very good at filtering out the simplistic spam, especially things like 419 spam, but it failed miserably on the hard stuff.

        A simple technique is not going to solve the matter. The spammers combine techniques. Lots of techniques. We need to combine techniques as well. We need to apply regex technology, so that "monster rod" and "M0n-stur R0>" are the same phrase or at least considered very similar, and then we need to look at not just individual words but phrases, combinations of certain words together in close proximity to one another, and so forth, so that "M0n-stur R0>" scores as a close match to "Turn your rod into a monster." (Yeah, more CPU time. So be it. CPU time is cheaper than my time and cheaper than my bandwidth, too.) In short, our filters need to be less naive, need to combine various techniques. Can bayesian analysis help? Sure. Can it do the job by itself? No. Can regular expressions do the job? No. But they can help...
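One way to picture the normalization step is the Python sketch below. The substitution table is a small made-up sample, and `difflib.SequenceMatcher` is only one convenient similarity measure from the standard library; neither is claimed to be what any existing filter uses.

```python
import re
from difflib import SequenceMatcher

# Small, hypothetical table of look-alike characters.
LOOKALIKES = str.maketrans(
    {"0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"}
)

def normalize(text):
    """Lower-case, undo look-alike substitutions, drop filler characters."""
    text = text.lower().translate(LOOKALIKES)
    text = re.sub(r"[^a-z ]", "", text)       # strip punctuation and digits
    return re.sub(r" +", " ", text).strip()   # collapse runs of spaces

def similarity(a, b):
    """Similarity in [0, 1] between the normalized forms of a and b."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

After normalization the mangled phrase becomes "monstur ro", which scores well above 0.8 against "monster rod". Matching reordered phrases such as "Turn your rod into a monster" would need word-level comparison on top of this.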


        $;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/
Re: (OT) Fighting spam
by pg (Canon) on Nov 16, 2003 at 20:20 UTC

    Looked at from a micro view, the new protocol surely helps. Looked at from a macro view, the entropy of the cyber world is increasing all the time, and this process cannot be stopped.

    Wait 20 years, and you will be bothered by more unsolicited information than now, not less. If that unsolicited information does not appear in one form, it will show up in another.

    I am not too frustrated by this picture. The purpose of the internet itself is to increase human contact and make it easier. The right expectation is that negative things always accompany the positive ones.

    Of course we should try our best to stop those spammers, but at the same time we should set the right expectations.

Re: (OT) Fighting spam
by woolfy (Chaplain) on Nov 17, 2003 at 10:07 UTC
    Spammers are not all stupid, even though most of them are. And even the stupid ones, certainly the ones with adequate funding, have enough brains to hire people that are technically in-the-know regarding preventing spam being filtered, one way or the other. Everything that is invented by the antispammers can and will be counteracted by the spammers, sooner or later.

    As with many other things it all boils down to human behavior, and sometimes that needs to be supported.

    The safety of a nuclear power plant depends not only on technical provisions, rules and training, but mostly on the way humans deal with the techniques, rules and their everyday work. If they behave badly, we have Chernobyl.

    In traffic, safety depends on well-designed and well-built cars, good roads, traffic lights and other traffic regulating objects, and of course rules and training. The huge number of deaths in traffic shows that a lot of humans don't care about safety as much as we could hope for.

    In a lot of cases, governments make up a lot of rules to increase safety. People who break those rules can get fined or go to jail (or have to do community service, undergo therapy etc).

    The Dutch parliament is working on amendments to the telecommunications bill which will make sending spam illegal, and spammers will have to prove for each and every email address they are sending spam to that they have the owner's permission to send spam. The new rules also have to be approved by the Dutch senate. The new rules are not as strict against spam as antispammers wanted or hoped for, but it is an important step in the right direction. The fines for spammers can be high, like over half a million dollars.

    If more countries follow this example, and if all countries that have bills like this also work on prosecution of spammers, spamming will no longer be as profitable as it seems to be now. Enter the spam police and spam detectives. Maybe we will see a new profession: the spam bounty hunter. Hunt them down, get hot proof, arrest them, deliver them to justice and "hang" them in public.

    On the other hand, we have to be careful with all possible rules and technical innovations: anonymity in many aspects of life can be good and must be protected. I don't want to pay for everything with a credit card or bank card, I don't want the government or whoever to know my every move, my every word, my every wish. The world is still not the world depicted in 1984, but in some aspects it already is worse than Orwell ever imagined. Please be careful with our freedom of mind, speech and actions, wherein anonymity can be a last resort.

    Therefore I insist: the most important aspect is still the human aspect. As long as people react to spam, buy things from spammers, do business with spammers, it is profitable to spam. Just like speeding in traffic or not caring for pressing the right buttons in a nuclear power plant: stupidity and malice on the internet are just human, and rules, technical innovations and provisions, fines and whatever we can think of more, do not improve safety, happiness, quality of life.

    But still, we must try. For my part, I'm trying to help myself first. I never react to spam. I've got a lot of email filters, and most of the spam I receive (30 to 100 a day) ends up in the spam bin, unseen and unread by me. I'm working towards a negative mail filter setup: only mail whose sender I know, will be received; all other mail is spam.

Re: (OT) Fighting spam
by zentara (Archbishop) on Nov 16, 2003 at 22:34 UTC
    I don't have the experience at running large mail servers that some of you have; but I would like to comment on RMX. I've been seeing various new user-space programs popping up which do similar things. They seem to work by the user keeping a database of acceptable mail senders. The user doesn't need to do anything except download their mail with this program. If the program finds the sender in the database, it accepts the mail. If it doesn't, it sends the sender a "confirmation request" with some unique id number. If they reply, they get added to the user's database. That's the general operation. It seems that most well-managed mailing lists use this technique to filter out spam.

    So I'm expecting more popmail programs will start using this type of feature. It would be nice if there was a standard, so that the sender's mail program will recognize a "confirmation request" automatically and auto-send the confirmation, if it's valid. I'm sure there are a lot of ways to hack and abuse this, but it has the potential of being a "pretty tight ship" and runs in user space independent of the ISPs.
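The general operation described above can be sketched as a small Python class. All names here are invented for the example, message transport is left out entirely, and the unique id is simply a random token from the standard library's `secrets` module.

```python
import secrets

class ConfirmationGate:
    """Hold mail from unknown senders until they confirm once."""

    def __init__(self):
        self.known = set()   # the user's database of accepted senders
        self.pending = {}    # token -> (sender, held message)

    def receive(self, sender, message):
        """Return ('accept', message) or ('challenge', token)."""
        if sender in self.known:
            return ("accept", message)
        token = secrets.token_hex(8)   # unique id for the confirmation request
        self.pending[token] = (sender, message)
        return ("challenge", token)

    def confirm(self, token):
        """Release the held message if the token is valid, else None."""
        entry = self.pending.pop(token, None)
        if entry is None:
            return None
        sender, message = entry
        self.known.add(sender)
        return message
```

Note that every first contact from a new sender produces extra messages in both directions, which is the cost the replies below take issue with.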

      This is known under various names, the most common one being "tagged message queuing".

      And it sucks. Hard.

      The potential for abuse notwithstanding, it is completely unworkable for people who often get legitimate mail from strangers. These are the people who most need a way to filter spam reliably. These are the people who cannot viably use a traditional whitelist. And these are the people for whom tagged message queuing means a manifold increase in mail traffic. Because much of their traffic comes from as yet unknown sources, nearly every legit mail they get will require four actual messages to be sent (mail, confirmation request, confirmation, confirmation accept notice).

      If you want to kill your mailserver, tagged message queuing is the quickest and most reliable way to do so.

      Not to mention it's automated mail sending, which means it needs to be configured carefully. I've seen people's message queuers junk mailing lists repeatedly because they were too stupid to set it up right.

      Of course it's also a giant pain in the buttocks for the legitimate senders of mail, but who cares, right?

      Makeshifts last the longest.

      Challenge response is horribly flawed. For a full explanation of why we don't want lots of people to use it, read this rant.

      And what happens if Jane doesn't like Joe, so she sends Bob, Bill, Jacob, and 50,000 other people on the millions-of-addresses CD an email from "joe"?

      If each of these other people used CR clients, joe's poor mailbox would be reduced to rubble.

      CR is the same as spam - cost shifting to another person. It should die a quick death.

        Well I'm not in the process of "brainstorming" the best protocol for all of this; but the mail clients of the 50000 other people should detect in the mail headers that it was from Jane, not Joe, and would ask Jane for a confirmation. If she doesn't provide it, then they are dropped to /dev/null. So Joe would never receive them.
Re: (OT) Fighting spam
by zakzebrowski (Curate) on Nov 17, 2003 at 14:05 UTC
    A simple technique which I'm using is that any mail that comes from any address not:
    • Family
    • Work domain (eg: foo.org)
    goes into a "probable-spam" folder. I review the folder about once a week. Sometimes I find addresses to white list, but often, it's only spam.


    ----
    Zak
    undef$/;$mmm="J\nutsu\nutss\nuts\nutst\nuts A\nutsn\nutso\nutst\nutsh\nutse\nutsr\nuts P\nutse\nutsr\nutsl\nuts H\nutsa\nutsc\nutsk\nutse\nutsr\nuts";open($DOH,"<",\$mmm);$_=$forbbiden=<$DOH>;s/\nuts//g;print;
      Yep. That's the essence of the winning-without-fighting approach I mentioned. If you frequently had a couple of addresses to whitelist, instead of only very occasionally, you'd sort the probable-spam folder (which I called greybox) by spam score, and then you'd have exactly that approach.

      Makeshifts last the longest.
