Spam filtering and regular expressions

by Mr. Lee (Scribe)
on Jul 30, 2005 at 14:20 UTC

Mr. Lee has asked for the wisdom of the Perl Monks concerning the following question:

I have not had good experiences with ready-made spam filter software, but I can't help thinking about filtering rules whenever I get yet another variation of things like CIAL15 or V!A6RA in mail headers.

It should be possible to make something like a 733T5P33CH-Parser that transforms weird characters, and the punctuation above them, into the most likely "normal" characters, so that we can then match the result against the suspicious keywords. Does anyone have ideas about, or experience with, that?

Replies are listed 'Best First'.
Re: Spam filtering and regular expressions
by jhourcle (Prior) on Jul 30, 2005 at 14:46 UTC

    You might want to ask around at the spam tools mailing list. I used to read it religiously when I was responsible for maintaining spam filters.

    I'm guessing someone's probably already done what you describe. If they haven't, I would probably handle it like soundex, but instead of grouping letters that sound alike, group glyphs that look alike. (Note that I specifically didn't say to try to get them to the 'right' value, because the (0Oo) and (1lIi) distinctions are context sensitive ... (100K! M3ds @ lO% 0ff!), and the true meaning doesn't really matter, unless you're trying to determine whether it's intentionally obfuscated, as opposed to just a suspicious keyword.)
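
    A minimal sketch of the look-alike grouping, assuming a small hand-made table of glyph classes (the groupings and the names GLYPH_CLASSES, glyph_key and matches_keyword are illustrative assumptions, not an existing tool). Like soundex, it makes no attempt to recover the 'right' character; it only maps look-alikes to the same symbol, so both the token and the keyword are keyed before they are compared:

```python
from typing import Iterable

# Hypothetical glyph classes: every character in a group collapses to the
# group's representative. The groupings are illustrative guesses only.
GLYPH_CLASSES = {
    "o": "0o",
    "i": "1li!|",
    "a": "a@4",
    "e": "e3",
    "s": "s5$",
    "t": "t7+",
    "g": "g6",
}

# Invert the table: character -> class representative.
CHAR_TO_CLASS = {ch: rep for rep, chars in GLYPH_CLASSES.items() for ch in chars}


def glyph_key(text: str) -> str:
    """Return a soundex-like key in which look-alike glyphs share one symbol."""
    # Case is folded first, so O joins (0 o) and I joins (1 l i).
    return "".join(CHAR_TO_CLASS.get(ch.lower(), ch.lower()) for ch in text)


def matches_keyword(token: str, keywords: Iterable[str]) -> bool:
    """Compare the token and each keyword after both pass through glyph_key."""
    key = glyph_key(token)
    return any(key == glyph_key(word) for word in keywords)
```

    With this table, "lO%" and "10%" get the same key, which is the point: the key says "these look alike", not "this is what was meant".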

    Oh... and UTF is going to make for a very, very large set of glyphs.

      I agree with jhourcle's words:

      (...) distinctions are context sensitive (...)

      This is totally true: spammers know this and use it to get around spam filters built this way. One approach we're looking at uses a _capped_ number of replacement sets (i.e., perform just one translation at a time, such as 1 (one) to l (ell)) and evaluates each result against the regular expressions.

      The results we're getting with this are better than with just regular expressions, but not spectacular. There are more knobs to turn (how many replacements to perform and evaluate, what value should every match add to the score and what is the threshold, for instance) in addition to the set of regexes that are used to detect spam-flag phrases.
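
      A rough sketch of how such a capped scheme could look, under stated assumptions: the replacement sets, the patterns, the weights and the names REPLACEMENT_SETS, SPAM_PATTERNS, MAX_SETS and THRESHOLD are all made up for illustration and are not the setup described above. Each replacement set is applied on its own to the original text, and every resulting variant is scored against the regexes:

```python
import re

# Hypothetical replacement sets, each tried on its own (e.g. only 1 -> l).
REPLACEMENT_SETS = [
    ("1", "l"),
    ("1", "i"),
    ("0", "o"),
    ("!", "i"),
    ("6", "g"),
    ("5", "s"),
    ("@", "a"),
]

# Hypothetical spam-flag patterns with the score each match adds.
SPAM_PATTERNS = [
    (re.compile(r"\bviagra\b", re.IGNORECASE), 3.0),
    (re.compile(r"\bcialis\b", re.IGNORECASE), 3.0),
]

MAX_SETS = 4     # knob: how many replacement sets are evaluated
THRESHOLD = 3.0  # knob: total score at or above which the text is flagged


def score(text, max_sets=MAX_SETS):
    """Score the raw text plus one variant per replacement set, up to the cap."""
    variants = [text]
    for old, new in REPLACEMENT_SETS[:max_sets]:
        if old in text:
            variants.append(text.replace(old, new))
    total = 0.0
    for variant in variants:
        for pattern, weight in SPAM_PATTERNS:
            if pattern.search(variant):
                total += weight
    return total


def is_spam(text, threshold=THRESHOLD, max_sets=MAX_SETS):
    """Flag the text when its total score reaches the threshold."""
    return score(text, max_sets) >= threshold
```

      In this sketch "C1ALIS" is caught by the single 1 -> i set, while "V!A6RA" slips through because it needs two different sets at once, which is one way to see why the results are better but not spectacular.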

      A similar approach could be implemented using (hairy, IMHO) regexes. Those regexes are likely much harder to maintain and I guess they might be more expensive than the described approach. However, no testing has been done because we do not have a satisfactory solution to benchmark against yet.
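
      For comparison, a sketch of the regex-only variant, assuming a hand-written table of per-letter alternatives (LOOKALIKES and obfuscation_regex are hypothetical names, and the table is a guess). Each letter of a keyword becomes a character class, which is what makes hand-maintained versions of these patterns so hairy:

```python
import re

# Hypothetical per-letter alternatives; letters not listed match only themselves.
LOOKALIKES = {
    "a": "a@4",
    "e": "e3",
    "g": "g6",
    "i": "i1l!|",
    "l": "l1|i",
    "o": "o0",
    "s": "s5$",
}


def obfuscation_regex(keyword):
    """Build a case-insensitive pattern that accepts look-alikes per letter."""
    parts = []
    for ch in keyword.lower():
        alternatives = LOOKALIKES.get(ch, ch)
        # re.escape keeps characters such as | and $ literal inside the class.
        parts.append("[" + re.escape(alternatives) + "]")
    return re.compile("".join(parts), re.IGNORECASE)
```

      Generating the pattern from a table keeps the maintenance burden in one place, but every extra alternative widens the class and raises the chance of matching innocent text.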

      Oh... and UTF is going to make for a very, very large set of glyphs.

      Indeed. This is why you must cap the number of replacements to do when using this method.

      Best regards

      -lem, but some call me fokat

Re: Spam filtering and regular expressions
by Popcorn Dave (Abbot) on Jul 30, 2005 at 17:08 UTC
    Rather than relying solely on a regex, you may want to consider something that Thunderbird has an option for - disallowing e-mail from anyone not in an approved list of senders. I've got that option set and it has filtered down spam quite a bit.


    Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.

      Whitelisting (only allowing e-mail from known good addresses) can reduce your spam significantly, but it doesn't deal with viruses, and it has a rather high rate of false positives (rejecting e-mail that you would have wanted to see ... like maybe from that friend from high school that you've lost track of, or your friend telling you he's been fired from his job and had to switch e-mail addresses).

      The only advantage to acting on the e-mail addresses is that the address (well, the envelope sender, not necessarily what shows up in the 'From' header) is sent before the DATA command in SMTP, so you can reduce the bandwidth used by rejecting early. (That only works for envelope-from and envelope-to, though ... and I'm guessing that unless the system allows <> (the null e-mail address), you're going to be losing messages about delivery failures.)
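
      To make the "reject before DATA" point concrete, here is a small sketch of the decision a server could take when it sees the MAIL FROM command. It is an illustration under assumptions, not any real mail server's API: ALLOWED_SENDERS, reply_to_mail_from and the placeholder addresses are hypothetical, and only the numeric reply codes (250, 501, 550) are standard SMTP.

```python
import re

# Hypothetical whitelist of envelope senders; the addresses are placeholders.
ALLOWED_SENDERS = {"friend@example.org", "list@example.net"}

# Extract the envelope sender from a command such as "MAIL FROM:<a@b.example>".
MAIL_FROM = re.compile(r"^MAIL FROM:\s*<([^>]*)>", re.IGNORECASE)


def reply_to_mail_from(command, allow_null_sender=True):
    """Return the SMTP reply to send at MAIL FROM time, before any DATA."""
    match = MAIL_FROM.match(command.strip())
    if match is None:
        return "501 Syntax error in parameters or arguments"
    sender = match.group(1).lower()
    if sender == "":
        # <> is the null sender used by delivery failure notifications.
        return "250 OK" if allow_null_sender else "550 Sender rejected"
    if sender in ALLOWED_SENDERS:
        return "250 OK"
    return "550 Sender rejected"
```

      Because the 550 goes out before the message body is transferred, the rejected mail costs almost no bandwidth; whether <> is accepted is a separate knob.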

      There are a wide variety of methods for attempting to determine whether a message is UCE, but most of them either catch only the obvious stuff or are overly greedy and block legitimate mail. I agree that some regexes suck, but it takes many, many layers to do it well. (If you're going to go the regex route, you might start by looking at the procmail rules from panix. I'd also recommend looking at spam-l and spam tools.)

      I personally find that the best UCE indicator (ie, no false positives, except maybe on spam discussion lists) is when something is obfuscated (octal in IP addresses, HTML w/ hyperlinked urls that don't match the link, javascript to hide the content of the message, etc.)
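
      Two of those indicators are easy to sketch. The following is an assumption-laden illustration (LinkCollector, mismatched_links and has_octal_ip are hypothetical names, and the zero-padding test is only a rough proxy for "octal in IP addresses"): it flags links whose visible text is a URL on a different host than the href, and URLs whose host is a dotted quad with zero-padded parts.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element with an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def mismatched_links(html):
    """Return links whose visible text is a URL on a different host than href."""
    collector = LinkCollector()
    collector.feed(html)
    bad = []
    for href, text in collector.links:
        if re.match(r"https?://", text, re.IGNORECASE):
            if urlsplit(text).hostname != urlsplit(href).hostname:
                bad.append((href, text))
    return bad


def has_octal_ip(url):
    """True if the host is a dotted quad with a zero-padded (octal-style) part."""
    host = urlsplit(url).hostname or ""
    parts = host.split(".")
    return (len(parts) == 4 and all(p.isdigit() for p in parts)
            and any(len(p) > 1 and p.startswith("0") for p in parts))
```

      Neither check looks at what the message says, only at whether someone took the trouble to disguise where it points.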

      For most of my mail addresses, allowing mail only from senders on a whitelist is not an option, because I do want to receive mail from people I don't already know. I regularly hand out the email address of our skating crew, and of course we get loads of spam, but also very valuable messages. And since most of the spam filtering mechanisms offered by the popular web providers (like SpamAssassin) have already filtered out newsletters that I had subscribed to, I only use filtering with a very high threshold, so maybe I have to live with loads of spam in my inbox. Sometimes even a human being finds it hard to tell whether a message is spam just from looking at the title and sender, so how should an algorithm get this right in every case? I think it is impossible as a matter of principle.

        Both you and jhourcle are right. Relying solely on a whitelist is not the way to go. What I was suggesting was to include that as part of a solution.

        I know that with Thunderbird, even though it's using a whitelist, I go through the filtered mail every time I download e-mail to see if something got filtered that shouldn't have been. The program's learning is part of the solution as well; however, that may be beyond the scope of what the OP intends.

        Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.
Re: Spam filtering and regular expressions
by cajun (Chaplain) on Jul 31, 2005 at 03:24 UTC
    I've been using SpamAssassin quite successfully for a while. I think part of the key to success might be using the right combination of rules. My spam has been reduced drastically. Maybe 2-3 spam messages actually make it into my mailbox daily. I save those, then add rules for them to my custom ruleset. YMMV

    As far as the regex you are looking for, you might get some ideas by looking at some of the rulesets at SARE Rules Emporium. There are many rules that deal with the various obfuscations of messages you're trying to keep out of your email.


Re: Spam filtering and regular expressions
by planetscape (Chancellor) on Jul 31, 2005 at 05:33 UTC

    Too bad there isn't a way to put Lingua::31337 into reverse gear...

Re: Spam filtering and regular expressions
by Zero_Flop (Pilgrim) on Jul 31, 2005 at 22:11 UTC
    I love POPFile. It uses Bayes to do most of the filtering, but at the same time it looks for odd letter combinations like V!A6RA to make sure they don't get past the Bayes filter. It can also use magnets to get rid of spam containing known strings you want to deal with.

    The group that maintains POPFile is realistic about the advantages and disadvantages of Bayes, and as spammers develop new techniques they also expand POPFile. I'm not a developer for it; I've just had great success with it.

      So, what is POPFile, anyway? I couldn't find it on CPAN. Any help would be appreciated. Thanks.
Re: Spam filtering and regular expressions
by spiritway (Vicar) on Jul 31, 2005 at 05:02 UTC

    Why trouble to identify the exact words? I think it would be simpler to just have the regex identify obfuscated text, without worrying about exactly what the word is. Look for bizarre arrangements, caps in the wrong places, digits in the words, etc. Maybe keep a list of words that are OK (such as Win98), or (say) module names.
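
    A minimal sketch of that idea, assuming three simple heuristics and a tiny allow list (ALLOWED, the pattern names and the patterns themselves are illustrative guesses, and they will produce false positives on things like mixed-case product names). It never tries to work out which word was meant; it only asks whether a token looks tampered with:

```python
import re

# Hypothetical allow list of legitimate tokens that mix letters and digits.
ALLOWED = {"win98", "utf8", "md5", "mp3"}

DIGIT_AFTER_LETTER = re.compile(r"[A-Za-z]\d")            # e.g. M3ds, CIAL15
PUNCT_INSIDE = re.compile(r"[A-Za-z][!@$|]+[A-Za-z0-9]")  # e.g. V!A6RA
CAPS_INSIDE = re.compile(r"[a-z][A-Z]+[a-z]")             # e.g. viAGra


def looks_obfuscated(token):
    """Guess whether a token is obfuscated, without identifying the word."""
    # Skip allow-listed tokens and anything shaped like a module name.
    if token.lower() in ALLOWED or "::" in token:
        return False
    patterns = (DIGIT_AFTER_LETTER, PUNCT_INSIDE, CAPS_INSIDE)
    return any(p.search(token) for p in patterns)


def obfuscated_tokens(text):
    """Return the whitespace-separated tokens that look obfuscated."""
    tokens = (t.strip(".,;:?()\"'") for t in text.split())
    return [t for t in tokens if t and looks_obfuscated(t)]
```

    The allow list is where most of the tuning would go: every legitimate letter-and-digit token that shows up in your mail has to be added by hand.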

    I would also suggest using a whitelist to identify messages from people you know, and pass them along without further processing. It would save you a bit of time, and also make subsequent testing more effective. People you know might use some suspicious words. Strangers should probably not be using obscenities or mentioning body parts in their first message... so it would be OK to filter strongly on those messages.
