http://www.perlmonks.org?node_id=133023

Update

2002-04-09: The code has been updated to reflect changes in the SpamAssassin module. Any changes have been noted in the comments, and the code as it originally appeared has been retained in the comments as well.

Purpose

My purpose in writing this tutorial is not to extensively cover the capabilities of Mail::Audit or Mail::SpamAssassin. My purpose is to show how I implemented these tools in order to address my filtering needs. I believe that my needs are not unique, and therefore, I hope this tutorial proves valuable in providing a step-by-step guide to using Perl for your mail filtering requirements.

Acknowledgements

Much of the following is taken from the applicable CPAN pages and Simon Cozens' page, as well as a conglomeration of other pages, including the CPAN page for Mail::Procmail, which is not actually used here. To the authors of these pages, I am eternally grateful.

The History of My Problem (Or Identifying The Itch)

Recently, in the space of about 30 minutes, I received in excess of 15 email from the same person carrying the same subject and body. Over the course of the next week, I received over 200 such email. Over the following month . . . well, you get the picture.

I use fetchmail, Procmail and Mutt for processing my email. Naturally, fetchmail retrieves my email, Procmail filters my email into the appropriate folders, and Mutt reads my email.

Fetchmail serves my needs well, and I rarely have any complaints with Mutt. Procmail, however, is another story.

Procmail, for those who are unfamiliar, relies on recipes for filtering email. For example:

:0
* ^From: lll@hotmail\.com
friends

will filter all email from lll@hotmail.com into the friends folder. This is known as a recipe, and must be called from ~/.procmailrc.

When I began experiencing the aforementioned flood of spam, I configured the following and placed it in my .procmailrc:

:0
* ^From:. mortgagef6e@canada\.com
/dev/null

Naturally, I expected all email from mortgagef6e@canada.com to be routed into oblivion. However, for reasons I have yet to determine (and believe me when I say that I worked long and hard on this problem), it did not work. I continued to experience the flood of email into my incoming folder.

The real reason I couldn't solve this problem is that, in my oh-so-humble opinion, Procmail's recipes are too damned difficult. Navigating this maze is the equivalent of a 4-hour college course.

And, frankly, filtering email just shouldn't be that difficult.

Identifying the Solution (Or Finding a Back-Scratcher)

So, I began searching for a solution. I mean, I know enough Perl to get through the day. And Perl does excel at pattern matching. And filtering email is nothing more than pattern matching, right?

A quick search of CPAN led me to Mail::Audit. Perfect! A Perl module to filter email. Additionally, the author has provided a fairly detailed example of using Mail::Audit.

As I began writing the script, it quickly became obvious that, while I could easily identify and route my email, I had no mechanism for filtering spam. And, the repetitious email that started this whole adventure was spam. I needed a solution which would allow me to separate the spam from my legitimate email, while filtering my legitimate email into the appropriate folders. I certainly did not want to reinvent the wheel and have to figure out all of the patterns and tricks of the trade used by spammers. Following tilly's advice that I should assume that Perl has what I want, I once again hit CPAN.

A little more searching led me to Mail::SpamAssassin, a plugin for Mail::Audit that has a very high success rate for filtering spam. SpamAssassin is available as a command-line utility, a daemon, and (obviously) a Perl module.
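
To give a feel for how this kind of filter decides, here is a minimal sketch of rule-based scoring: every rule that matches adds its points, and the message is treated as spam once the total reaches a threshold. The rule names, patterns, and the spam_score and looks_like_spam helpers are hypothetical. The point values simply echo the sample report quoted in the script's comments (7.9 hits, 5 required). This is not SpamAssassin's actual rule set or API.

```perl
use strict;
use warnings;

# Hypothetical rule table: a name, a points value, and a pattern.
# These are illustrative only, not SpamAssassin's real rules.
my @rules = (
    [ 'numeric-ip URL',  2.1, qr{http://\d+\.\d+\.\d+\.\d+/}i ],
    [ 'remove link',     2.5, qr{https?://\S*remove}i ],
    [ 'click here link', 3.3, qr{click here.{0,100}</a>}is ],
);

# Sum the points of every rule whose pattern matches the body.
sub spam_score {
    my ($body) = @_;
    my $score = 0;
    for my $rule (@rules) {
        my ($name, $points, $re) = @$rule;
        $score += $points if $body =~ $re;
    }
    return $score;
}

# A message counts as spam once its score reaches the threshold
# (5 by default, as in the sample report).
sub looks_like_spam {
    my ($body, $required) = @_;
    $required = 5 unless defined $required;
    return spam_score($body) >= $required ? 1 : 0;
}
```

The real module ships a far larger rule set and lets you adjust scores in its configuration file, so treat this only as a picture of the arithmetic.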

Now all I needed was to modify my script to bring these pieces together.

The Script (Or Scratching The Itch)

Following is the commented script in its entirety.

#!/usr/bin/perl -w
#
# program: filter.pl
# description: filters email into appropriate folders

use diagnostics;
use strict;
use Mail::Audit;
use Mail::SpamAssassin;

#
# The default mailbox for delivery
# my $default = "/var/spool/mail/".getpwuid($>) is also an option.
# However, I keep all of my email in ~/mail. Additionally, while I
# have a ~/mail/mbox, I route *all* of my email to a specific
# folder. my mbox should never contain any email, and only exists
# for aesthetic reasons.
#
my $folder = "$ENV{HOME}/mail/";

#
# Specify the location of our log file. We'll be writing several
# status messages here. It is opened before the spam check because
# the spam() subroutine writes to it.
#
open (LOG, ">$ENV{HOME}/syslog/.audit_log");

####################################################################
# Filter spam first
# We knock the spam out of the way immediately. This saves us from
# wasting time processing mail which is obviously spam.
#
# Spam is swept to its own folder, $ENV{HOME}/mail/spam.incoming.
# Mail::SpamAssassin will prepend *****SPAM***** to the subject line
# of the email. Additionally, it prepends something similar to the
# following paragraph to the body of the email (this, as well as
# pattern matches, can be modified by editing the
# spamassassin.cf file):
#
# SPAM: -------------------- Start SpamAssassin results -------------
# SPAM: This mail is probably spam. The original message has been altered
# SPAM: so you can recognise or block similar unwanted mail in future, using
# SPAM: the built-in mail filtering support in your mail reader.
# SPAM:
# SPAM: Content analysis details: (7.9 hits, 5 required)
# SPAM: Hit! (2.1 points) BODY: /http\:\/\/\d+\.\d+\.\d+\.\d+\//is
# SPAM: Hit! (2.5 points) BODY: Link to a URL containing "remove"
# SPAM: Hit! (3.3 points) BODY: /click here.{0,100}<\/a>/is
# SPAM:
# SPAM: -------------------- End of SpamAssassin results ------------
####################################################################
#
# This statement gets the next email from the queue
#
# my $item = Mail::SpamAssassin::MyMailAudit->new();
# The above line is the original code. MyMailAudit no
# longer exists, so we rely on Mail::Audit to retrieve
# the next email from the queue:
#
my $item = Mail::Audit->new();

#
# This statement sets up our handle to SpamAssassin
#
my $spamtest = Mail::SpamAssassin->new();

#
# Now we retrieve the status to determine whether the email is,
# in fact, spam
#
my $status = $spamtest->check ($item);

#
# If the email is spam, write the email back with the aforementioned
# subject and body modifications, then call the spam() subroutine
# for processing (see end of script).
#
if ($status->is_spam ()) {
    $status->rewrite_mail ();
    spam("SpamAssassin",$folder);
}

####################################################################
# Mail::Audit initialization stuff
####################################################################
#
# If we get here, SpamAssassin did not identify the email as spam
#
# Get relevant fields from the message. These are pretty
# self-explanatory.
#
my $from    = $item->from();
my $to      = $item->to();
my $cc      = $item->cc();
my $subject = $item->subject();
my $body    = $item->body();
chomp($from, $to, $cc, $subject);

####################################################################
# Note that we just retrieved $body. Although I
# don't use it here, this provides the ability to
# filter based on the content of the body of the
# email. For example:
#
# if ($body =~ /some_pattern/i) { #do stuff };
####################################################################
#
# Start logging.
#
print LOG ("From: $from\n");
print LOG ("To: $to\n");
print LOG ("Subject: $subject\n");

####################################################################
# End initialization stuff
####################################################################

# I know certain people. We all do. They're L-O-S-E-R-S. And,
# frankly, I don't enjoy receiving email from them. The following
# will identify these email addresses and route them immediately to
# my trash folder (via the trash() subroutine).
#
for (qw(gar079@yahoo.ca badguy@loser.net nasty@whimp.org enemy@hate-u.com)) {
    if ($from =~ /$_/) {
        trash("From a loser",$folder);
    }
}

# I have some programs that email me from various machines. I want
# these email to be immediately routed to ~/mail/home.
#
if ($from =~ /\@exitwound.org/i) {
    $item->accept("$folder"."home");
}

# Now we come to email lists and people who commonly send me email
# (hi Mom!). First, we set up a hash. The key is a pattern to be
# matched against the From:, To: and CC: lines. The content is the
# folder name where the mail should be stored.
#
my %lists = (
    "apache"            => "apache",
    "buckaroo"          => "buckaroo",
    "christianhusbands" => "christian",
    "kde-linux"         => "kde",
    "lawtech"           => "lawtech",
    "debian-user"       => "linux",
    "linux"             => "linux",
    "win4lin"           => "linux",
    "lll\@hotmail"      => "Lori",
    "perlbot"           => "perlbot",
    "dynamite"          => "metal",
    "80s_Rock_Metal"    => "metal",
    "metal"             => "metal",
    "screamsofabel"     => "metal",
    "mavericks"         => "MomDad",
    "hargrojj"          => "MomDad",
    "mutt"              => "mutt",
    "rl2"               => "rl2",
    "focus-linux"       => "security",
);

# Here, we compare the From:, To: and CC: fields with each key of
# the hash and store the email in the corresponding folder
#
for my $pattern (keys %lists) {
    if (($from =~ /$pattern/i) or ($to =~ /$pattern/i) or ($cc =~ /$pattern/i)) {
        $item->accept("$folder"."$lists{$pattern}");
    }
}

# The following code checks whether the To: or CC: field contains the
# phrase "shock." If not, it means that the email is being sent to a
# list (which has not been identified in the previous section).
# Therefore, if my email address is not in the To: or CC: field, I
# assume that it is spam
#
if ($to !~ /shock/i and $cc !~ /shock/i) {
    spam("Apparently not to me",$folder);
}

# If we've made it this far, I'm not sure what it is. Therefore, I
# store it in the Bulk folder.
#
$item->accept("$folder"."Bulk");

# Bye-bye
#
exit;

################ Subroutines ################
#
# This subroutine handles anything identified as spam. It is called
# thusly:
#
# spam("Reason for calling",$folder);
#
# The subroutine will store the email in the ~/mail/spam.incoming
# folder. It will also print a message to the log file identifying:
#
# (1) The spam subroutine;
# (2) The line number which called the spam subroutine; and
# (3) The reason for calling (i.e. "Reason for calling").
#
sub spam {
    my ($tag, $reason, $folder) = ("spam", @_);
    # caller(0) describes the call to this sub, so element 2 is the
    # line that called spam(); caller(1) is empty at the top level.
    my $line = (caller(0))[2];
    print LOG ("$tag [$line]: $reason\n");
    $item->accept("$folder"."spam.incoming");
}

#
# This subroutine handles anything identified as trash. It is called
# thusly:
#
# trash("Reason for calling",$folder);
#
# The subroutine will store the email in the ~/mail/trash folder.
# It will also print a message to the log file identifying:
#
# (1) The trash subroutine;
# (2) The line number which called the trash subroutine; and
# (3) The reason for calling (i.e. "Reason for calling").
#
sub trash {
    my ($tag, $reason, $folder) = ("trash", @_);
    my $line = (caller(0))[2];
    print LOG ("$tag [$line]: $reason\n");
    $item->accept("$folder"."trash");
}

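One detail of the script's %lists loop is worth a sketch. Perl does not guarantee the order of keys %lists, so an address containing "kde-linux" may match the "linux" key first and land in the linux folder rather than kde. A possible alternative, shown here under stated assumptions, is an ordered list of pattern/folder pairs where the first match wins. The @routes table and the folder_for helper are hypothetical names, and only a few entries from %lists are reproduced.

```perl
use strict;
use warnings;

# Ordered [pattern, folder] pairs: the first match wins, so specific
# patterns are listed ahead of general ones. Entries are a subset of
# the %lists hash from the script.
my @routes = (
    [ qr/debian-user/i,  'linux'    ],
    [ qr/kde-linux/i,    'kde'      ],
    [ qr/focus-linux/i,  'security' ],
    [ qr/linux/i,        'linux'    ],
    [ qr/lll\@hotmail/i, 'Lori'     ],
);

# Return the folder for the first pattern that matches any of the
# given header values, or the fallback folder when nothing matches.
# Undefined headers (for example a missing CC: line) are skipped.
sub folder_for {
    my ($fallback, @headers) = @_;
    for my $route (@routes) {
        my ($re, $folder) = @$route;
        return $folder if grep { defined($_) && $_ =~ $re } @headers;
    }
    return $fallback;
}
```

In the script this could be called as $item->accept($folder . folder_for("Bulk", $from, $to, $cc)), which would also fold the final Bulk fallback into the same lookup.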
Procmail Modifications

The following modifications were necessary to my .procmailrc file in order to get this baby rolling. There may be better or more efficient ways to do this, and if so, I welcome the input.

#
# The following will force all messages from Procmail to be logged in
# ~/syslog/procmail
#
LOGFILE=$HOME/syslog/procmail

#
# Turn verbose logging and log abstract off, unless you're the
# wordy type.
#
VERBOSE=off
LOGABSTRACT=off

#
# From the procmailrc man page:
#
# By default, procmail returns an exitcode of zero (success) if it
# successfully delivered the message or if the HOST variable was
# misset and there were no more rcfiles on the command line;
# otherwise it returns failure. Before doing so, procmail examines
# the value of this variable. If it is set to a positive numeric
# value, procmail will instead use that value as its exitcode. If
# this variable is set but empty and TRAP is set, procmail will set
# the exitcode to whatever the TRAP program returns. If this
# variable is not set, procmail will set it shortly before calling
# up the TRAP program.
#
# So, by setting EXITCODE to nothing, we can have procmail return
# whatever exit code our filter.pl script determines is necessary.
#
EXITCODE=

#
# Point to our program to handle all of the filtering. By running
# our program as a TRAP program (see the procmailrc docs for more
# information about this), Procmail will pass the exit code of our
# script to the MTA (sendmail, postfix, exim, etc.) that called
# procmail.
#
TRAP=$HOME/bin/filter.pl

#
# The following is for safety purposes. All email is copied to this
# file, so if something gets lost, you can retrieve it from here.
# Once you're comfortable with your filter.pl, you can remove the
# following two lines.
:0:
$HOME/syslog/mail

At this point, we're happening. A simple fetchmail -d 90 (or whatever), and we're good to go. fetchmail will retrieve the email, Procmail will receive it and invoke the filter.pl script, which will filter the email accordingly.

Conclusion (Or The Itch Has Been Scratched)

I've been running this script for a few weeks now, and SpamAssassin is proving to be very reliable. I'd estimate its accuracy at somewhere in the high 90s percent, and on many days it's 100% accurate. In conjunction with the other filters I've added in the script, all spam that I receive is currently being filtered to the spam.incoming folder.

For me,

$item->accept("$folder"."home") if ($subject =~ /\@exitwound.org/i);

makes far more sense than

:0
* ^Subject:\/.*exitwound
home

or whatever the hell the correct Procmail syntax might be. Who has time for that? Give me a good old Perl script any day. After all, filtering email just shouldn't be that hard.

Replies are listed 'Best First'.
(jcwren) Re: A Beginner's Guide to Using Mail::Audit and Mail::SpamAssassin
by jcwren (Prior) on Dec 19, 2001 at 10:11 UTC

    Excellent tutorial.

    I would like to add a couple of things:

    • Mail::Audit 2.0 is broken. Sooner or later, your inbox will become corrupted. Version 1.11 is stable and has given me no problems; two friends and I have had to back down from 2.0 to 1.11 to solve the inbox corruption problem. You can find Mail-Audit-1.11.tar.gz here (directory) or here (tarball).
    • The tutorial doesn't cover installing the Razor clients. These are necessary if you wish to make use of the Vipul database. This is the coolest part of SpamAssassin, IMHO. An MD5 checksum of the mail is compared against a database of known spam. If it matches, it's automatically tossed. More importantly, as you get spam, you can cause it to be added to the database, which means other people never have to see it. The Razor::Clients package is not on CPAN, but is available here. SpamAssassin automatically makes use of them if they are installed; otherwise it doesn't bother to mention it.
    • It is worth noting that when you are writing filters, once $item->accept() is called, the program ends. No further tests are included. The documentation says this, but it's not obvious at first glance. As such, while the subs in the example never return, it looks a little funky if you know this.
    • You can use the .procmailrc file, or you can use the .forward file with the format | ~/mailscanner.pl. Note that on certain systems, such as Red Hat, sendmail runs programs under smrsh, the sendmail restricted shell. To make this work, you have to put a symlink to 'mailscanner.pl' (or whatever you called your client) in /etc/smrsh. If you get a lot of mail, this avoids the small amount of additional overhead of spinning up procmail, only to pass the mail on.
    • This is a Perl script. As such, when you make a change, you HAVE to 'perl -c mailscanner.pl' before walking away. If the script croaks, the MTA will send a reply to the originator of the email that the mail was undeliverable. When I was using procmail, a borked recipe was annoying, but not a problem. With SpamAssassin, it's much more important to get it right.
    • It's important to put spam in a folder, and not drop it completely. SpamAssassin isn't perfect, nor will your rules be. Mine are tuned pretty well, and rarely let real spam through, but sometimes they kick out good messages, because someone set a priority flag in Outlook and had a few caps in the title. I get mail from a guy in Romania for product support on a C compiler that causes the problem. Frequently, I run tail -f ~/.audit_log in a window somewhere, and keep an eye on what's being rejected as spam. As I see mail from people that I know I'll get again, I adjust the script, or easier, tune the .spamassassin.cf whitelist and blacklist (this file gets created automatically in your home directory the first time SpamAssassin is run).
    • There is an unsaid implication that the Vipul database will catch viruses. This may be the case for some, but it passed a Sircam laden message right on through. I scan the headers for the standard 'Snow White - The Real Story!' and a couple of others. Don't count on Spam::Assassin to protect you. Add your own countermeasures, and use standard anti-viral techniques, especially if you're going to be POP3/IMAP'ing the mail down to a Windows box.
    • procmail has a facility to check if the mail is of a certain size. This is something that's lacking in this package. Each line of the message is an array entry. If you want to know how long it is, you have to iterate over the entire array, summing the length. This ought to be something the package provides as a method. I'm not sure what the implications of binary messages, attachments, etc. are, so unlike my procmail recipes, I don't check for files of certain sizes.
    • After SpamAssassin defangs mail (or rewrites the headers with the word SPAM everywhere), it is not clear at all if a message modified this way can or should be submitted to the Vipul database. I have found no clear answer on this, although I have not pursued it aggressively. My personal policy is to only forward raw un-rewritten mails to Vipul, to make sure the MD5 checksum is for something people will actually get, and not a post-processed version. If someone knows the real answer, I'd like to know.
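
The size check described in the list above can be sketched in a few lines. This assumes the message is available as a reference to an array of lines, which is how the post describes it. The message_size and is_oversized helpers and the 100_000 default limit are hypothetical, not part of Mail::Audit.

```perl
use strict;
use warnings;

# Sum the length of every line in an array reference of message
# lines. For undecoded mail this is effectively a byte count.
sub message_size {
    my ($lines) = @_;
    my $size = 0;
    $size += length($_) for @$lines;
    return $size;
}

# Hypothetical size test; the default limit is only illustrative.
sub is_oversized {
    my ($lines, $limit) = @_;
    $limit = 100_000 unless defined $limit;
    return message_size($lines) > $limit ? 1 : 0;
}
```

With Mail::Audit this might be used as is_oversized($item->body(), 500_000), if body() returns such an array reference in your version. Headers and attachment encoding are not accounted for here.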

    I think that's all the major points of running this. It's a great system, and it has seriously cut back on the crap I see.

    --Chris


      Excellent response. To address some of your points:
      • When I downloaded Mail::Audit approximately 2 weeks ago, v1.11 was the only downloadable version. I did not know v2.0 existed, nor was I aware of the added "inbox corruption" feature. Searching CPAN now shows only v2.0 for download, so props for the heads-up.
      • The Razor::Clients package really needs to be on CPAN. The documentation is extremely sparse on this. Again, props for the URL to get this material. I just installed it and it seems to be working well.
      • My original draft mentioned that, once you $item->accept(), all processing stops. However, on proof-reading and fact-checking, I could not locate this fact in the documentation. (Of course, now that it's too late, it's screaming itself from the page...) But, yes, this makes for very efficient processing, because once you've filed the email, you can forget about it.
      • I agree that all email should be filed and nothing dropped. Email filtering will never be "perfect" because the patterns are always in flux. Sooner or later, something would be lost in the void. That's why I emphasized the default of (in my case) the Bulk folder. If nothing else fits, it defaults to ~/mail/Bulk.
      • Spam::Assassin, implications aside, is not a virus scanner. It's a spam filter. While it may catch some viri, I think it foolhardy to rely upon it for anything other than filtering spam.
      • Again, the documentation is sparse concerning what should be submitted to the Vipul database. I don't know whether they have filters to "un-filter" the SPAM messages. Hopefully someone else will have a more definitive word on this. Until then, I think your suggestion is the right way to go.

      Thanks for the great info, jcwren.

      If things get any worse, I'll have to ask you to stop helping me.

      How does version 2.0 corrupt the Inbox?

      I can't see any reported bugs for version two on rt.cpan.org. It might be worthwhile reporting the bug.

      simon

      Update (Jan-7-02): A bug and patch were submitted yesterday; I am not sure if it is the same one that jcwren speaks of. I'd imagine that the next version will fix this now that Simon is aware of it.

Re: A Beginner's Guide to Using Mail::Audit and Mail::SpamAssassin
by davorg (Chancellor) on Dec 19, 2001 at 14:33 UTC

    Have you looked at Mail::ListDetector? It works in conjunction with Mail::Audit and automatically detects mailing list messages and puts them in the right folder.

    --
    <http://www.dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

That's great if you're talking a mail server... (was Re: A Beginner's Guide to Using Mail::Audit and Mail::SpamAssassin)
by atcroft (Abbot) on Jul 09, 2002 at 02:32 UTC

    Great tutorial, but what if you're one of those people who aren't quite lucky enough to be able to run a mailserver to handle the mail? Is there a good way, such as a client/agent-like program that could go out and do much of the same functionality, deleting the message or leaving it, possibly sending a rejection message? I found several good modules for filtering, but it seems most all of them require local delivery....

    Any suggestions? guidance? hints?

    Update: hossman, thank you for responding. My experience with fetchmail has been to download mail to a local machine that could act like a mailserver, even if it wasn't visible to the outside world, although I may be mistaken in assuming it was limited only to that functionality.... I was looking for something that could possibly even work for those in windows or mac who don't have the capability, but will look again at the tutorial and docs. Again, my thanks for your reply and attention.

    Update: FoxtrotUniform, your response does help. I was not aware of that aspect of Mail::Audit, although I knew about several of the POP3-related modules. My idea was more along the lines of a stand-alone agent that would only delete, delete w/ a rejection, or leave in place for acceptance that could run at intervals... if that would be a useful idea....

      I'm not sure exactly how much this helps you, but Mail::Audit takes its input on STDIN, and the two usual ways for remote clients to talk to mail servers, IMAP and POP3, are represented in CPAN. It shouldn't be difficult to write a Perl script to grab your mail and run it through whatever you please... the difficulty, to my mind, would be getting your mailreader to play nicely with Mail::Audit, which AFAIK only writes to various Unixy formats of mailbox.

      Update: You should be able to download a copy of each message via IMAP, filter it through Mail::Audit and/or friends, and based on the results keep it, move it around, or delete it, also via IMAP (or more specifically, via Mail::IMAPClient).

      --
      The hell with paco, vote for Erudil!
      :wq

      Nothing I saw in the tutorial requires you to be running a mail server, the author even mentions using fetchmail, which is a POP client for pulling email off of a remote server. You can configure fetchmail to POP mail on regular intervals and pass it to any filtering program you want.
Re: A Beginner's Guide to Using Mail::Audit and Mail::SpamAssassin
by Aristotle (Chancellor) on Jul 09, 2002 at 15:40 UTC
    One thing to note: I recommend making extensive use of eval wrapping blocks when using Mail::Audit. This way, you can catch any mishaps and still deliver the mail to a standard inbox. If you're not going through procmail, on some mailservers a failing Mail::Audit filter (I had poorly tested mine for whether the permissions allowed the smtpd to run it, f.ex.) will result in the mail making a trip to the bit bucket - not what you want. It shouldn't be necessary in an ideal world to do so, but generously sprinkling evals followed by $mail->accept($default_inbox); all over the place will protect your mail from boneheaded Monday morning mishaps and won't do any harm in other cases.
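
A minimal sketch of the eval wrapping suggested above, written as a small helper so the pattern is visible on its own. The filter_with_fallback name and its code-reference arguments are hypothetical. In a real filter the fallback would be the $mail->accept($default_inbox) call mentioned in the post.

```perl
use strict;
use warnings;

# Run the filtering rules inside eval. If they die, hand the error
# to a fallback routine (for example, delivery to a default inbox)
# instead of letting the message be lost. Both arguments are code
# references supplied by the caller.
sub filter_with_fallback {
    my ($rules, $fallback) = @_;
    my $ok = eval { $rules->(); 1 };
    return 'filtered' if $ok;
    my $error = $@ || 'unknown error';
    $fallback->($error);
    return 'fallback';
}

# Hypothetical use, with $mail and $default_inbox set up elsewhere:
#   filter_with_fallback(
#       sub { run_my_rules($mail) },
#       sub { $mail->accept($default_inbox) },
#   );
```

Because accept() ends the program on successful delivery, the helper's return value only matters when the rules finish without delivering anything.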

    Makeshifts last the longest.

Re: A Beginner's Guide to Using Mail::Audit and Mail::SpamAssassin
by thecap (Initiate) on Feb 02, 2004 at 21:50 UTC
    SpamAssassin is dropping support for Mail::Audit. :-(

    I think we will need to change the way we call SA from Mail::Audit scripts to something like the following UNTESTED code:

    # Catch spam
    my $spamtest = Mail::SpamAssassin->new( );
    my $status = $spamtest->check($mail);
    my $sa_mail_obj = $status->rewrite_mail ();
    # Fix needed with SA 2.70
    $mail = Mail::Audit->new( data => $sa_mail_obj->get_pristine() );
    if( $status->is_spam() ) {
        $mail->accept(MAIL_DIR.'Lists/JunkSpamAssassin');
    }

    updated 2004-02-04

    The author of Mail::Audit stopped supporting it and released Email::Filter.

Re: A Beginner's Guide to Using Mail::Audit and Mail::SpamAssassin
by Anonymous Monk on Mar 18, 2002 at 16:04 UTC
    I agree that Perl-based filtering is far nicer than procmail's horrid recipes, which is why I wrote my own Perl filter, parp. (Please note that I am currently making some small changes to the filtering API which will significantly simplify it.)

    I am obviously horribly biased, but while Mail::Audit is a pretty decent quick hack, there are a number of things I dislike about its approach, and a number of advantages that parp has over it (daemon/queue mode, folder filtering mode, more powerful OO-based design, automatic regression testing, dynamic learning of trusted addresses, and lots more). I am also intending to provide support for Spam::Assassin in the near future (although I'm still not convinced that SpamAssassin's statistical approach is as flexible in principle as parp's builtin spam-detection heuristics).