Parse Email

by Anonymous Monk
on Jun 09, 2002

I need to write some code that will parse e-mails, and I wanted to get some input on what the best approach is.

Here's the scenario. I will have lots of e-mails coming to my address and need to extract the body information from the e-mails and dump the info into a database. I know how to parse information and interact with databases so that is not a problem.

I'm wondering what the best approach is for giving my parsing script access to the e-mail. One way I see is to simply use Net::POP3 to check my e-mail and then parse each e-mail message accordingly. However, I'm not sure how efficient this would be if the volume of e-mail were large, since it would require constantly polling the inbox with Net::POP3. Would it be faster to have the e-mails dumped into a file on my computer and then have my script parse the file?

Could anybody suggest a method for "linking" my script with my emails? Perhaps there are some programs or modules already out there.


Re: Parse Email
by Aristotle (Chancellor) on Jun 09, 2002 at 20:13 UTC
    You could have procmail(1) feed all or only certain incoming mails to your script - that would work without any inherent polling. Better yet, you could replace procmail with Mail::Audit and plug your own stuff right into that. The latter is probably the best solution; if you need some extra help with your mail mangling (unlikely), check out the myriad of Mail::* modules.

    Makeshifts last the longest.

      I was reviewing Mail::Audit and it sounds like it could fit the bill. It includes a program called popread. If I understand correctly, this polls my POP mailbox and feeds my e-mail to Mail::Audit, right?

      Also, in the future, if I wanted to set up my own mailserver, would it be easy to pipe the e-mail from the mailserver directly to Mail::Audit?

        Yes and yes. The latter is actually Mail::Audit's raison d'être.

        Makeshifts last the longest.
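
A minimal sketch of the kind of filter script that procmail or a mail server could pipe a message into, assuming the whole message arrives on standard input. It uses only core Perl rather than Mail::Audit, and the names `parse_message` and `read_message` and the sample message are illustrative, not from the thread. It only pulls out headers and the raw body, which matches the "From plus some body text, no attachments" case described in this thread; MIME decoding is out of scope.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one raw message into a hash of headers (lower-cased names)
# and the body as a single string. If a header is repeated, the last
# occurrence wins, which is enough for From and Subject.
sub parse_message {
    my ($raw) = @_;
    $raw =~ s/\r\n/\n/g;                        # normalise line endings
    my ($head, $body) = split /\n\n/, $raw, 2;  # headers end at the first blank line
    $head = '' unless defined $head;
    $body = '' unless defined $body;
    $head =~ s/\n[ \t]+/ /g;                    # unfold continuation lines

    my %header;
    for my $line (split /\n/, $head) {
        # An mbox-style "From sender date" envelope line has no colon
        # after the name, so it is skipped here.
        next unless $line =~ /^([^:\s]+):\s*(.*)$/;
        $header{lc $1} = $2;
    }
    return (\%header, $body);
}

# Read a whole message from a filehandle and parse it.
# In a script run from a pipe, the handle would be \*STDIN.
sub read_message {
    my ($fh) = @_;
    local $/;                                   # slurp mode
    my $raw = <$fh>;
    return parse_message(defined $raw ? $raw : '');
}

# Demonstration: an in-memory filehandle stands in for STDIN.
my $sample = "From: alice\@example.com\nSubject: order\n\tupdate\n\nitem=42\n";
open my $fh, '<', \$sample or die "Cannot open sample: $!";
my ($header, $body) = read_message($fh);
print "From: $header->{from}\n";
print "Body: $body";
```

In a real filter the demonstration lines would be replaced by `my ($header, $body) = read_message(\*STDIN);` followed by the database insert.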

Re: Parse Email
by Corion (Pope) on Jun 09, 2002 at 20:14 UTC

    To me it seems you want to do the following steps :

    1. Fetch email from POP3 server (done via Net::POP3)
    2. Parse email, extract headers, divide it into attachments (to be determined)
    3. Stuff the data into the database (presumably via DBI)

    I work with Mail::Box to glue together some local mail delivery, and Mail::Box has some nice features to parse raw email text into header, body and attachments - I recommend you take a look at it. If you want to do the parsing yourself, you should look at how Mail::Box parses emails, and you will need many of the prerequisites of Mail::Box anyway :-).

    As for how to get at the emails themselves, I consider polling the POP3 server every 2 to 5 minutes often enough - if you want faster delivery, you will need to set up your own mailserver and have the mail delivered to your host directly.
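
The polling approach can be sketched roughly as follows, assuming a handler callback that does the parsing and database work and returns true on success. The names `drain_mailbox` and `poll_forever`, the 30-second timeout, and the choice to delete a message only after it has been handled are assumptions made for illustration; the Net::POP3 calls used are `new`, `login`, `list`, `get`, `delete` and `quit`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fetch every message currently in the mailbox, pass its raw text to
# $handler, and mark it for deletion only if the handler returns true.
# $pop3 is any object that provides Net::POP3's list/get/delete methods.
# Returns the number of messages handled successfully.
sub drain_mailbox {
    my ($pop3, $handler) = @_;
    my $list = $pop3->list or return 0;         # hashref: message number => size
    my $done = 0;
    for my $num (sort { $a <=> $b } keys %$list) {
        my $lines = $pop3->get($num) or next;   # arrayref of message lines
        if ($handler->(join '', @$lines)) {
            $pop3->delete($num);                # takes effect when the session quits
            $done++;
        }
    }
    return $done;
}

# Poll the server every $interval seconds. Defined here but not called;
# host, user and password are supplied by the caller.
sub poll_forever {
    my ($host, $user, $pass, $interval, $handler) = @_;
    require Net::POP3;
    while (1) {
        my $pop3 = Net::POP3->new($host, Timeout => 30);
        if ($pop3 && defined $pop3->login($user, $pass)) {
            drain_mailbox($pop3, $handler);
            $pop3->quit;                        # commits the deletions
        } else {
            warn "Could not connect or log in to $host\n";
        }
        sleep $interval;
    }
}
```

Opening a fresh connection for each poll keeps the loop simple and means a dropped connection only costs one cycle. A call such as `poll_forever($host, $user, $pass, 120, \&store_in_database)` would check every two minutes, where `store_in_database` is a hypothetical handler.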

      Yes, those three steps accurately describe what I want to do. However, when I say parse e-mail, I mean at a very high level. I just want to extract the From information and some information in the body of the message. The e-mails won't have any attachments. It sounds like I could use Mail::Box to cleanly extract the body of the message and then analyze the content on my own.

      How do you pass messages to Mail::Box, assuming I use Net::POP3?
      Thinking of the future, if I were to set up my own mailserver, is it easy to pass the messages to Mail::Box?

        Here is what I'm currently toying with - it's by no means complete or tested, but it takes messages from my POP3 server and stuffs them into a local maildir directory - the "local maildir" part would have to be implemented by you, as you wouldn't store your messages in one of the formats provided by Mail::Box:

        #!/usr/bin/perl -w
        # Some vestige of local delivery
        # For another method, have a look at Mail::LocalDelivery
        use strict;
        use Net::POP3;
        use Mail::Box;
        use Mail::Box::Manager;
        use Mail::Message;
        use Mail::Message::Construct;

        # Use ~/.netrc to determine pop3 login and password
        use vars qw($host $localmailbase $foldername);
        use vars qw($pop3);
        use vars qw($mgr $folder);

        $host = '';
        $localmailbase = "/home/corion/mail/";
        $foldername = "informatik";

        $mgr = Mail::Box::Manager->new(folderdir => $localmailbase,
                                       default_folder_type => 'maildir',
                                      );
        $folder = $mgr->open( folder => $foldername, access => 'rw', create => 1 );
        die "Couldn't open mailfolder '$foldername' : $!\n" unless $folder;
        print "Using folder ",$folder->name,"\n";

        my %messageIDs;
        %messageIDs = map { $_->get("Message-ID"), $_ } ($folder->messages);

        $pop3 = Net::POP3->new($host);
        my $total = $pop3->login;
        die "Error logging into $host" unless defined $total;

        # No news is good news
        exit unless $total > 0;

        my $firstunread = $pop3->last() +1;
        for my $message ($firstunread..$total) {
          print "Message $message\n";
          my $message_lines = $pop3->get($message);
          #local *F;
          #open F, "< inmail" or die "Couldn't read test message 'inmail' : $!\n";
          #my $message_store = Mail::Message->read(\*F);
          my $message_store = Mail::Message->read($message_lines);
          if ($message_store) {
            if (defined $messageIDs{$message_store->get("Message-ID")}) {
              print "Duplicate, rejected ",$message_store->get("subject"),"\n";
            } else {
              print "Subject:",$message_store->get("subject"),"\n";
              $folder->addMessage( $message_store );
            };
          };
          #close F;
        };

        die "Couldn't save '$foldername' : $!\n" unless $folder = $folder->write();
        $mgr->close($folder);
