I have been using Netscape or Mozilla as my mail client since 1996, and I have built up a message archive with thousands of messages.

The way I have managed this archive, up to now, is to create folders within the Mozilla mail client. I then use the "Search Mail/News Messages" function in the mail client to find specific messages. This is not scaling well. Searches take a long time because each folder is stored as two text files:

I want to design an application that ingests my Mozilla mailbox, separates the messages into rows in a database, and provides much more robust and scalable search capabilities. I am considering using MySQL as the database with Apache and mod_perl as the front end running on my local machine, a Linux laptop.
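A minimal sketch of the ingestion step described above: read an mbox-format folder file and store one row per message. It is written in Python with SQLite standing in for MySQL purely so that it runs without a server; the table layout, the column names, and the helper names `header`, `plain_body` and `ingest` are assumptions made for illustration, not part of the original design.

```python
import mailbox
import sqlite3

# Hypothetical schema: one row per message, headers split into columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id         INTEGER PRIMARY KEY,
    folder     TEXT,
    message_id TEXT,
    sender     TEXT,
    subject    TEXT,
    date       TEXT,
    body       TEXT
)
"""


def header(msg, name):
    """Return a header as a plain string, or None if it is absent."""
    value = msg[name]
    return None if value is None else str(value)


def plain_body(msg):
    """Return the first text/plain part of a message as a string."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            try:
                return payload.decode(charset, errors="replace")
            except LookupError:
                # Unknown charset label in the message; fall back to UTF-8.
                return payload.decode("utf-8", errors="replace")
    return ""


def ingest(mbox_path, conn, folder):
    """Insert every message of one mbox file; return how many were stored."""
    conn.execute(SCHEMA)
    box = mailbox.mbox(mbox_path, create=False)
    count = 0
    try:
        for msg in box:
            conn.execute(
                "INSERT INTO messages "
                "(folder, message_id, sender, subject, date, body) "
                "VALUES (?, ?, ?, ?, ?, ?)",
                (
                    folder,
                    header(msg, "Message-ID"),
                    header(msg, "From"),
                    header(msg, "Subject"),
                    header(msg, "Date"),
                    plain_body(msg),
                ),
            )
            count += 1
    finally:
        box.close()
    conn.commit()
    return count
```

Calling `ingest(path, sqlite3.connect("archive.db"), "Inbox")` once per folder file would fill the table; the same insert statement should carry over to MySQL with a different driver and placeholder style.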

I am not asking for help identifying the Perl modules to parse mail out of a Mozilla mailbox; I think that was covered in a previous question I posted, Netscape/Mozilla Mailbox Processing. But I do wonder whether my fellow monks would mind commenting on:

  1. the general merits of the design idea that I've sketched out
  2. any "gotchas" they see in attempting to store email in a MySQL database, or rendering the body of the message in a dynamically generated web page
  3. practical ways to deal with any attachments included with the mails:
    • copy to a place in the file system, store a reference to the location in the database
    • embed the attachments as BLOBs in the database
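The first attachment option above (copy to the file system, keep a reference in the database) could be sketched roughly as follows. This is an illustration in Python with SQLite standing in for MySQL; the `attachments` table, the function name `save_attachments`, and the choice to name each file after the SHA-256 of its contents (so that identical attachments are stored only once) are all assumptions, not something proposed in the question.

```python
import hashlib
import os
import sqlite3

# Hypothetical table: one row per attachment, pointing at a file on disk.
ATTACHMENT_SCHEMA = """
CREATE TABLE IF NOT EXISTS attachments (
    id           INTEGER PRIMARY KEY,
    message_id   INTEGER,
    filename     TEXT,
    content_type TEXT,
    path         TEXT
)
"""


def save_attachments(msg, message_row_id, conn, attach_dir):
    """Write each named MIME part to attach_dir and record where it went."""
    conn.execute(ATTACHMENT_SCHEMA)
    os.makedirs(attach_dir, exist_ok=True)
    saved = []
    for part in msg.walk():
        filename = part.get_filename()
        if not filename:
            continue  # not an attachment (no filename parameter)
        data = part.get_payload(decode=True) or b""
        # Name the file after its content hash so duplicates share one copy.
        digest = hashlib.sha256(data).hexdigest()
        path = os.path.join(attach_dir, digest)
        if not os.path.exists(path):
            with open(path, "wb") as fh:
                fh.write(data)
        conn.execute(
            "INSERT INTO attachments "
            "(message_id, filename, content_type, path) VALUES (?, ?, ?, ?)",
            (message_row_id, filename, part.get_content_type(), path),
        )
        saved.append(path)
    conn.commit()
    return saved
```

The BLOB alternative would replace the `path` column with a binary column holding `data` directly; that keeps everything in one place for backups, at the cost of a larger database.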
Finally, if anyone knows of an Open Source program that provides 80 percent of this functionality, let me know. So far, I've identified SQmaiL (python) and Gmail (C). Neither is Perl, nor do they seem to be particularly active projects.


Dave Aiello
Chatham Township Data Corporation

Replies are listed 'Best First'.
Re: Managing a Personal Email Archive
by mojotoad (Monsignor) on Mar 06, 2002 at 19:37 UTC
    If you *do* end up rolling your own, check out some of the ideas in the Intertwingle musing by Jamie Zawinski. Lots of good ideas for yanking on a heap of email messages -- many obvious ideas, some not so obvious, but they all make sense once pointed out.


Re: Managing a Personal Email Archive
by Corion (Patriarch) on Mar 06, 2002 at 19:48 UTC

    I don't know about the status of Mark Overmeer's mail program, but he developed the Mail::Box suite to deal with email in its various forms of existence for a mail reader program written in Perl/Tk.

    As for your idea to store all your mails in a database, this can be a curse and a blessing - you lose the tool/skillset associated with Unix, mostly grep and Perl to deal with mailfiles, but you gain (given a good table design) the SQL toolset to search and manipulate your mails.

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
    $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
    ($c = $d->accept())->get_request(); $c->send_response( new #in the
    HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
      I've actually written code with Mail::Box. It was pretty strange getting used to - but not because of the module; because of the mail!

      I had to do something almost identical to what you are doing, but it was for filling out timesheets (dread!). I had it suck in my mail, dump the parts of the header and the body into different fields, and then run that against the tasks I had in bugzilla...

      I can't tell you how great this module is. Go for it! If you want to email me at work, I can try to dig up that code. I never polished it up, so it might be kind of messy... let me know:

      d m c g r e g g o r
      (a t)
      g e n p h y s i c s
      (d o t)
      c o m

Re: Managing a Personal Email Archive
by drewbie (Chaplain) on Mar 06, 2002 at 20:16 UTC
    This doesn't directly answer your question, but have you looked at Pronto!? It is a Gtk/perl email client which uses a DB & DBI for the message store. According to the homepage, it can also import MBOX folders, and export back out to them.

    If you can't use the program, perhaps its DB schema will be useful to you. I used it briefly at a previous job, and it seemed to work fine, although I did end up using another mail program.

      After a little more digging, it appears that Pronto is a fork of CSCMail. The Pronto site says it was forked when CSCMail went to C. The CSCMail FAQ says version 2 will be in C w/o SQL. But the current version 1.6.2 is still in perl/gtk. His reasoning for no DB makes sense to me. So you decide. :-)

      Why do you need the message store to be a SQL database?

        > Why do you need the message store to be a SQL database?

        I thought a SQL database was the answer because I might have 50,000 archived messages today, and I am accumulating non-trivial, non-SPAM mail at a rate of 25,000 messages a year, a rate that is growing by more than 50 percent annually. (Aside: it would be a lot easier to keep statistics on a mail archive if it was in a SQL database.)

        I assume that performance of the search function would be better with a SQL-based archive, and the performance would stay relatively constant as the size of the archive grows.
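To make the statistics and search points concrete, here is a small sketch in Python with SQLite standing in for MySQL. The table, the index names, and the functions `build_archive`, `messages_per_year` and `from_sender` are hypothetical. It shows a per-year message count and a sender lookup that can use an ordinary index.

```python
import sqlite3


def build_archive(conn, rows):
    """Create a hypothetical messages table, load rows, and index it."""
    conn.execute(
        "CREATE TABLE messages ("
        "id INTEGER PRIMARY KEY, sender TEXT, subject TEXT, "
        "sent_on TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO messages (sender, subject, sent_on, body) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    # Ordinary indexes keep exact-match and date-range lookups fast
    # as the table grows.
    conn.execute("CREATE INDEX idx_messages_sender ON messages (sender)")
    conn.execute("CREATE INDEX idx_messages_sent_on ON messages (sent_on)")
    conn.commit()


def messages_per_year(conn):
    """Archive statistics: number of messages per calendar year."""
    return conn.execute(
        "SELECT substr(sent_on, 1, 4) AS year, COUNT(*) "
        "FROM messages GROUP BY year ORDER BY year"
    ).fetchall()


def from_sender(conn, sender):
    """Subjects of all messages from one sender, oldest first."""
    return conn.execute(
        "SELECT subject FROM messages WHERE sender = ? ORDER BY sent_on",
        (sender,),
    ).fetchall()
```

One caveat on the performance assumption: a substring search over message bodies such as `body LIKE '%word%'` cannot use an ordinary index and still reads every row, so body search would need a full-text index (MySQL offers FULLTEXT indexes) to stay fast as the archive grows.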

        Dave Aiello
        Chatham Township Data Corporation

Re: Managing a Personal Email Archive
by mojotoad (Monsignor) on Jul 17, 2002 at 16:44 UTC

    I just came across the Mail::Miner project by Simon Cozens out on CPAN. At first blush it seems to be addressing some of your issues -- perhaps you can join forces if you haven't already coded a solution for your problem.

    Matt (how's that for a delayed response?)