PerlMonks  

recommended storage format for email messages?

by perl5ever (Pilgrim)
on Jul 14, 2009 at 15:48 UTC ( [id://779963]=perlquestion )

perl5ever has asked for the wisdom of the Perl Monks concerning the following question:

Hi fellow Monks!

I am saving email messages that I fetch from an IMAP server to files. Because the messages are small, and to save disk space, I want to store many messages (hundreds) in a single file.

I'm looking for a good storage format to use, i.e. one that allows for efficient and reliable parsing. The messages will not be used with a mail reader, so there is no need to use any 'mailbox' formats.

The storage format only needs to be efficient for sequential traversal, and, of course, it would be nice if there were a Perl module which already implemented the format. Having it be a 'plain text' format (so one can use command line tools like 'grep') would be a bonus, too.

Any suggestions?

Replies are listed 'Best First'.
Re: recommended storage format for email messages?
by davorg (Chancellor) on Jul 14, 2009 at 16:24 UTC

    Well, mbox is a pretty simple plain text format for storing mail. I know that you don't have a need for "mailbox" formats, but I can't think of a simpler way to store mail messages.

    It ticks your other boxes too. You can use grep on it, and there are Perl modules for dealing with it too.

    --

    See the Copyright notice on my home node.

    Perl training courses

Re: recommended storage format for email messages?
by afoken (Chancellor) on Jul 14, 2009 at 19:29 UTC

    I use IMAPdir, which extends Maildir++ with a folder hierarchy. Maildir++ inherits from maildir and adds a quota system. All of these give you one file per e-mail, all without needing locks, NFS-safe, and without any modification to the e-mail. You can parse the files with exactly the same tools that you use to parse an e-mail fetched from the net. And yes, you can use grep and all other text processing tools on the files in the maildir/Maildir++/IMAPdir folders. Sequential access is no problem, just use readdir() or File::Find to iterate over the directory.
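
The readdir() approach mentioned above could look roughly like the following. This is only a minimal sketch using core Perl: the sub names `maildir_messages` and `slurp_message` are made up for illustration, and it assumes the plain maildir layout with `new` and `cur` subdirectories (it does not descend into Maildir++ or IMAPdir subfolders).

```perl
use strict;
use warnings;
use File::Spec;

# Return the paths of all message files in a maildir.
# Assumes the standard maildir layout with 'new' and 'cur' subdirectories;
# files are listed per subdirectory, sorted by name.
sub maildir_messages {
    my ($maildir) = @_;
    my @paths;
    for my $sub (qw(new cur)) {
        my $dir = File::Spec->catdir($maildir, $sub);
        opendir(my $dh, $dir) or next;     # skip a missing subdirectory
        my @names = readdir $dh;
        closedir $dh;
        for my $name (sort @names) {
            next if $name =~ /^\./;        # skip '.', '..' and dotfiles
            my $path = File::Spec->catfile($dir, $name);
            push @paths, $path if -f $path;
        }
    }
    return @paths;
}

# Read one message file into a string, byte for byte.
sub slurp_message {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    local $/;                              # slurp mode
    my $msg = <$fh>;
    close $fh;
    return $msg;
}
```

A sequential pass is then a loop such as `for my $p (maildir_messages("$ENV{HOME}/Maildir")) { my $msg = slurp_message($p); ... }`, where the Maildir path is just an example location.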

    Storing several hundred files in an ext3 filesystem is no problem. With 100_000 files, things begin to look different. It works, but ext3 does not like it and slows down. ReiserFS is said to be faster in that case, but I've never tested it.

    I've used the de-facto standard mbox format since the days of Netscape Communicator, but it became slow as hell when the mailboxes filled up. One day, I gave the IMAPdir format a try, split all mailboxes into the IMAPdir format, switched my IMAP daemon from pine's to bincimap, and found that it was much faster.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: recommended storage format for email messages?
by hangon (Deacon) on Jul 14, 2009 at 21:18 UTC

    A quick and dirty method I've used is to add a fixed-length ASCII field before each message. The field contains the message length in bytes. Then concatenate the messages, each with its length field, into a single file. To traverse, use read to get the message length, then either read to grab the entire message or seek to skip to the next one. The format is fairly trivial to code as long as your file will be used read-only. Note that it may not be portable to a different OS.

        1148 Return-Path: <user@foo.com>
        Received: from someone@wherever.com
        To: me@email.com
        Subject: file format
        Date: Tue, 14 Jul 2009 16:00:18 -0400

        email message body bla bla bla
        etc ... etc ... etc ...
        729 Return-Path: <user@foo.com>
        Received: ...
        Date: Tue, 14 Jul 2009 16:00:18 -0400

        another email bla bla bla
        etc ...
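
A length-prefixed file along these lines might be written and traversed as in the sketch below. The details are assumptions, not hangon's actual code: the length field is taken to be 10 zero-padded decimal digits followed by a newline, and the names `write_message` and `next_message` are invented for the example. The filehandles are opened in `:raw` mode so that byte counts are not disturbed by newline translation, which is presumably the portability concern mentioned above. Messages are expected to be byte strings, since `length` would otherwise count characters rather than bytes.

```perl
use strict;
use warnings;

my $LEN_WIDTH = 10;   # assumed width of the ASCII length field, in bytes

# Append one message to an open binary filehandle: a zero-padded decimal
# byte count, a newline, then the message bytes themselves.
sub write_message {
    my ($fh, $msg) = @_;
    printf {$fh} "%0${LEN_WIDTH}d\n", length($msg);
    print {$fh} $msg;
    return;
}

# Read the next message from the filehandle; returns undef at end of file.
# With a true $skip, seek past the body instead of reading it and return
# the message length.
sub next_message {
    my ($fh, $skip) = @_;
    my $got = read($fh, my $field, $LEN_WIDTH + 1);
    return unless $got;                              # end of file
    die "Truncated length field\n" unless $got == $LEN_WIDTH + 1;
    die "Bad length field\n" unless $field =~ /^(\d+)\n\z/;
    my $len = $1 + 0;
    if ($skip) {
        seek($fh, $len, 1) or die "seek failed: $!"; # 1 = relative to current position
        return $len;
    }
    my $n = read($fh, my $msg, $len);
    die "Truncated message\n" unless defined $n && $n == $len;
    return $msg;
}
```

Sequential traversal is then `while (defined(my $msg = next_message($in))) { ... }` on a handle opened with `open(my $in, '<:raw', $file)`. Because the length field is plain ASCII on its own line, grep still works on the file, although the counts will show up among the matches.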
Re: recommended storage format for email messages?
by JavaFan (Canon) on Jul 14, 2009 at 16:04 UTC
    Well, I was going to suggest the 'mbox' format - which is 'plain text', and has a battery of Perl modules able to parse it, but it seems you want to dismiss that format.
      The main reason I don't like the mbox format is that it does not support an efficient way to skip over messages in the file.

      All I'm looking for is a storage format that is line-based and has a header before each message indicating the message's size.

      I could invent such a format, but I was wondering if one already existed.

        Sounds like you just need to generate an index, then, with the starting line of each of the messages. If you kept it separate, then normal mbox-reading programs could use it, and you'd have your alternate way of skipping to messages.
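
The separate-index idea could be sketched as follows. Note that this version records byte offsets rather than starting line numbers, because seek() works on bytes. It also assumes that a message starts at a line beginning with "From " which is either the first line of the file or follows an empty line; real mbox variants differ in their quoting rules. The sub names `build_mbox_index` and `fetch_message` are hypothetical.

```perl
use strict;
use warnings;

# Scan an mbox file once and return a reference to the list of byte
# offsets at which each message starts.
sub build_mbox_index {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my @offsets;
    my $prev_blank = 1;                   # start of file counts as a boundary
    my $pos = 0;                          # offset of the line about to be read
    while (my $line = <$fh>) {
        push @offsets, $pos if $prev_blank && $line =~ /^From /;
        $prev_blank = ($line =~ /^\r?\n\z/) ? 1 : 0;
        $pos = tell($fh);
    }
    close $fh;
    return \@offsets;
}

# Fetch message number $i (0-based) using the index, without reading
# any of the messages before it.
sub fetch_message {
    my ($path, $index, $i) = @_;
    die "No message $i\n" if $i < 0 || $i > $#{$index};
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $start = $index->[$i];
    my $end   = ($i < $#{$index}) ? $index->[$i + 1] : (-s $fh);
    seek($fh, $start, 0) or die "seek failed: $!";
    my $n = read($fh, my $msg, $end - $start);
    die "Short read\n" unless defined $n && $n == $end - $start;
    close $fh;
    return $msg;
}
```

The offsets could be saved to a side file, one number per line, so that the mbox itself stays untouched and readable by ordinary mbox tools. The index has to be rebuilt whenever the mbox changes.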

Re: recommended storage format for email messages?
by Bloodnok (Vicar) on Jul 14, 2009 at 16:22 UTC
    Your requirement that ...there is no need to use any 'mailbox' formats... is, AFAIK, of no consequence since the single file storage format is a function of sendmail - mail readers know that the e-mails in the mailbox file are stored as sequential ASCII text blocks - at least on *NIX anyway.

    I don't know and care less about the format used by Lookout (errrm, Outlook)/Exchange - but I dare bet it's in some weird proprietary binary format...

    A user level that continues to overstate my experience :-))

Node Type: perlquestion [id://779963]
Approved by Bloodnok