PerlMonks  

Parsing email files, what are the best modules ?

by peterr (Scribe)
on Nov 10, 2003 at 05:14 UTC (#305793)
peterr has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am finding my email distribution lists (Pegasus Mail) are not being kept up to date properly, and need to do the following:

1. Open the distribution list file (D:\Pmail\mail\list25B6.pml), and read it into an array. The file looks like this and has CR/LF's.

    \TITLE Email Distribution
    \SENDER Peter Rabbitt <peterr@example.com>
    \NOSIG Y
    John & Jenny Arnold <johnarnold@somedomain.com>
    David & Jan Barker <jbarker@someotherdomain.org>
    etc,etc

There are about 300 records in the file.

2. Open the list of all the email folders, this is stored in a file called hierarch.pm, format as follows:

    2,1,"70937460:My mailbox","","My mailbox"
    0,0,"750EE41A:1A94:FOL07093","70937460:My mailbox","Main"
    0,0,"6670FCDB:1A95:FOL04816","70937460:My mailbox","Sent"
    1,1,"0E457462:Developer","70937460:My mailbox","Developer"
    0,0,"46FBE5F3:1A97:FOL024EB","0E457462:Developer","Microsoft"

This file also has CR/LF's. Any record starting with "0,0" is an email folder record; all the others are just logical representations of the email hierarchy (trays, etc). It is the email folder records that tell me the filename. For example, the second record shown above is the "Main" folder, but at the DOS level it is the file FOL07093.PMM.

3. I then need to parse through every email folder, and look for any email headers with "From:" or "To:" , extract the email address, and if it isn't in the array from step 1, then put the details out to a file (or an array would be wiser, to check for dups as I parse through, then at the end, write out a file).

Some "To:" records could have multiple email addresses I guess, so I need to cater for that.
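Step 3 and the multi-address case can be sketched with core Perl only. This is a minimal illustration, not a full address parser: the pattern is deliberately loose and will miss or mangle unusual addresses, and the function names `extract_addresses` and `new_addresses` are made up for this example.

```perl
use strict;
use warnings;

# Pull every address-like token out of one header value.
# This is a loose pattern, not an RFC-compliant address parser.
sub extract_addresses {
    my ($value) = @_;
    return () unless defined $value;
    my @found = $value =~ /([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)/g;
    return map { lc($_) } @found;
}

# Return the addresses that are not in the known list, without duplicates.
sub new_addresses {
    my ($known, @header_values) = @_;
    my %seen = map { lc($_) => 1 } @$known;
    my @new;
    for my $value (@header_values) {
        for my $addr (extract_addresses($value)) {
            push @new, $addr unless $seen{$addr}++;
        }
    }
    return @new;
}

# Sample data modelled on the question; real values would come from the folders.
my @known   = ('johnarnold@somedomain.com', 'jbarker@someotherdomain.org');
my @headers = (
    'John & Jenny Arnold <johnarnold@somedomain.com>, New Person <new.person@example.net>',
    'new.person@example.net, Another One <another@example.org>',
);
print "$_\n" for new_addresses(\@known, @headers);
```

With the sample data, only the two addresses missing from the known list are printed, each once. Keeping the "seen" hash keyed on lower-cased addresses is what handles both the distribution-list check and the duplicate check in one place.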

I have looked through examples and documentation on MIME::Parser, MIME::Tools and Mail::Address, but don't know which modules would be best suited for this type of work. I'm running it on a Win32 box, with ActiveState Perl, version 5.6.1.

I'm a newbie to Perl, so please be nice to me. :)

Thanks,

Peter

Re: Parsing email files, what are the best modules ?
by Roger (Parson) on Nov 10, 2003 at 06:58 UTC
    You could have a look at the Mail::Box module from CPAN. I assume that the mail folders are in the format of unix mboxes, ascii-mode, line-by-line.

    I have started on a simple perl app to do what you have described...
        #!C:\Perl\bin\Perl.exe -w
        use strict;
        use IO::File;
        use Data::Dumper;
        use Mail::Box;

        # Load mail list
        my $MailList = load_mail_list('./list25B6.txt');
        print Dumper($MailList);

        # Load folder list
        my $MailFolder = load_mail_folders('./hierarch.txt');
        print Dumper($MailFolder);

        # Parse folder files
        foreach (values %{$MailFolder}) {
            parse_mail_folder($_);
        }

        sub parse_mail_folder {
            # to be completed when I get back home...
        }

        sub load_mail_list {
            my $filename = shift;
            my $f = IO::File->new($filename, "r")
                or die "Can not open mail list: $!";
            my %mlist;

            # load the header (the first three lines of the file)
            chomp($mlist{title}  = <$f>);
            chomp($mlist{sender} = <$f>);
            chomp($mlist{nosig}  = <$f>);

            # load the rest of the email addresses, keyed by address
            my %MailAddress;
            while (<$f>) {
                chomp;
                my ($name, $email) = /^(.*)\s+<(.*)>$/;
                next unless defined $email and $email ne '';
                $MailAddress{$email} = $name;
            }
            $mlist{mlist} = \%MailAddress;
            return \%mlist;
        }

        sub load_mail_folders {
            my $filename = shift;
            my $f = IO::File->new($filename, "r")
                or die "Can not open mail folder list: $!";
            my %mbox;
            while (<$f>) {
                chomp;
                next unless ( $_ ne '' and m/^0,0,/ );  # folder records only
                s/"//g;
                my @fld = split /,/;
                my $folder = (split /:/, $fld[2])[2];   # capture 3rd field
                # full path to mboxes (the question gives the extension
                # as .PMM; adjust if needed)
                $mbox{$fld[-1]} = "D:/Pmail/mail/$folder.PPM";
            }
            return \%mbox;
        }
    And the output so far...
        $VAR1 = {
                  'title' => '\\TITLE Email Distribution',
                  'nosig' => '\\NOSIG Y',
                  'mlist' => {
                               'jbarker@someotherdomain.org' => 'David & Jan Barker',
                               'johnarnold@somedomain.com' => 'John & Jenny Arnold'
                             },
                  'sender' => '\\SENDER Peter Rabbitt <peterr@example.com>'
                };
        $VAR1 = {
                  'Main' => 'D:/Pmail/mail/FOL07093.PPM',
                  'Microsoft' => 'D:/Pmail/mail/FOL024EB.PPM',
                  'Sent' => 'D:/Pmail/mail/FOL04816.PPM'
                };
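The folder-list loader above strips the quotes and then splits on commas, which would misread a folder name that itself contains a comma. As a hedged alternative, the sketch below uses the core Text::ParseWords module to split each record while respecting the quotes. The function name `folder_files` is made up for this example, and the `.PMM` extension is taken from the question rather than from the code above.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Map folder display names to file names, using only the "0,0" records.
sub folder_files {
    my @lines = @_;
    my %file_for;
    for my $line (@lines) {
        $line =~ s/\r?\n\z//;                 # tolerate CR/LF and LF endings
        next unless $line =~ /^0,0,/;         # folder records only
        my @fld = parse_line(',', 0, $line);  # split on commas, honour quotes
        next unless @fld >= 5;                # skip malformed records
        my $base = (split /:/, $fld[2])[-1];  # last part, e.g. FOL07093
        $file_for{ $fld[4] } = "$base.PMM";   # extension as given in the question
    }
    return %file_for;
}

# Sample records copied from the question; a real script would read hierarch.pm.
my @records = (
    '2,1,"70937460:My mailbox","","My mailbox"',
    '0,0,"750EE41A:1A94:FOL07093","70937460:My mailbox","Main"',
    '0,0,"6670FCDB:1A95:FOL04816","70937460:My mailbox","Sent"',
    '1,1,"0E457462:Developer","70937460:My mailbox","Developer"',
    '0,0,"46FBE5F3:1A97:FOL024EB","0E457462:Developer","Microsoft"',
);
my %file_for = folder_files(@records);
print "$_ => $file_for{$_}\n" for sort keys %file_for;
```

Keying the hash on the display name assumes that no two folders share a name. If the same name can appear in different trays, keying on the file name instead would be safer.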
      Hi Roger,

      Many thanks for that example you posted.

      I assume that the mail folders are in the format of unix mboxes, ascii-mode, line-by-line.

      Yes, they are ascii-mode, with CR/LF's. When I tried the script, I got this message:

          D:\Perl\myscripts>\perl\bin\perl.exe checke~1.pl
          Can't locate Mail/Box.pm in @INC (@INC contains: D:/Perl/lib D:/Perl/site/lib .) at checke~1.pl line 5.
          BEGIN failed--compilation aborted at checke~1.pl line 5.

      I then checked for Mail::Box, and it didn't appear to be part of the ActiveState Perl I have installed. I then used PPM (version 3.0.1) and ran an "install Mail::Box" command. It took about 10 minutes and reported that everything was okay, but the same error message appeared.

      The PPM search for Mail::Box displayed

          ppm> search Mail::Box
          Searching in Active Repositories
            1. Mail-Box-Parser-C [3.003] C parser for Mail::Box

      and that is all that was installed, just a file called "C.pm" in the folder D:\Perl\site\lib\Mail\Box\. I did download the file http://perl.overmeer.net/mailbox/source/source-current.tar.gz, and there is a Box.pm in that file, but I don't know where to put it. No doubt I should somehow reference the 'tar' file in PPM for the install? When I do a SET command at DOS, there are no environment variables for Perl.

      I'm trying to read up more on the documentation also.

      Many thanks,

      Peter

        Hi Peter, you could read the PPM documentation, as the Anonymous Monk has suggested. You will probably also need Mail::Box and all of its derived modules.

        I have completed the code I started earlier. The additional code is an example of the kind of thing you could do with the Mail::Box::Manager module. Pretty handy I think.
            #!C:\Perl\bin\Perl.exe -w
            use strict;
            use IO::File;
            use Data::Dumper;
            use Mail::Box;
            use Mail::Box::Manager;

            # Load mail list
            my $MailList = load_mail_list('./list25B6.txt');
            print Dumper($MailList);

            # Load folder list
            my $MailFolder = load_mail_folders('./hierarch.txt');
            print Dumper($MailFolder);

            # Parse folder files
            foreach (values %{$MailFolder}) {
                parse_mail_folder($_);
            }

            # Optionally output $MailList into another file, etc.
            # And other things ...
            exit(0);

            sub parse_mail_folder {
                my $folder_file = shift;
                my $mgr    = Mail::Box::Manager->new();
                my $folder = $mgr->open($folder_file);
                unless (defined $folder) {
                    warn "Can not open folder $folder_file\n";
                    return;
                }

                foreach my $message ($folder->messages) {
                    my $dest = $message->get('To');     # retrieve the To-address
                    next unless defined $dest;
                    my @email_addr = split /,/, $dest;  # retrieve multiple addresses

                    # assume the email address format is as follows -
                    #
                    #     John & Jenny Arnold <johnarnold@somedomain.com>
                    #
                    # you have to tweak a bit if the format is not as expected
                    # or use the Mail::Address module to do the trick - to
                    # convert the mail address into its canonical form.
                    foreach (@email_addr) {
                        my ($name, $addr) = /(.*)<(.*)>/;
                        next unless defined $addr;
                        $name =~ s/^\s+//;   # trim spaces at front
                        $name =~ s/\s+$//;   # trim spaces at rear
                        $addr =~ s/^\s+//;   # trim spaces at front
                        $addr =~ s/\s+$//;   # trim spaces at rear

                        # the addresses live under the 'mlist' key
                        if (! exists $MailList->{mlist}{$addr}) {
                            # ok, we haven't seen this Email address yet
                            $MailList->{mlist}{$addr} = $name;
                            # and do other things
                        }
                    }
                }
                $folder->close;
            }

            sub load_mail_list {
                my $filename = shift;
                my $f = IO::File->new($filename, "r")
                    or die "Can not open mail list: $!";
                my %mlist;

                # load the header
                chomp($mlist{title}  = <$f>);
                chomp($mlist{sender} = <$f>);
                chomp($mlist{nosig}  = <$f>);
                <$f>;   # skip one more line after the header

                # load the rest of the email addresses, keyed by address
                my %MailAddress;
                while (<$f>) {
                    chomp;
                    my ($name, $email) = /^(.*)\s+<(.*)>$/;
                    next unless defined $email and $email ne '';
                    $MailAddress{$email} = $name;
                }
                $mlist{mlist} = \%MailAddress;
                return \%mlist;
            }

            sub load_mail_folders {
                my $filename = shift;
                my $f = IO::File->new($filename, "r")
                    or die "Can not open mail folder list: $!";
                my %mbox;
                while (<$f>) {
                    chomp;
                    next unless ( $_ ne '' and m/^0,0,/ );  # folder records only
                    s/"//g;
                    my @fld = split /,/;
                    my ($folder) = $fld[2] =~ /.*:.*:(.*)/;
                    next unless defined $folder;
                    # full path to mboxes (the question gives the extension
                    # as .PMM; adjust if needed)
                    $mbox{$fld[-1]} = "D:/Pmail/mail/$folder.PPM";
                }
                return \%mbox;
            }

      I never use Windows, so cannot help you with the installation. However, I know that the tests produce many errors and warnings which can be ignored: the Windows users of MailBox seem unable to help me with real fixes for the tests.

      For your implementation, I advise one of these two approaches: use Mail::Message->build() (look for the details of build in this module by selecting the "methods sorted alphabetically" in the right column).

      The other approach may be much simpler: first reconstruct your data into a MIME compliant message, and then call Mail::Message->read($m).

      More help is available at the mailbox mailing list.

      By the way: best way to parse e-mail addresses from a header line is like this:

         my $msg = Mail::Message->read($data);
         my @addr = $msg->get('To')->addresses;
      
      The addresses are Mail::Address objects, which are relatively smart. Parsing addresses in reality is a very complicated task.
      Mark Overmeer.

        Hi Mark,

        I never use Windows, so cannot help you with the installation. However, I know that the tests produce many errors and warnings which can be ignored: the Windows users of MailBox seem unable to help me with real fixes for the tests.

        I have only previously used Perl on a Linux box, and apart from my ignorance of Perl, there have really been no problems. Using it on Windows has been SO different, lots of problems, but I wanted to also use it on Win to improve my Perl skills, plus I save on bandwidth. There are many ad-hoc things I would have previously done in Clipper, but I can see how much more powerful, for tasks of this nature, Perl is.

        As for getting it all going on Win, fortunately it is all sorted out now. I removed everything, re-installed ActiveState Perl 5.8.0.806, and then tried the sample code that Roger kindly supplied. There were so many error messages (not from Roger's code) that I added the CGI::Carp module to log all the messages out to a file. That was very handy. Then, as I found out the cause (looking in various .pm files), I first attempted to install the missing modules by using PPM. However, with only two standard repositories and the trouble I had with referencing either local or remote 'repositories', I decided the only (best) option for the missing modules was to download the entire ...tar.gz file in each case, read the 'install/readme', then do the actual install. All of them worked fine, so the underlying problem in using the code supplied was not the code itself, but module dependency. The modules I had to install manually (makefile, etc.) were:

        IO

        IO-stringy

        Mail-Box

        MailTools

        TimeDate

        then the Perl script worked. There is only one minor problem, where it is not jumping into a 'foreach' loop, but I _think_ that is because I need to do a bit more research on the 'type' of mailbox being opened. :)
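A quick way to spot this kind of dependency problem up front is to try loading each module and report the ones that fail. The sketch below is only an illustration: `missing_modules` is a made-up helper, and the module picked to represent each distribution in the list above is an assumption made for this example.

```perl
use strict;
use warnings;

# Return the names of the modules that fail to load.
sub missing_modules {
    my @missing;
    for my $module (@_) {
        (my $file = "$module.pm") =~ s{::}{/}g;   # Mail::Box -> Mail/Box.pm
        push @missing, $module unless eval { require $file; 1 };
    }
    return @missing;
}

# One representative module per distribution (an assumption for this sketch):
# IO, IO-stringy, Mail-Box, MailTools, TimeDate.
my @wanted  = qw(IO::File IO::Scalar Mail::Box Mail::Address Date::Parse);
my @missing = missing_modules(@wanted);
print @missing ? "Missing: @missing\n" : "All modules load.\n";

# The directories Perl searches, which is where a manual install must end up.
print "Search path:\n";
print "  $_\n" for @INC;
```

Printing `@INC` also answers where a hand-installed `.pm` file has to live: somewhere under one of the listed directories, with the path matching the module name.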

        For your implementation, I advise one of these two approaches: use Mail::Message->build() (look for the details of build in this module by selecting the "methods sorted alphabetically" in the right column).

        Thanks for that, I did have a look. In the current task/problem, I want to read multiple email boxes and check that my distribution lists are up to date. However, I will certainly come back to "build", because another task is to fix the problems I'm having with Net::SMTP, so possibly I could look at using Mail::Message instead.

        The other approach may be much simpler: first reconstruct your data into a MIME compliant message, and then call Mail::Message->read($m)

        That could be a good solution for this problem, because the mailboxes are not what I would call 'standard', if standard means anything of a *nix flavour. I notice when I go to add a new mailbox under Pegasus mail, there is an option to create it in either:

        Pegasus Mail v2.X

        Unix Mailbox format

        Unfortunately, the default is the first one, so I think the only remaining problem with using Mail::Box on a Win box is for me to get into a hex viewer and see what is in there that is so different from a *nix mailbox. It may be better to just open each mailbox as an ascii file, and then re-create it, for temporary purposes, in Unix format. The other two subs in Roger's code work perfectly ("load_mail_list" and "load_mail_folders"), but I have a feeling the only reason the 'foreach' loop isn't being executed in sub "parse_mail_folder" is that the actual mailbox (Pegasus format) is not "normal". There are all the usual email headers, and many of the folders/mailboxes have multi-part messages in them, but the first record has a lot of extra chars in it. :)
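The "open each mailbox as an ascii file" idea can be sketched without knowing the Pegasus folder layout, by scanning every line for the wanted header names and joining folded continuation lines. This is a rough sketch under that assumption: it does not tell headers from body text, so a body line that happens to start with "To:" would also be picked up. The function name `header_values` is made up for this example.

```perl
use strict;
use warnings;

# Collect the values of the named headers from a filehandle, joining
# continuation lines (lines that start with a space or tab).
sub header_values {
    my ($fh, @names) = @_;
    my $alt = join '|', map { quotemeta($_) } @names;
    my (@values, $current);
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;                  # tolerate CR/LF and LF endings
        if (defined $current && $line =~ /^[ \t]+(\S.*)/) {
            $current .= " $1";                 # folded continuation line
            next;
        }
        push @values, $current if defined $current;
        undef $current;
        $current = $1 if $line =~ /^(?:$alt):\s*(.*)/i;
    }
    push @values, $current if defined $current;
    return @values;
}

# Demo on an in-memory sample; a real run would open a folder file instead.
my $sample = "From: A <a\@example.com>\r\n"
           . "To: B <b\@example.com>,\r\n"
           . "  C <c\@example.com>\r\n"
           . "Subject: hi\r\n"
           . "\r\n"
           . "See you soon.\r\n";
open my $fh, '<', \$sample or die "Cannot open sample: $!";
print "$_\n" for header_values($fh, 'From', 'To');
close $fh;
```

Each returned value is a whole header body, so a "To:" line listing several recipients comes back as one string that still needs splitting into addresses.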

        More help available at the mailbox mailinglist

        Thanks, I will do a little bit more reading on the Mail::Box information in regards to opening mailboxes, and then will ask for help (I did add an "or die" after the open, but the open seems to succeed?)

        By the way: best way to parse e-mail addresses from a header line is like this:

           my $msg = Mail::Message->read($data);
           my @addr = $msg->get('To')->addresses;

        The addresses are Mail::Address objects, which are relatively smart. Parsing addresses in reality is a very complicated task.

        I will try that method out, the code I had been using

           my $mgr = Mail::Box::Manager->new();
           my $folder = $mgr->open($folder_file) or die "Cannot open Folder","\n";
           my @email_addr;
           foreach my $message ($folder->messages) {
               print $message->get('Subject') || '<no subject>', "\n";
               print "into foreach loop","\n";
               # etc,etc

        .. never gets to print the "into foreach loop", but I guess it is the mailbox 'type'. :)

        Thanks for all your help,

        Peter
