PerlMonks  

Subsetting text files containing e-mails

by PeterCap (Initiate)
on Jan 25, 2012 at 09:18 UTC (#949855)
PeterCap has asked for the wisdom of the Perl Monks concerning the following question:

I have a number of e-mail messages (envelope, headers, body, sometimes including base64-encoded attachments) which are spread out among an arbitrary number of text files. In most cases, there are multiple messages in each text file.

I want to subset each large file so that each e-mail is saved to its own file. So, if the source file is "mails.txt" containing three messages, I want to create (e.g.) mails_000001.txt, mails_000002.txt, and mails_000003.txt.

I had kludged together a bash script that performed this operation thusly:

  1. Using grep -n, note the line numbers of both blank lines (^$) and essential headers (e.g., "From:").
  2. Use those line numbers to determine the beginning of each message (that is, when a line beginning with "From" is discovered, I assume it is within a message, and then find the first blank line before that line).
  3. Subset each message using gawk.

However, I wish to create a perl-only solution for a number of reasons and I can't help but imagine that there is a slicker way to do this with perl. At this stage I'm mainly just looking for a general strategy for doing such an operation rather than full-fledged examples of code, but any assistance is welcome.

Thanks!

Re: Subsetting text files containing e-mails
by GrandFather (Cardinal) on Jan 25, 2012 at 09:52 UTC

    To an extent you can translate bash into Perl. The control structures (if and loops) translate without much trouble. Many of the system utilities that you'd be using in your bash script have Perl equivalents, although they aren't drop-in replacements, and you are right to guess that there are better ways in Perl than the pipelined filter processing technique you likely used with bash.

    With Perl you will tend more to parse through the input file essentially a line at a time to find the "interesting bits" and generate output as you go. The thing is to be able to recognise an interesting bit before you have moved on to the next line. Perl has a neat trick that makes that pretty easy in your case. I apologise in advance for spoilers - the following is most of the solution you need, so it is more than you asked for:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $emailNum;

    $/ = '';    # Set readline to "Paragraph mode"

    while (<DATA>) {
        if (!$emailNum || /^From:/im) {
            ++$emailNum;
            print "---- Email $emailNum\n";
        }

        print;
    }

    __DATA__
    From: here
    To: there
    Data:

    I have a number of e-mail messages (envelope, headers, body, sometimes including
    base64-encoded attachments) which are spread out among an arbitrary number of text files.
    In most cases, there are multiple messages in each text file.

    I want to subset each large file so that each e-mail is saved to its own file.
    So, if the source file is "mails.txt" containing three messages, I want to create
    (e.g.) mails_000001.txt, mails_000002.txt, and mails_000003.txt.

    From: somewhere
    To: elsewhere
    Data:

    A second email

    Prints:

    ---- Email 1
    From: here
    To: there
    Data:

    I have a number of e-mail messages (envelope, headers, body, sometimes including
    base64-encoded attachments) which are spread out among an arbitrary number of text files.
    In most cases, there are multiple messages in each text file.

    I want to subset each large file so that each e-mail is saved to its own file.
    So, if the source file is "mails.txt" containing three messages, I want to create
    (e.g.) mails_000001.txt, mails_000002.txt, and mails_000003.txt.

    ---- Email 2
    From: somewhere
    To: elsewhere
    Data:

    A second email

    See perlvar for a description of what $/ is doing.
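As a small illustration of paragraph mode on its own, here is a minimal sketch that reads from an in-memory string instead of a file. The sample text and the sub name read_paragraphs are made up for the example; the point is only that setting $/ to the empty string makes each read return one blank-line-delimited chunk, with runs of blank lines treated as a single separator.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample text: three chunks separated by one or more blank lines.
my $text = "From: a\nTo: b\n\nbody one\n\n\n\nFrom: c\nTo: d\n";

# Return the records read from $text with $/ set to '' (paragraph mode).
sub read_paragraphs {
    my ($text) = @_;
    local $/ = '';    # paragraph mode; the old value is restored on return
    open my $fh, '<', \$text or die "Can't open in-memory file: $!";
    my @paragraphs = <$fh>;    # one element per blank-line-delimited chunk
    close $fh;
    return @paragraphs;
}

my @paragraphs = read_paragraphs($text);
printf "%d paragraphs\n", scalar @paragraphs;
print "[$_]\n" for map { s/\n+\z//r } @paragraphs;    # trim trailing newlines for display
```

Using local keeps the change to $/ confined to the sub, which matters once the script also reads other files line by line.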

    True laziness is hard work

      Much appreciated! However, I think you're assuming that each e-mail will begin with '^From: ', which is not the case since the SMTP envelope may contain any arbitrary number of lines before then (e.g., 'x-sender:'). I think if I could rely on every e-mail to start with the same sequence--or to have any guaranteed structure--then this might certainly be easier! But from reading the germane RFCs I can't count on that structure necessarily.

      Instead, how I have defined the problem is this:

      1. Find some line that is (almost certainly) going to be in the SMTP envelope (in this case, the only fields I think are almost guaranteed to be there are "From" and "Date").
      2. Find the preceding blank line (since I am almost certain that there are blank lines between each e-mail--of course, there are also blank lines within e-mails as well).

      So, it occurred to me that all I need to do is read through the file line-by-line and keep track of three line numbers:

      • the line number I'm on "right now"
      • the line number of the last blank line observed
      • the line number of the blank line before that one

      I can find and store these in an array and then do a second pass through the file to subset.

      Could I get your opinion on the following? It works, but I am certain it can be improved.

      #!/usr/bin/env perl
      use strict;
      use Getopt::Std;

      my %opts;
      my $FileToHandle;
      getopts('o:', \%opts);
      my $k = 0;

      sub ParseEmail {
          my $FileToProcess = $_[0];
          my @mailBoundaries=();
          my $myLine = 0;
          my $recentBlank = 1;
          my $previousBlank = 1;

          # First pass to find out where to split the file...
          open (FILETOREAD, $FileToProcess) or die "Can't open $FileToProcess: $!\n";
          while (<FILETOREAD>) {
              if (/^$/) {
                  $recentBlank = $myLine+1;
                  print "Recent Blank: $recentBlank\n";
              }
              if (/^From: / && $previousBlank != $recentBlank) {
                  push(@mailBoundaries, $previousBlank);
                  $previousBlank = $recentBlank;
                  print "PreviousBlank: $previousBlank\n";
              }
              if (eof && $previousBlank == 1) {
                  push(@mailBoundaries, $previousBlank);
                  push(@mailBoundaries, $myLine+1);
              } elsif (eof && $previousBlank != 1) {
                  push(@mailBoundaries, $myLine+1);
              }
              $myLine+=1;
              print "My Line: $myLine\n";
          }
          close (FILETOREAD);

          # Second pass to subset the file
          my $i = 0;
          while ($i <= ($#mailBoundaries - 1)) {
              $k+=1;
              my $j = 0;
              open (FILETOREAD, $FileToProcess) or die "Can't open $FileToProcess: $!\n";
              while (<FILETOREAD>) {
                  $j+=1;
                  if ($j >= @mailBoundaries[$i] && $j <= @mailBoundaries[$i+1]) {
                      my $FileNameToWrite = $opts{'o'} . "_" . sprintf("%06d", $k);
                      print "Mail Boundaries:";
                      print map { "$_ \n" } @mailBoundaries;
                      print "\n";
                      print "I am going to print lines @mailBoundaries[$i] to @mailBoundaries[$i+1] from $FileToProcess to $FileNameToWrite.\n";
                      open (FILETOWRITE, ">>$FileNameToWrite") or die "Can't open $FileToProcess: $!\n";
                      print FILETOWRITE $_;
                  }
              }
              close (FILETOREAD);
              $i+=1;
          }
      }

      foreach $FileToHandle (map { glob } @ARGV) {
          ParseEmail($FileToHandle);
      }

      One way to improve it might be to simply store all the preceding lines in a buffer array and, when I encounter a "From," write that array to a file instead of recording the blank line numbers. I think this borrows a page from your book inasmuch as I'd be reading more than just a line at a time, but I haven't yet worked out how to know when I have encompassed an "interesting chunk" of the source data and can write it to a file. Working on that now. I'm really not sure how much of a performance hit I should expect for very large arrays (considering many of these e-mails may have very large attachments, which are even larger when rendered as base64). I would appreciate your thoughts on that as well.
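One possible single-pass variant of the buffering idea above, as a minimal sketch rather than a finished tool. It borrows paragraph mode and the /^From:/im test from GrandFather's reply, and writes each chunk to the current output file as soon as it is read, so no large array is built. The sub name split_messages, its arguments and the _%06d.txt file naming are assumptions made for illustration. It also assumes that every message's header block is a single paragraph containing a From: line, so a body paragraph with a line starting "From:" would be split off by mistake.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split the mail text read from $in_fh into one file per message.
# Hypothetical helper: the name, arguments and file naming are assumptions.
# Assumes each header block is one paragraph containing a "From:" line.
sub split_messages {
    my ($in_fh, $prefix) = @_;
    local $/ = '';    # paragraph mode: one blank-line-delimited chunk per read
    my ($count, $out_fh, @written) = (0);

    while (my $chunk = <$in_fh>) {
        # A chunk with a From: header line starts a new message; the very
        # first chunk always does, even if it lacks one.
        if (!$out_fh || $chunk =~ /^From:/im) {
            if ($out_fh) { close $out_fh or die "Can't close output: $!"; }
            my $name = sprintf '%s_%06d.txt', $prefix, ++$count;
            open $out_fh, '>', $name or die "Can't open $name: $!";
            push @written, $name;
        }
        print {$out_fh} $chunk;    # stream straight to the current file
    }
    if ($out_fh) { close $out_fh or die "Can't close output: $!"; }
    return @written;
}

# Usage: perl split_mail.pl mails.txt   (writes mails_000001.txt, ...)
for my $file (@ARGV) {
    open my $in, '<', $file or die "Can't open $file: $!";
    (my $prefix = $file) =~ s/\.[^.\/]*\z//;    # drop the extension, if any
    print "$_\n" for split_messages($in, $prefix);
    close $in;
}
```

Only one chunk is held in memory at a time. A base64 attachment normally contains no blank lines, so it arrives as one (possibly large) chunk, but that is still far less than buffering a whole message or file.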

      Thanks again--especially for the link to perlvar, very educational.

        "I think you're assuming that each e-mail will begin with '^From: '"

        Actually, no. /^From:/im performs a case-insensitive, multi-line match. With /m, the ^ anchors at the start of any line (and is unaffected by setting $/), so the match will find "From:" at the start of the string or at the start of any following newline-delimited "line". Try taking the sample code I provided and reordering the header lines, adding new header lines, or whatever takes your fancy, so long as you don't add bogus blank lines before the "From" line.
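To see the effect of the /m modifier in isolation, here is a short sketch. The header block and the sub name has_from_line are invented for the example; it compares the same pattern with and without /m on a chunk where From: is not the first line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical header block in which From: is not the first line.
my $headers = "x-sender: someone\@example.com\nDate: today\nfrom: here\nTo: there\n";

# True if any line of the chunk starts with "From:", in any letter case.
sub has_from_line {
    my ($chunk) = @_;
    return $chunk =~ /^From:/im ? 1 : 0;
}

print "with /m:    ", has_from_line($headers), "\n";

# Without /m, ^ matches only at the very start of the string.
print "without /m: ", ($headers =~ /^From:/i ? 1 : 0), "\n";
```

The first line prints 1 and the second prints 0, because only the /m version lets ^ match after an embedded newline.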

        Another useful link may be perlretut. There's a lot of reading there, but it will be worth the time working through it!

        True laziness is hard work
Re: Subsetting text files containing e-mails
by sundialsvc4 (Abbot) on Jan 26, 2012 at 14:43 UTC

    Okay, and the first thing I would try is to go to http://search.cpan.org and type in mailbox and pore through all of the 311 hits thereby produced.

    I am going to assume that these files are probably in some kind of standard “mailbox” format; certainly, the messages themselves are. Therefore, I am going to be acting on the assumption that I am dealing with a well-known task that someone else has already thoroughly solved for me, either in part or (more likely) altogether. Thoughts of having to waste my own time niggling with regular expressions simply are not going to enter my set of initial project design assumptions. I am going to plan to spend very little time writing and a lot of time looking.

      Your advice on project management is very much appreciated.
