Re: Subsetting text files containing e-mails

To an extent you can translate bash into Perl. The control structures (if and loops) translate without much trouble. Many of the system utilities you'd use in a bash script have Perl equivalents, although they aren't drop-in replacements, and you are right to guess that Perl offers better ways than the pipelined filter technique you likely used with bash.

With Perl you tend to parse the input file a line at a time, finding the "interesting bits" and generating output as you go. The key is to recognise an interesting bit before you have moved on to the next line. Perl has a neat trick that makes that pretty easy in your case. I apologise in advance for spoilers: the following is most of the solution you need, so it is more than you asked for:

#!/usr/bin/perl
use strict;
use warnings;

my $emailNum;

$/ = ''; # Set readline to "Paragraph mode"

while (<DATA>) {
    if (!$emailNum || /^From:/im) {
        ++$emailNum;

        print "---- Email $emailNum\n";
    }

    print;
}

__DATA__
From: here
To: there
Data:
I have a number of e-mail messages (envelope, headers, body, sometimes including
base64-encoded attachments) which are spread out among an arbitrary number of
text files. In most cases, there are multiple messages in each text file.

I want to subset each large file so that each e-mail is saved to its own file.
So, if the source file is "mails.txt" containing three messages, I want to
create (e.g.) mails_000001.txt, mails_000002.txt, and mails_000003.txt.

From: somewhere
To: elsewhere
Data:
A second email

Prints:

---- Email 1
From: here
To: there
Data:
I have a number of e-mail messages (envelope, headers, body, sometimes including
base64-encoded attachments) which are spread out among an arbitrary number of
text files. In most cases, there are multiple messages in each text file.

I want to subset each large file so that each e-mail is saved to its own file.
So, if the source file is "mails.txt" containing three messages, I want to
create (e.g.) mails_000001.txt, mails_000002.txt, and mails_000003.txt.

---- Email 2
From: somewhere
To: elsewhere
Data:
A second email

See perlvar for a description of what $/ is doing.
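For anyone who wants to see the effect of $/ in isolation, here is a minimal sketch that reads the same text in line mode and in paragraph mode and counts the records returned. The sample text and the helper name count_records are made up for illustration; they are not part of the code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: three "paragraphs", the second followed by two blank lines.
my $text = "From: a\nTo: b\n\nbody line\n\n\nlast paragraph\n";

# Count the records readline returns for a given input record separator.
sub count_records {
    my ($separator, $string) = @_;
    local $/ = $separator;    # restored automatically when the sub returns
    open my $fh, '<', \$string or die "Can't open in-memory file: $!\n";
    my $count = 0;
    $count++ while <$fh>;
    close $fh;
    return $count;
}

print "line mode:      ", count_records("\n", $text), " records\n";
print "paragraph mode: ", count_records('',   $text), " records\n";
```

Line mode returns 7 records and paragraph mode returns 3, because paragraph mode treats a run of consecutive blank lines as a single separator.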

True laziness is hard work


Re^2: Subsetting text files containing e-mails by PeterCap (Initiate) on Jan 27, 2012 at 07:19 UTC
Much appreciated! However, I think you're assuming that each e-mail will begin with '^From: ', which is not the case since the SMTP envelope may contain any arbitrary number of lines before then (e.g., 'x-sender:'). (I think if I could rely on every e-mail to start with the same sequence--or to have any guaranteed structure--then this would certainly be easier! But from reading the germane RFCs I can't count on that structure.)

Instead, how I have defined the problem is this:

- Find some line that is (almost certainly) going to be in the SMTP envelope (in this case, the only fields I think are almost guaranteed to be there are "From" and "Date").
- Find the preceding blank line (since I am almost certain that there are blank lines between each e-mail--of course, there are also blank lines within e-mails as well).

So, it occurred to me that all I need to do is read through the file line-by-line and keep track of three line numbers:

- the line number I'm on "right now"
- the line number of the last blank line observed
- the line number of the blank line before that blank line

I can find and store these in an array and then do a second pass through the file to subset.

Could I get your opinion on the following? It works, but I am certain it can be improved.

#!/usr/bin/env perl
use strict;
use Getopt::Std;

my %opts;
my $FileToHandle;
getopts('o:', \%opts);
my $k = 0;

sub ParseEmail {
    my $FileToProcess = $_[0];
    my @mailBoundaries=();
    my $myLine = 0;
    my $recentBlank = 1;
    my $previousBlank = 1;

    # First pass to find out where to split the file...
    open (FILETOREAD, $FileToProcess) or die "Can't open $FileToProcess: $!\n";
    while (<FILETOREAD>) {
        if (/^$/) {
            $recentBlank = $myLine+1;
            print "Recent Blank: $recentBlank\n";
        }
        if (/^From: / && $previousBlank != $recentBlank) {
            push(@mailBoundaries, $previousBlank);
            $previousBlank = $recentBlank;
            print "PreviousBlank: $previousBlank\n";
        }
        if (eof && $previousBlank == 1) {
            push(@mailBoundaries, $previousBlank);
            push(@mailBoundaries, $myLine+1);
        } elsif (eof && $previousBlank != 1) {
            push(@mailBoundaries, $myLine+1);
        }
        $myLine+=1;
        print "My Line: $myLine\n";
    }
    close (FILETOREAD);

    # Second pass to subset the file
    my $i = 0;
    while ($i <= ($#mailBoundaries - 1)) {
        $k+=1;
        my $j = 0;
        open (FILETOREAD, $FileToProcess) or die "Can't open $FileToProcess: $!\n";
        while (<FILETOREAD>) {
            $j+=1;
            if ($j >= @mailBoundaries[$i] && $j <= @mailBoundaries[$i+1]) {
                my $FileNameToWrite = $opts{'o'} . "_" . sprintf("%06d", $k);
                print "Mail Boundaries:";
                print map { "$_ \n" } @mailBoundaries;
                print "\n";
                print "I am going to print lines @mailBoundaries[$i] to @mailBoundaries[$i+1] from $FileToProcess to $FileNameToWrite.\n";
                open (FILETOWRITE, ">>$FileNameToWrite") or die "Can't open $FileToProcess: $!\n";
                print FILETOWRITE $_;
            }
        }
        close (FILETOREAD);
        $i+=1;
    }
}

foreach $FileToHandle (map { glob } @ARGV) {
    ParseEmail($FileToHandle);
}

One way to improve it might be to simply store all the preceding lines in a buffer array, and when I encounter a "From," instead of recording the blank line numbers, write that array to a file. I think this borrows a page from your book inasmuch as I'd be reading more than just a line at a time, but I haven't yet worked out how to know when I have encompassed an "interesting chunk" of the source data and can write it to a file. Working on that now.

I'm really not sure how much of a performance hit I should expect for very large arrays (considering many of these e-mails may have very large attachments which are even larger when rendered as base64). I would appreciate your thoughts on that as well.

Thanks again--especially for the link to perlvar, very educational.
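The buffer idea described above could be sketched roughly as follows. This is only one possible shape, under the same assumptions made in the post: messages are separated by blank lines, and the header paragraph of each message contains a line starting with "From:". The names split_messages and $emit are hypothetical. A body paragraph that happens to contain a line starting with "From:" (a quoted or forwarded message, say) would be split incorrectly.

```perl
use strict;
use warnings;

# Read $fh line by line and call $emit->($text) once per message.
# Only the current message is held in memory at any time.
sub split_messages {
    my ($fh, $emit) = @_;
    local $/ = "\n";                   # make sure we really read single lines
    my (@message, @paragraph);

    # Called whenever a paragraph is complete (blank line or end of file).
    my $flush_paragraph = sub {
        return if !@paragraph;
        # A paragraph holding a From: line starts a new message,
        # so hand over whatever has been collected so far.
        if (@message && grep { /^From:/i } @paragraph) {
            $emit->(join '', @message);
            @message = ();
        }
        push @message, @paragraph;
        @paragraph = ();
    };

    while (my $line = <$fh>) {
        push @paragraph, $line;
        $flush_paragraph->() if $line =~ /^$/;    # a blank line closes the paragraph
    }
    $flush_paragraph->();                         # the last paragraph may lack a blank line
    $emit->(join '', @message) if @message;
    return;
}
```

The "interesting chunk" question is answered by the blank line: a paragraph is only inspected once it is complete, so header lines that precede "From:" (such as 'x-sender:') stay with the message they belong to. Passing a callback that writes straight to a file keeps memory use bounded by the largest single message rather than by the whole input.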
Re^3: Subsetting text files containing e-mails by GrandFather (Saint) on Jan 27, 2012 at 07:34 UTC
"I think you're assuming that each e-mail will begin with '^From: '"

Actually, no. `/^From:/im` performs a case-insensitive multi-line match. With /m the ^ anchors at the start of any line (and is unaffected by setting $/), so the match will find "From:" at the start of the string or at the start of any following newline-delimited "line". Try taking the sample code I provided and reorder the header lines, add new header lines, whatever takes your fancy, so long as you don't add bogus blank lines before the "From" line.

Another useful link may be perlretut. There's a lot of reading there, but it will be worth the time working through it!

True laziness is hard work
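A small sketch of the difference the /m modifier makes, using a made-up header paragraph in which "from:" is the third line:

```perl
use strict;
use warnings;

# Hypothetical paragraph: other header lines come before the From line.
my $paragraph = "x-sender: relay\nReceived: by host\nfrom: someone\nTo: else\n\n";

# Without /m, ^ matches only at the very start of the string.
my $without_m = $paragraph =~ /^From:/i  ? 1 : 0;

# With /m, ^ also matches immediately after each embedded newline.
my $with_m    = $paragraph =~ /^From:/im ? 1 : 0;

print "without /m: $without_m\n";
print "with /m:    $with_m\n";
```

The first match fails and the second succeeds, which is why the position of the "From:" line within the header paragraph does not matter.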
Re^4: Subsetting text files containing e-mails by PeterCap (Initiate) on Jan 27, 2012 at 08:26 UTC
Aha! I get it. So essentially when a paragraph is found that contains '^From:' it places a marker at the beginning of that paragraph? I could not figure out how it was handling all the blank lines within the e-mails until I realized that it wasn't and didn't need to.

Just to be clear, in order to actually subset the file I would still need to close and reopen it, right? I'm thinking something like:

open (<MYDATA>, $filein);
while (<MYDATA>) {
    if (/^---- Email 1/ ... /---- Email2/) {
        open (<MYOUTPUT>, ">$fileout");
        print MYOUTPUT $_;
        close (MYOUTPUT);
    }
}
close (MYDATA);

I suppose I might create a loop so that a new value for the search terms (i.e., `/^---- Email 2/ ... /^---- Email 3/` for the second iteration, etc.) is selected as well as a new output file to catch the results...
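One possible way to do the subsetting in a single pass, without reopening the input or writing marker lines first, is to switch output files whenever a header paragraph turns up. The following is a sketch only: split_mailbox is a hypothetical helper, the output naming follows the mails_000001.txt pattern from the original question, and it relies on the same paragraph mode and `/^From:/im` test as the earlier sample code.

```perl
use strict;
use warnings;

# Hypothetical helper: split $in_path into ${out_prefix}_000001.txt, ...
# Returns the number of output files written.
sub split_mailbox {
    my ($in_path, $out_prefix) = @_;
    local $/ = '';    # paragraph mode

    open my $in, '<', $in_path or die "Can't open $in_path: $!\n";
    my ($out, $count) = (undef, 0);

    while (my $paragraph = <$in>) {
        # Start a new output file for the first paragraph and for every
        # paragraph that contains a From: header line.
        if (!$out || $paragraph =~ /^From:/im) {
            if ($out) { close $out or die "Can't close output: $!\n"; }
            my $out_path = sprintf '%s_%06d.txt', $out_prefix, ++$count;
            open $out, '>', $out_path or die "Can't open $out_path: $!\n";
        }
        print {$out} $paragraph;
    }

    if ($out) { close $out or die "Can't close output: $!\n"; }
    close $in;
    return $count;
}

# Usage sketch: split_mailbox('mails.txt', 'mails');
```

The input is read once from start to finish, and only one output file is open at a time.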
Re^5: Subsetting text files containing e-mails by GrandFather (Saint) on Jan 27, 2012 at 09:17 UTC


	PerlMonks