
Re: Subsetting text files containing e-mails

by GrandFather (Cardinal)
on Jan 25, 2012 at 09:52 UTC


in reply to Subsetting text files containing e-mails

To an extent you can translate bash into Perl. The control structures (if and loops) translate without much trouble. Many of the system utilities you'd use in a bash script have Perl equivalents, although they aren't drop-in replacements, and you are right to guess that Perl offers better ways than the pipelined filter processing you likely used in bash.
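As a rough illustration of a utility's Perl equivalent, here is a minimal sketch that counts header lines the way `grep -c '^From:'` would. The sample text is made up, and an in-memory filehandle stands in for a real mail file.

```perl
use strict;
use warnings;

# Made-up sample input standing in for a mail file.
my $text = "From: here\nTo: there\n\nbody\n\nFrom: somewhere\n\nbody\n";
open my $fh, '<', \$text or die "Can't open in-memory file: $!\n";

# Rough equivalent of: grep -c '^From:' mails.txt
my $count = 0;
while (my $line = <$fh>) {
    $count++ if $line =~ /^From:/;
}
close $fh;

print "$count\n";
```

The loop is longer than the one-line grep, but once you are reading line by line you can do the matching, counting and writing in the same pass instead of chaining filters.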

With Perl you will tend more to parse through the input file essentially a line at a time to find the "interesting bits" and generate output as you go. The thing is to be able to recognise an interesting bit before you have moved on to the next line. Perl has a neat trick that makes that pretty easy in your case. I apologise in advance for spoilers: the following is most of the solution you need, so it is more than you asked for:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $emailNum;

    $/ = '';    # Set readline to "Paragraph mode"

    while (<DATA>) {
        if (!$emailNum || /^From:/im) {
            ++$emailNum;
            print "---- Email $emailNum\n";
        }

        print;
    }

    __DATA__
    From: here
    To: there
    Data:

    I have a number of e-mail messages (envelope, headers, body, sometimes including
    base64-encoded attachments) which are spread out among an arbitrary number
    of text files. In most cases, there are multiple messages in each text file.

    I want to subset each large file so that each e-mail is saved to its own
    file. So, if the source file is "mails.txt" containing three messages, I want
    to create (e.g.) mails_000001.txt, mails_000002.txt, and mails_000003.txt.

    From: somewhere
    To: elsewhere
    Data:

    A second email

Prints:

    ---- Email 1
    From: here
    To: there
    Data:

    I have a number of e-mail messages (envelope, headers, body, sometimes including
    base64-encoded attachments) which are spread out among an arbitrary number
    of text files. In most cases, there are multiple messages in each text file.

    I want to subset each large file so that each e-mail is saved to its own
    file. So, if the source file is "mails.txt" containing three messages, I want
    to create (e.g.) mails_000001.txt, mails_000002.txt, and mails_000003.txt.

    ---- Email 2
    From: somewhere
    To: elsewhere
    Data:

    A second email

See perlvar for a description of what $/ is doing.
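To see paragraph mode in isolation, here is a small sketch. The sample text and variable names are made up for illustration, and `local` is used so the change to `$/` does not leak into the rest of a larger program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample text: three "paragraphs" separated by one or more blank lines.
my $text = "one\ntwo\n\n\n\nthree\n\nfour\nfive\n";

# Read from an in-memory filehandle so the sketch is self-contained.
open my $fh, '<', \$text or die "Can't open in-memory file: $!\n";

my @records;
{
    local $/ = '';    # paragraph mode, restored at the end of this block
    @records = <$fh>; # each element is one whole paragraph
}
close $fh;

printf "%d records\n", scalar @records;
for my $record (@records) {
    (my $shown = $record) =~ s/\n/\\n/g;    # make the newlines visible
    print "[$shown]\n";
}
```

Each read returns a whole paragraph, and a run of several blank lines counts as a single separator, so the text above comes back as three records.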

True laziness is hard work


Re^2: Subsetting text files containing e-mails
by PeterCap (Initiate) on Jan 27, 2012 at 07:19 UTC

    Much appreciated! However, I think you're assuming that each e-mail will begin with '^From: ', which is not the case: the SMTP envelope may contribute an arbitrary number of lines before that (e.g., 'x-sender:'). If I could rely on every e-mail starting with the same sequence, or having any guaranteed structure, this would certainly be easier, but from reading the germane RFCs I can't count on that.

    Instead, how I have defined the problem is this:

    1. Find some line that is (almost certainly) going to be in the SMTP envelope (in this case, the only fields I think are almost guaranteed to be there are "From" and "Date").
    2. Find the preceding blank line (since I am almost certain that there are blank lines between each e-mail--of course, there are also blank lines within e-mails as well).

    So, it occurred to me that all I need to do is read through the file line-by-line and keep track of three line numbers:

    • the line number I'm on "right now"
    • the line number of the last blank line observed
    • the line number of the blank line before that one

    I can find and store these in an array and then do a second pass through the file to subset.

    Could I get your opinion on the following? It works, but I am certain it can be improved.

        #!/usr/bin/env perl
        use strict;
        use Getopt::Std;

        my %opts;
        my $FileToHandle;
        getopts('o:', \%opts);
        my $k = 0;

        sub ParseEmail {
            my $FileToProcess = $_[0];
            my @mailBoundaries=();
            my $myLine = 0;
            my $recentBlank = 1;
            my $previousBlank = 1;

            # First pass to find out where to split the file...
            open (FILETOREAD, $FileToProcess) or die "Can't open $FileToProcess: $!\n";
            while (<FILETOREAD>) {
                if (/^$/) {
                    $recentBlank = $myLine+1;
                    print "Recent Blank: $recentBlank\n";
                }
                if (/^From: / && $previousBlank != $recentBlank) {
                    push(@mailBoundaries, $previousBlank);
                    $previousBlank = $recentBlank;
                    print "PreviousBlank: $previousBlank\n";
                }
                if (eof && $previousBlank == 1) {
                    push(@mailBoundaries, $previousBlank);
                    push(@mailBoundaries, $myLine+1);
                }
                elsif (eof && $previousBlank != 1) {
                    push(@mailBoundaries, $myLine+1);
                }
                $myLine+=1;
                print "My Line: $myLine\n";
            }
            close (FILETOREAD);

            # Second pass to subset the file
            my $i = 0;
            while ($i <= ($#mailBoundaries - 1)) {
                $k+=1;
                my $j = 0;
                open (FILETOREAD, $FileToProcess) or die "Can't open $FileToProcess: $!\n";
                while (<FILETOREAD>) {
                    $j+=1;
                    if ($j >= @mailBoundaries[$i] && $j <= @mailBoundaries[$i+1]) {
                        my $FileNameToWrite = $opts{'o'} . "_" . sprintf("%06d", $k);
                        print "Mail Boundaries:";
                        print map { "$_ \n" } @mailBoundaries;
                        print "\n";
                        print "I am going to print lines @mailBoundaries[$i] to @mailBoundaries[$i+1] from $FileToProcess to $FileNameToWrite.\n";
                        open (FILETOWRITE, ">>$FileNameToWrite") or die "Can't open $FileToProcess: $!\n";
                        print FILETOWRITE $_;
                    }
                }
                close (FILETOREAD);
                $i+=1;
            }
        }

        foreach $FileToHandle (map { glob } @ARGV) {
            ParseEmail($FileToHandle);
        }

    One way to improve it might be to simply store all the preceding lines in a buffer array and, when I encounter a "From", write that array to a file instead of recording the blank line numbers. I think this borrows a page from your book inasmuch as I'd be reading more than just a line at a time, but I haven't yet worked out how to know when I have encompassed an "interesting chunk" of the source data and can write it to a file. Working on that now. I'm really not sure how much of a performance hit I should expect for very large arrays (considering many of these e-mails may have very large attachments, which are even larger when rendered as base64). I would appreciate your thoughts on that as well.
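One possible shape for the buffering idea, as a hedged sketch rather than a worked-out answer: keep only the current blank-line-delimited block in the buffer, and decide where it belongs when the block ends. The sub name `split_with_buffer`, the sample data, and the choice to collect messages in an array (instead of printing them to files) are all assumptions made to keep the example short.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one string per e-mail read from $in.
# Only the current block of lines is buffered while reading.
sub split_with_buffer {
    my ($in) = @_;
    local $/ = "\n";    # plain line-at-a-time reads, whatever the caller set
    my (@messages, @block);
    my $saw_from = 0;

    # Hand the finished block to the right message and empty the buffer.
    my $flush = sub {
        return unless @block;
        # A block with a "From:" line starts a new message (so does the
        # very first block); any other block extends the current message.
        push @messages, '' if $saw_from || !@messages;
        $messages[-1] .= join '', @block;
        @block    = ();
        $saw_from = 0;
    };

    while (my $line = <$in>) {
        if ($line !~ /\S/) {            # empty or whitespace-only line
            if (@block) {
                push @block, $line;     # keep the blank line with its block
                $flush->();
            }
            elsif (@messages) {
                $messages[-1] .= $line; # extra blank line inside a message
            }
            # blank lines before the first message are dropped
        }
        else {
            $saw_from = 1 if $line =~ /^From:/i;
            push @block, $line;
        }
    }
    $flush->();    # the last block may not end with a blank line
    return @messages;
}

# Demo with made-up data. A real script would print each block to the
# current output file instead of collecting whole messages in memory.
my $mailbox = "x-sender: a\nFrom: here\n\nbody one\n\nmore body\n\n"
            . "Date: today\nFrom: there\n\nbody two\n";
open my $in, '<', \$mailbox or die "Can't open in-memory file: $!\n";
my @found = split_with_buffer($in);
close $in;

print scalar(@found), " messages found\n";
```

On memory: the buffer holds one block at a time. A base64 attachment normally contains no blank lines, so it would arrive as a single block and the buffer would briefly hold the whole attachment.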

    Thanks again--especially for the link to perlvar, very educational.

      "I think you're assuming that each e-mail will begin with '^From: '"

      Actually, no. /^From:/im performs a case-insensitive multi-line match. With the m modifier, ^ anchors at the start of any line (and is unaffected by setting $/), so the match will find "From:" at the start of the string or at the start of any following newline-delimited "line". Try taking the sample code I provided and reordering the header lines, adding new header lines, whatever takes your fancy, so long as you don't add bogus blank lines before the "From" line.
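A quick way to check this, sketched with made-up paragraphs of the kind paragraph mode would hand to the loop (the labels and sample headers are illustrative only):

```perl
use strict;
use warnings;

# Made-up paragraphs, as paragraph mode would hand them to the loop.
my @samples = (
    [ 'From on first line' => "From: here\nTo: there\n\n" ],
    [ 'From on third line' => "x-sender: someone\nDate: today\nfrom: here\n\n" ],
    [ 'From in mid-line'   => "This body text mentions From: but not at a line start\n\n" ],
);

my %is_start;
for my $sample (@samples) {
    my ($name, $para) = @$sample;
    # /m lets ^ match after any embedded newline; /i ignores case.
    $is_start{$name} = $para =~ /^From:/im ? 1 : 0;
    print "$name: ", ($is_start{$name} ? 'new e-mail' : 'same e-mail'), "\n";
}
```

The first two paragraphs count as the start of a new e-mail and the third does not. The flip side is that a body paragraph containing a line that begins with "From:" (quoted or forwarded mail, for example) would also match, so real data may need a stricter test.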

      Another useful link may be perlretut. There's a lot of reading there, but it will be worth the time working through it!

      True laziness is hard work

        Aha! I get it. So essentially when a paragraph is found that contains '^From:' it places a marker at the beginning of that paragraph?

        I could not figure out how it was handling all the blank lines within the e-mails until I realized that it wasn't and didn't need to.

        Just to be clear, in order to actually subset the file I would still need to close and reopen it, right? I'm thinking something like:

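            open (MYDATA, $filein) or die "Can't open $filein: $!\n";
            while (<MYDATA>) {
                if (/^---- Email 1/ ... /^---- Email 2/) {
                    # Append, so each matching line is kept rather than overwritten
                    open (MYOUTPUT, ">>$fileout") or die "Can't open $fileout: $!\n";
                    print MYOUTPUT $_;
                    close (MYOUTPUT);
                }
            }
            close (MYDATA);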

        I suppose I might create a loop so that a new value for the search terms (i.e., /^---- Email 2/ ... /^---- Email 3/ for the second iteration, etc.) is selected as well as a new output file to catch the results...
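A second pass may not be needed: the same paragraph-mode loop can switch output files whenever a new e-mail starts. The following is a minimal sketch under stated assumptions, not a tested solution for real mail data. The sub name `split_emails`, the sample mailbox and the temporary directory are made up; the numbered file names follow the mails_000001.txt pattern from the original question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: copy each e-mail read from $in to its own file,
# named "<prefix>_000001.txt", "<prefix>_000002.txt", ... and return the names.
sub split_emails {
    my ($in, $prefix) = @_;
    local $/ = '';    # paragraph mode for the reads in this sub only
    my ($out, @names);

    while (my $para = <$in>) {
        # Start a new output file at the first paragraph and at every
        # paragraph that has a line beginning with "From:".
        if (!$out || $para =~ /^From:/im) {
            if ($out) { close $out or die "Can't close $names[-1]: $!\n"; }
            push @names, sprintf '%s_%06d.txt', $prefix, scalar(@names) + 1;
            open $out, '>', $names[-1] or die "Can't create $names[-1]: $!\n";
        }
        print {$out} $para;
    }
    if ($out) { close $out or die "Can't close $names[-1]: $!\n"; }
    return @names;
}

# Demo with made-up data, written to a temporary directory.
my $mailbox = "x-sender: a\nFrom: here\nTo: there\n\nFirst body\n\nmore body\n\n"
            . "From: somewhere\n\nSecond body\n";
open my $in, '<', \$mailbox or die "Can't open in-memory file: $!\n";
my $dir   = tempdir(CLEANUP => 1);
my @files = split_emails($in, "$dir/mails");
close $in;

print scalar(@files), " files written\n";
```

One caveat: paragraph mode treats a run of blank lines as a single separator, so the copies are not guaranteed to be byte-for-byte identical to the source wherever it contains several blank lines in a row.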
