PerlMonks
Re^2: Subsetting text files containing e-mails
by PeterCap (Initiate) on Jan 27, 2012 at 07:19 UTC ( [id://950271]=note )
Much appreciated! However, I think you're assuming that each e-mail will begin with '^From: ', which is not the case, since the SMTP envelope may contain an arbitrary number of lines before then (e.g., 'x-sender:'). If I could rely on every e-mail starting with the same sequence, or having any guaranteed structure, this would certainly be easier, but from reading the germane RFCs I can't count on that structure. Instead, I have defined the problem as follows:
So, it occurred to me that all I need to do is read through the file line-by-line and keep track of three line numbers:
I can find and store these in an array and then do a second pass through the file to subset. Could I get your opinion on the following? It works, but I am certain it can be improved.
One way to improve it might be to simply store all the preceding lines in a buffer array and, when I encounter a "From", write that array to a file instead of recording the blank line numbers. I think this borrows a page from your book inasmuch as I'd be reading more than just a line at a time, but I haven't yet worked out how to know when I have encompassed an "interesting chunk" of the source data and can write it to a file. Working on that now.

I'm really not sure how much of a performance hit I should expect for very large arrays, considering that many of these e-mails may have very large attachments, which are even larger when rendered as base64. I would appreciate your thoughts on that as well. Thanks again, especially for the link to perlvar, very educational.
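To make the buffering idea concrete, here is a minimal sketch of what I have in mind. It is only an illustration under stated assumptions: `extract_messages` is a hypothetical helper, the rule for where a message starts is passed in as a predicate because I can't rely on a fixed structure, and the demo simply assumes each message begins with an 'x-sender:' line. It also only looks for "From:" in the header block, which I take to end at the first blank line.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): split a stream into messages
# and return only those whose From: header matches $want.
# $is_start is a caller-supplied predicate deciding whether a line begins
# a new message. This is an ASSUMPTION; the real rule depends on the data.
sub extract_messages {
    my ($fh, $is_start, $want) = @_;
    my (@buffer, @kept);
    my $interesting = 0;    # did the current message match?
    my $in_headers  = 1;    # headers end at the first blank line

    my $flush = sub {
        push @kept, join('', @buffer) if $interesting && @buffer;
        @buffer      = ();
        $interesting = 0;
        $in_headers  = 1;
    };

    while (my $line = <$fh>) {
        # A new message starts: decide the fate of the buffered one.
        $flush->() if @buffer && $is_start->($line);

        if ($line =~ /^\r?\n?\z/) {
            $in_headers = 0;            # blank line: body begins
        }
        elsif ($in_headers && $line =~ /^From:/i && $line =~ $want) {
            $interesting = 1;
        }
        push @buffer, $line;
    }
    $flush->();    # the last message has no following boundary
    return @kept;
}

# Demo on made-up in-memory data (addresses are placeholders).
my $sample = <<'END';
x-sender: a@example.com
From: Alice <a@example.com>

Hello
x-sender: b@example.com
From: Bob <b@example.com>

Hi
END

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
my @hits = extract_messages($fh, sub { $_[0] =~ /^x-sender:/i }, qr/alice/i);
close $fh;
print scalar(@hits), " message(s) kept\n";
```

With this shape the buffer holds only one message at a time, so memory use should be bounded by the largest single e-mail rather than by the whole file. If individual messages turn out to be too big to hold comfortably, the same loop could presumably spool lines to a temporary file instead of an array.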