Re^3: Subsetting text files containing e-mails

by GrandFather (Cardinal)
on Jan 27, 2012 at 07:34 UTC (#950273)


in reply to Re^2: Subsetting text files containing e-mails
in thread Subsetting text files containing e-mails

"I think you're assuming that each e-mail will begin with '^From: '"

Actually, no. /^From:/im performs a case-insensitive, multi-line match. With /m the ^ anchors at the start of any line (and is unaffected by setting $/), so the match will find "From:" at the start of the string or at the start of any following newline-delimited "line". Try taking the sample code I provided and reorder the header lines, add new header lines, whatever takes your fancy, so long as you don't add bogus blank lines before the "From" line.
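As a minimal sketch of the anchoring behaviour described above, the snippet below matches one paragraph with and without /m. The sample header text and the variable names ($para, $with_m, $without_m) are made up for illustration and do not come from the earlier sample code.

```perl
use strict;
use warnings;

# Hypothetical sample: one "paragraph" as paragraph mode would return it,
# with the From: header deliberately not on the first line.
my $para = "Date: Fri, 27 Jan 2012 07:34:00 +0000\n"
         . "from: someone\@example.com\n"
         . "Subject: test\n\n";

my $with_m    = $para =~ /^From:/im ? 1 : 0;  # ^ may match after any newline
my $without_m = $para =~ /^From:/i  ? 1 : 0;  # ^ matches only at string start

print "with /m: $with_m, without /m: $without_m\n";
```

With /m the header is found even though it sits on the second line; without /m the same pattern fails because ^ can only match at the very start of the string.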

Another useful link may be perlretut. There's a lot of reading there, but it will be worth the time working through it!

True laziness is hard work


Re^4: Subsetting text files containing e-mails
by PeterCap (Initiate) on Jan 27, 2012 at 08:26 UTC

    Aha! I get it. So essentially when a paragraph is found that contains '^From:' it places a marker at the beginning of that paragraph?

    I could not figure out how it was handling all the blank lines within the e-mails until I realized that it wasn't and didn't need to.

    Just to be clear, in order to actually subset the file I would still need to close and reopen it, right? I'm thinking something like:

    open(MYDATA, $filein);
    while (<MYDATA>) {
        if (/^---- Email 1/ ... /^---- Email 2/) {
            open(MYOUTPUT, ">$fileout");
            print MYOUTPUT $_;
            close(MYOUTPUT);
        }
    }
    close(MYDATA);

    I suppose I might create a loop so that a new value for the search terms (i.e., /^---- Email 2/ ... /^---- Email 3/ for the second iteration, etc.) is selected as well as a new output file to catch the results...

      You don't need more than one pass through the source file. Just create the output files as you need them. In sketch you'd have something like:

      use strict;
      use warnings;

      my $emailNum;
      my $outFile;

      $/ = '';    # Set readline to "Paragraph mode"

      while (<DATA>) {
          if (!$emailNum || /^From:/im) {
              close $outFile if $outFile;
              my $fname = sprintf "mails_%06d.txt", ++$emailNum;
              open $outFile, '>', $fname or die "Can't create $fname: $!\n";
          }

          print $outFile $_;
      }
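To try the grouping logic without touching the disk, here is a rough variant of the same idea that reads from an in-memory filehandle and collects each e-mail in an array instead of writing files. The sample messages, the @mails array, and the $text variable are assumptions made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical input: two messages, each with a blank line between
# headers and body, and a first body that itself contains a blank line.
my $text = <<'END';
Subject: first
From: a@example.com

Body of first mail.

Second paragraph of first mail.

From: b@example.com
Subject: second

Body of second mail.
END

my @mails;    # one accumulated string per e-mail

open my $in, '<', \$text or die "Can't open in-memory file: $!\n";
{
    local $/ = '';    # paragraph mode, restored when this block exits
    while (my $para = <$in>) {
        # Start a new message on the first paragraph or on a From: header
        push @mails, '' if !@mails || $para =~ /^From:/im;
        $mails[-1] .= $para;
    }
}
close $in;

printf "%d mails found\n", scalar @mails;
```

Paragraphs that carry no From: line are simply appended to the current message, which is why blank lines inside a body need no special handling. One caveat of this approach: a body paragraph containing a line that starts with "From:" would be treated as the start of a new message.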
      True laziness is hard work
