AssFace has asked for the wisdom of the Perl Monks concerning the following question:

I made reference to this before in this post and now raise the point again, but with a more direct plea.

I have two directories of e-mail. One is full of spam, the other full of ham.
I would like to iterate over them, open each message, look at only the headers, and get all of the e-mail addresses from them (To, Cc, Bcc - From doesn't really matter... and technically Bcc doesn't matter either, since it is often left entirely empty).
From there I want to look at each address and see if my domain is in there; if so, pull out that user and add it to an array.
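
For the "is my domain in there" step, here is a minimal sketch, assuming the header value is already in a string. The helper name `users_at_domain`, the sample addresses and the domain are all made up for illustration, and the pattern is deliberately loose rather than a full RFC 2822 address parser.

```perl
use strict;
use warnings;

# Hypothetical helper: return the local parts (user names) of every address
# in $field_value whose domain equals $domain (compared case-insensitively).
# The pattern is a loose approximation, not a complete address grammar.
sub users_at_domain {
    my ($field_value, $domain) = @_;
    my @users;
    while ($field_value =~ /([\w.+-]+)\@([\w.-]+)/g) {
        my ($user, $host) = ($1, $2);
        push @users, lc $user if lc $host eq lc $domain;
    }
    return @users;
}

# Made-up To: value and domain, for illustration only.
my $to = '"Alice" <alice@example.com>, bob@elsewhere.org, Carol@EXAMPLE.COM';
my @users = users_at_domain($to, 'example.com');
print join(', ', @users), "\n";
```

Lower-casing the user name is a choice made here so that per-user counts are not split by capitalization; drop the `lc` if that matters for your stats.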

All of this is trivial, or so I thought. I can do the iterating, the arrays, etc. I had figured that I would use the MailTools package to pull out only the headers and save me the work - but apparently that refuses to work for me (perhaps because this is all on an Exchange server and it doesn't format them properly? don't know).

So what I want to know is how to get just the headers of the e-mail. Without anything so elegant as using a package to do it - just straightforward, brutish and in your face ugliness of code... or something.

I thought this would be a trivial issue, but after going through hundreds of mail examples, I am seeing that the headers rarely have the same format... which makes it hard to grab specific fields from them.
It is easy to find where to start - look for "To: " and then keep grabbing things until you get to... ahhh, there's the rub. It is never (Rarely) the same in these e-mails...
So then I think, why not just the headers? I know that they start... at the beginning of the message. Good, I know where to start... then they stop... well, again, fairly arbitrarily from what I can tell looking at these hundreds of messages.

So how do I know where to stop? Seems like many programming questions boil down to this. I know how to do something, but how do I know when to stop doing it?

I hope that this is brazenly obvious and I'm a total moron for not seeing it - but for now, and perhaps it is the heat, or even the humidity, I am stumped and sweaty.

Thanks to any and all that have a response. Even better if it is helpful.

-------------------------------------------------------------------
There are some odd things afoot now, in the Villa Straylight.

Re: Finding e-mail headers
by particle (Vicar) on Jun 30, 2003 at 17:56 UTC

    they stop at \n\n. order is unimportant. my( $key, $value )= split /:/, $_, 2; should get you going on processing them.
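
A minimal sketch of that approach, assuming one message per file: read lines until the first empty one and split each on the first colon. The helper name `read_headers` and the sample message are made up; stripping an optional "\r" and gluing folded continuation lines back on are additions on top of the one-liner above.

```perl
use strict;
use warnings;

# Hypothetical helper: read header lines from an open filehandle up to the
# first empty line. Returns a hash ref: lower-cased field name => array of values.
sub read_headers {
    my ($fh) = @_;
    my (%headers, $last);
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;          # strip an LF or CRLF line ending
        last if $line eq '';           # empty line: the headers are over
        if ($line =~ /^\s/ && defined $last) {
            # folded header: append the continuation to the previous value
            $line =~ s/^\s+//;
            $headers{$last}[-1] .= " $line";
            next;
        }
        my ($key, $value) = split /:/, $line, 2;
        next unless defined $value;    # not a "Name: value" line, skip it
        $value =~ s/^\s+//;
        $last = lc $key;
        push @{ $headers{$last} }, $value;
    }
    return \%headers;
}

# Demo on an in-memory message (made-up data) with CRLF line endings.
my $message = "To: a\@example.com,\r\n\tb\@example.com\r\nSubject: hi\r\n\r\nTo: not a header\r\n";
open my $fh, '<', \$message or die "cannot open in-memory file: $!";
my $h = read_headers($fh);
close $fh;
print "$_: @{ $h->{$_} }\n" for sort keys %$h;
```

Values are kept in arrays because fields such as Received: normally appear more than once.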

    ~Particle *accelerates*

Re: Finding e-mail headers
by Mr. Muskrat (Canon) on Jun 30, 2003 at 18:00 UTC

    Have you looked at Mail::Header?

    An email is composed of the headers, one blank line and the body. A regex could even be used if you really wanted to. Untested: my ($header, $body) = $email =~ /^(.*?\n)\n(.*)$/s; (the /s modifier is needed so that . also matches the newlines between header lines).
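
One thing to watch with a regex of this shape: in Perl, `.` only matches a newline under the /s modifier, so without it the pattern cannot span more than one header line. A minimal sketch, assuming the whole message is already in one string; the helper name `split_message` and the sample text are made up, and accepting an optional "\r" before each "\n" is an addition for files with CRLF endings.

```perl
use strict;
use warnings;

# Hypothetical helper: split a whole message held in one string into the
# header block and the body at the first empty line (LF or CRLF endings).
# Returns an empty list if no empty line is found.
sub split_message {
    my ($email) = @_;
    my ($header, $body) = $email =~ /\A(.*?\r?\n)\r?\n(.*)\z/s;
    return ($header, $body);
}

# Made-up message with CRLF endings and a blank line inside the body.
my ($header, $body) =
    split_message("To: a\@example.com\r\nSubject: hi\r\n\r\nHello\r\n\r\nBye\r\n");
print defined $header ? "split ok\n" : "no blank line found\n";
```

The non-greedy `.*?` makes the split happen at the first empty line, so blank lines inside the body stay in the body.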

      In addition to Mail::Header (which is almost certainly the best starting place), you may also wish to look at Mail::Field; both are part of MailTools.
        I'll give that a shot and have a look - since that was the module that I was trying to use initially and which fails on these files, I figured perhaps it wasn't the best place to take code from in this case.

        But that is an excellent point, and perhaps it is something as minor as the files having a \r in there as well, instead of a plain \n\n, since IIRC Windows enjoys those.

      Yeah, I ran MailTools on it the first shot thinking that would be far easier and that was what started this and the previous post that I made.

      I just tried out that code (yours there) on the file and that too failed. If that is how MailTools does it, that is likely why it didn't work, I guess.

      I thought I had seen that Windows adds a "\r" when it puts down "\n"s, so I tried all the combinations of that being in there with the "\n" and those too all failed to show headers (or the body with that code).



Re: Finding e-mail headers
by grinder (Bishop) on Jun 30, 2003 at 18:32 UTC

    If you have slurped your headers into a scalar, it's easy to chop it up into an array with multiline headers correctly stitched together:

    my $header = <<'HEADER';
    Return-Path: <NITAIGOURANGA@AOL.COM>
    X-Original-To: grinder@example.com
    Delivered-To: grinder@example.com
    Received: from RX504Second (ACBC197E.ipt.aol.com [172.188.25.126])
            by example.com (Postfix) with SMTP id C5960A94C
            for <grinder@example.com>; Mon, 30 Jun 2003 13:18:36 +0200 (CEST)
    From: "GOURANGA" <NITAIGOURANGA@AOL.COM>
    HEADER

    my @header = split /\n(?!\s+)/, $header;

    You might want to post-process each element to fold whitespace as well. A module will probably do a better job, but I find tr/\n\t / /s is usually good enough for my needs.
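
Putting the split and the whitespace folding together, here is a minimal runnable sketch. The header block is made-up sample data (built with `join` so the tab-indented continuation lines are explicit), and it assumes plain "\n" line endings; a file with "\r\n" endings would need the "\r" stripped first.

```perl
use strict;
use warnings;

# Made-up header block with one folded (multi-line) Received: field.
my $header = join '',
    "Return-Path: <sender\@example.org>\n",
    "Received: from host1 (host1.example.org [192.0.2.1])\n",
    "\tby example.com (Postfix) with SMTP id ABC123\n",
    "\tfor <user\@example.com>; Mon, 30 Jun 2003 13:18:36 +0200\n",
    "From: \"Sender\" <sender\@example.org>\n";

# Split on newlines that are NOT followed by whitespace, so folded
# continuation lines stay attached to the field they belong to.
my @fields = split /\n(?!\s+)/, $header;

# Collapse each run of newline/tab/space into a single space.
tr/\n\t / /s for @fields;

print "$_\n" for @fields;
```

Each element of `@fields` is then one complete "Name: value" string on a single line, ready for `split /:/, $_, 2`.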

    Once you have your headers, I would suggest looking at the Return-Path: header, which holds the envelope sender of the message. I would also suggest you take a look at the Received: headers as well. These two items are very revealing when it comes to dealing with spew. The To: and From: headers are nearly always forged, or at least irrelevant, in spammers' messages.

    _____________________________________________
    Come to YAPC::Europe 2003 in Paris, 23-25 July 2003.

      I'll have a try at that.

      In terms of the spam, I have SpamAssassin already doing that - the stats I'm interested in concern only our own users - seeing who is getting the most ham/spam.
      I don't particularly care who the spam is from for the exact reason that you mention.

        It may be easier to parse the entries from /var/log/maillog or /var/adm/messages (depending what system you are running sa on) and build a hash that has the message ID/to/from/spam rating and normalize from there.

        -Waswas
Re: Finding e-mail headers
by BazB (Priest) on Jun 30, 2003 at 18:15 UTC

    There are a whole host of email handling packages on CPAN, and they really are the easiest way.

    If you don't want to use modules, you're asking for a World of Pain, but the RFCs are there to help.
    Look at RFCs 822 and more importantly 2822 for a detailed description of Internet Messages (i.e. email).

    Once again, I really suggest you use the robust modules from CPAN, rather than try it yourself.


    If the information in this post is inaccurate, or just plain wrong, don't just downvote - please post explaining what's wrong.
    That way everyone learns.

      I would love to, and would perhaps even enjoy, using any one of the CPAN modules, provided they worked on the mail. I looked through them and tried MailTools, and that (which is what the previous thread I referenced was about) failed and I don't know why.

      Were I to know why, it would make life easier, but that module apparently doesn't have any error handling. It just dies quietly and peacefully, leaving me wondering what it doesn't like about the files.

      If I had to guess, I would guess that Windows gets an "\r" in there and that breaks it - but I haven't had time today to go and test out that theory (removing any "\r" in there and then trying to grab the headers with Mail::Internet and/or its Mail::Header).
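
A minimal sketch for testing the "\r" theory, using only core Perl. The helper names and the script name in the usage comment are made up. It reads each file with `binmode`, because on Windows Perl's default text mode turns "\r\n" into "\n" while reading, which would hide exactly the bytes in question.

```perl
use strict;
use warnings;

# Hypothetical helper: count CRLF and bare-LF line endings in raw data.
sub count_line_endings {
    my ($data) = @_;
    my $crlf = () = $data =~ /\r\n/g;
    my $lf   = () = $data =~ /(?<!\r)\n/g;
    return ($crlf, $lf);
}

# Hypothetical helper: return a copy of the data with CRLF turned into LF.
sub normalize_eol {
    my ($data) = @_;
    $data =~ s/\r\n/\n/g;
    return $data;
}

# Hypothetical helper: read a whole file with no newline translation.
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<', $path or die "cannot open $path: $!";
    binmode $fh;                     # keep \r bytes, even on Windows
    local $/;                        # slurp mode
    my $data = <$fh>;
    close $fh;
    return $data;
}

# Usage (script name is made up): perl check_eol.pl message-file [...]
for my $path (@ARGV) {
    my ($crlf, $lf) = count_line_endings(slurp_raw($path));
    print "$path: $crlf CRLF, $lf bare LF\n";
}
```

If the counts show CRLF endings, passing the output of `normalize_eol` to the header code would be one way to see whether the "\r" really is what breaks it.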

      If you have a specific module in mind that you think would work, even if MailTools fails, I'm all ears.
      I am just a little gun shy about running out and trying them all (I am on Windows and using ActiveState, which IMO is more annoying than using Perl in a *nix environment) and not having any of them work - leaving me exactly where I was, but having spent a lot of time learning which ones don't work and nothing else.

      I did attempt to look over the RFCs for the email, but the depth of reading was hard for me to justify for a script that is in the end to track spam stats on our mail server.
      I am hoping that someone who is already far more knowledgeable about it would be willing to pipe up and save me the time, but if not, perhaps I will be doing this in my free time rather than on the clock at work, since they have a stack of other things I need to get done first.

3 examples
by AssFace (Pilgrim) on Jun 30, 2003 at 18:18 UTC
    Here are 3 examples of what the messages look like in the text file (edited to remove real names and or info)
    (in a readmore)

      How are the mail files formatted? MailTools only deals with Internet mail messages and mbox formatted files, with special handling for the "From " separator. The files that Exchange uses could be mbox files, but I doubt it, especially since the messages you posted don't contain the characteristic "From " line. You probably have to figure out how to split the files into individual messages before feeding them to Mail::Header.
        The mail comes into Exchange, it is then run through SpamAssassin. If it is seen as spam, a copy is saved into the "spam" folder. If the mail is seen as ham, then into the "ham" directory a copy goes.

        The mail is "mail" enough to work with SpamAssassin when it comes in; I then save that mail out to the file system, one message per file.

        So each of those 3 examples is a separate file, which I have then attempted to feed into Mail::Header, and which then fails (when seen in Data::Dumper, it just loads the headers and the body all into the body tag).

        As for not having the "From" line, I'm assuming you mean that there is something about the From lines that is missing. On my own personal Unix system, I keep track of mail stats as well and when looking at those files, they look the same as these files do - the difference likely being something that doesn't show up in TextPad or in Less (meaning a \n or \r).
