http://www.perlmonks.org?node_id=1007419

cdlaforc has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I'm trying to create a Perl job that will read a mail message from standard input and parse out a few details such as the from property, the subject, and the body. I think I will be okay with the from and subject properties, but I need a little help with the body portion. I am currently using the Mail::Message module to just output the body of the message.

    #!/usr/bin/perl -w
    use strict;
    use Mail::Message;

    my $msg = Mail::Message->read(\*STDIN);
    print $msg->body;

If I send an email from my Gmail account, the message looks pretty simple, and I don't believe I would have any trouble parsing it.

Example Gmail email:

    Yes

    On 12/5/12, xxxxxxxxxxxx@gundluth.org <xxxxxxxxxxxx@gundluth.org> wrote:
    > Test

But if I send an email from my Outlook account, it appears to send two versions of the email: one with a content type of text/plain and one with a content type of text/html. I'm wondering what the best way to handle these differences is.

Example Outlook email:

    --_000_4AADDBF32F49F74ABBE5DFCBCC0B0193121F0F73ONEXMB06xxxxxxx_
    Content-Type: text/plain; charset="us-ascii"
    Content-Transfer-Encoding: quoted-printable

    yesno

    Thanks.
    --------------------------------------------------
    Chris

    From: xxxxxxxxxxxx@xxxxxxxx.org [mailto:xxxxxxxxxxxx@xxxxxxxx.org]
    Sent: Friday, November 30, 2012 7:56 AM
    To: yyyyyyy, Chris D
    Subject: hello

    Test

    --_000_4AADDBF32F49F74ABBE5DFCBCC0B0193121F0F73ONEXMB06xxxxxxx_
    Content-Type: text/html; charset="us-ascii"
    Content-Transfer-Encoding: quoted-printable

    <html xmlns:v=3D"urn:schemas-microsoft-com:vml" xmlns:o=3D"urn:schemas-micr=
    osoft-com:office:office" xmlns:w=3D"urn:schemas-microsoft-com:office:word" =
    xmlns:x=3D"urn:schemas-microsoft-com:office:excel" xmlns:m=3D"http://schema=
    s.microsoft.com/office/2004/12/omml" xmlns=3D"http://www.w3.org/TR/REC-html=
    40">
    <head>
    <meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Dus-ascii"=
    >
    <meta name=3D"Generator" content=3D"Microsoft Word 14 (filtered medium)">
    <style><!--
    /* Font Definitions */
    @font-face
    {font-family:Calibri;
    panose-1:2 15 5 2 2 2 4 3 2 4;}
    @font-face
    {font-family:Tahoma;
    panose-1:2 11 6 4 3 5 4 4 2 4;}
    @font-face
    {font-family:"Bookman Old Style";
    panose-1:2 5 6 4 5 5 5 2 2 4;}
    /* Style Definitions */
    p.MsoNormal, li.MsoNormal, div.MsoNormal
    {margin:0in;
    margin-bottom:.0001pt;
    font-size:12.0pt;
    font-family:"Times New Roman","serif";}
    a:link, span.MsoHyperlink
    {mso-style-priority:99;
    color:blue;
    text-decoration:underline;}
    a:visited, span.MsoHyperlinkFollowed
    {mso-style-priority:99;
    color:purple;
    text-decoration:underline;}
    span.EmailStyle17
    {mso-style-type:personal-reply;
    color:black;}
    .MsoChpDefault
    {mso-style-type:export-only;
    font-family:"Calibri","sans-serif";}
    @page WordSection1
    {size:8.5in 11.0in;
    margin:1.0in 1.0in 1.0in 1.0in;}
    div.WordSection1
    {page:WordSection1;}
    --></style><!--[if gte mso 9]><xml>
    <o:shapedefaults v:ext=3D"edit" spidmax=3D"1026" />
    </xml><![endif]--><!--[if gte mso 9]><xml>
    <o:shapelayout v:ext=3D"edit">
    <o:idmap v:ext=3D"edit" data=3D"1" />
    </o:shapelayout></xml><![endif]-->
    </head>
    <body lang=3D"EN-US" link=3D"blue" vlink=3D"purple">
    <div class=3D"WordSection1">
    <p class=3D"MsoNormal"><span style=3D"color:black">yesno<o:p></o:p></span><=
    /p>
    <p class=3D"MsoNormal"><span style=3D"color:black"><o:p>&nbsp;</o:p></span>=
    </p>
    <p class=3D"MsoNormal"><b><span style=3D"font-size:11.0pt;font-family:&quot=
    ;Calibri&quot;,&quot;sans-serif&quot;;color:#404040"><o:p>&nbsp;</o:p></spa=
    n></b></p>
    <p class=3D"MsoNormal"><b><span style=3D"font-size:11.0pt;font-family:&quot=
    ;Calibri&quot;,&quot;sans-serif&quot;;color:#404040">Thanks.<o:p></o:p></sp=
    an></b></p>
    <p class=3D"MsoNormal"><b><span style=3D"font-size:11.0pt;font-family:&quot=
    ;Calibri&quot;,&quot;sans-serif&quot;;color:#404040">----------------------=
    ----------------------------<o:p></o:p></span></b></p>
    <p class=3D"MsoNormal"><b><span style=3D"font-size:11.0pt;font-family:&quot=
    ;Calibri&quot;,&quot;sans-serif&quot;;color:#404040">Chris yyyyyyy<o:p></o:=
    p></span></b></p>
    <p class=3D"MsoNormal"><b><span style=3D"font-size:11.0pt;font-family:&quot=
    ;Calibri&quot;,&quot;sans-serif&quot;;color:#404040"> =
    <o:p></o:p></span></b></p>
    <p class=3D"MsoNormal" style=3D"line-height:12.0pt;text-autospace:none"><sp=
    an style=3D"font-size:11.0pt;font-family:&quot;Bookman Old Style&quot;,&quo=
    t;serif&quot;;color:#4F6228">xxxxxxxxx</span><span style=3D"font-size:11.0p=
    t;font-family:&quot;Bookman Old Style&quot;,&quot;serif&quot;;color:#76923C=
    "> </span><span style=3D"font-size:11.0pt;font-family:&quot;Bookman Old Style&=
    quot;,&quot;serif&quot;;color:#365F91"> </span><span style=3D"font-s=
    ize:11.0pt;font-family:&quot;Bookman Old Style&quot;,&quot;serif&quot;;colo=
    r:black"> <o:p></o:p></span></p>
    <p class=3D"MsoNormal"><b><span style=3D"font-size:11.0pt;font-family:&quot=
    ;Calibri&quot;,&quot;sans-serif&quot;;color:#404040"> ; Mailstop=
    : NCA2-02</span></b><b><span style=3D"font-size:11.0pt;font-family:&quot;Ca=
    libri&quot;,&quot;sans-serif&quot;;color:#404040"><o:p></o:p></span></b></p=
    >
    <p class=3D"MsoNormal"><b><span style=3D"font-size:11.0pt;font-family:&quot=
    ;Calibri&quot;,&quot;sans-serif&quot;;color:#404040"><a href=3D"mailto:yyyy=
    yyyy@yyyyyyyy.org"><span style=3D"color:blue">yyyyyyyy@yyyyyyyy.org</span><=
    /a></span></b><span style=3D"font-size:11.0pt;font-family:&quot;Calibri&quo=
    t;,&quot;sans-serif&quot;;color:black"><o:p></o:p></span></p>
    <p class=3D"MsoNormal"><span style=3D"color:black"><o:p>&nbsp;</o:p></span>=
    </p>
    <p class=3D"MsoNormal"><b><span style=3D"font-size:10.0pt;font-family:&quot=
    ;Tahoma&quot;,&quot;sans-serif&quot;">From:</span></b><span style=3D"font-s=
    ize:10.0pt;font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;"> xxxxxxxx=
    xxxx@xxxxxxxx.org [mailto:xxxxxxxxxxxx@xxxxxxxx.org] <br>
    <b>Sent:</b> Friday, November 30, 2012 7:56 AM<br>
    <b>To:</b> yyyyyyy, Chris D<br>
    <b>Subject:</b> hello<o:p></o:p></span></p>
    <p class=3D"MsoNormal"><o:p>&nbsp;</o:p></p>
    <p class=3D"MsoNormal"><b><span style=3D"font-size:36.0pt">T</span>est</b><=
    o:p></o:p></p>
    </div>
    </body>
    </html>

    --_000_4AADDBF32F49F74ABBE5DFCBCC0B0193121F0F73ONEXMB06xxxxxxx_--

Please let me know if I can offer any more details. Any examples would help tremendously. Thanks, Chris.

Replies are listed 'Best First'.
Re: Parsing mail(mail::message)
by GrandFather (Saint) on Dec 05, 2012 at 23:20 UTC

    Outlook (and most email clients) is sending your HTML/RTF email body along with a plain text rendition of the email. I'm surprised actually that gmail isn't - you must have it configured to send plain text only.
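
Both parts of the Outlook sample are also marked Content-Transfer-Encoding: quoted-printable, which is where the =3D sequences and the lines ending in a bare = come from. As a minimal sketch of what that encoding means (a MIME-aware parser would normally undo it for you), the core MIME::QuotedPrint module can decode it. The string below is a shortened fragment of the HTML part from the question, chosen only for illustration.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);

# Shortened fragment of the quoted-printable HTML part from the Outlook sample.
# "=3D" encodes a literal "=", and a trailing "=" marks a soft line break.
my $encoded = <<'END';
<p class=3D"MsoNormal"><span style=3D"color:black">yesno<o:p></o:p></span><=
/p>
END

# decode_qp() turns "=3D" back into "=" and joins soft-broken lines.
my $decoded = decode_qp($encoded);
print $decoded;
```

After decoding, the two physical lines become one line of ordinary HTML, which is the form you would want to hand to an HTML parser.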

    I use the following code in a system that parses commands sent by email to an automated build and test system:

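    use strict;
    use warnings;
    use MIME::Parser;
    use HTML::TreeBuilder;

    # Parse a raw email (headers and body in one string) into a hash of fields.
    sub ParseEmail {
        my ($emailStr) = @_;
        my $parser = MIME::Parser->new;
        my %fields;

        # Keep everything in memory rather than writing parts to temp files.
        $parser->tmp_to_core(1);
        $parser->output_to_core(1);

        my $entity = $parser->parse_data($emailStr);
        my @parts  = $entity->parts();
        my $head   = $entity->head();

        $fields{subject} = $head->get('subject') // '';
        $fields{subject} =~ s/^\s*(re:\s*)+//i;    # strip leading "Re:" prefixes
        $fields{from} = $head->get('from') // '';
        $fields{from} =~ s/^"([^"]+)"/$1/;         # drop quotes around the display name
        $fields{ccList} = $head->get('Cc')   // '';
        $fields{to}     = $head->get('To')   // '';
        $fields{date}   = $head->get('Date') // '';

        if (!@parts) {
            # Single-part message: the entity's own body is the text.
            $fields{body} = $entity->bodyhandle()->as_string();
        }
        else {
            $fields{body} = _parseParts(@parts);
        }

        return %fields;
    }

    # Return the first text/plain part, else the text extracted from an HTML part.
    sub _parseParts {
        my $savedText = '';

        for my $part (@_) {
            my $type = $part->effective_type();

            if (-1 < index $type, 'multipart') {
                # Nested multipart: recurse into its sub-parts.
                my @subParts = $part->parts();
                $savedText = _parseParts(@subParts);
            }
            elsif ($type eq 'text/plain') {
                # bodyhandle() gives the decoded text; stringify_body() would
                # return it still in its transfer encoding (e.g. quoted-printable).
                return $part->bodyhandle()->as_string();
            }
            elsif ($type eq 'text/html') {
                my $str  = $part->bodyhandle()->as_string();
                my $tree = HTML::TreeBuilder->new_from_content($str);
                $savedText = $tree->as_text();
                $tree->delete();    # free the parse tree
            }
        }

        return $savedText;
    }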

    Note that the heavy lifting is done by MIME::Parser and HTML::TreeBuilder. _parseParts returns the first plain text part or the text of the first HTML part it finds.
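
The same "prefer the plain text part" idea can probably be expressed with the Mail::Message module from the original question. The following is only a rough sketch under stated assumptions: the sample message, its addresses, and the helper name plain_text_body are made up for illustration, and it has not been run against the real Gmail or Outlook messages shown above.

```perl
use strict;
use warnings;
use Mail::Message;

# Hypothetical two-part message, shaped like the Outlook sample but much shorter.
my $sample = <<'END';
From: Chris <chris@example.org>
Subject: hello
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

yes=3Dno

--XYZ
Content-Type: text/html; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

<p>yes=3Dno</p>

--XYZ--
END

# Return the decoded text of the first text/plain part, or '' if there is none.
# parts('RECURSE') walks nested multiparts; a single-part message yields itself.
sub plain_text_body {
    my ($msg) = @_;
    for my $part ($msg->parts('RECURSE')) {
        next unless lc($part->contentType) eq 'text/plain';
        return $part->decoded->string;    # transfer encoding removed
    }
    return '';
}

my $msg = Mail::Message->read($sample);
print "From:    ", join(', ', map { $_->address } $msg->from), "\n";
print "Subject: ", $msg->subject, "\n";
print plain_text_body($msg);
```

To read a real message from standard input, as in the original script, pass \*STDIN to read() instead of the sample string. A message with only an HTML part would return an empty string here, so a fallback through something like HTML::TreeBuilder would still be needed for that case.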

    True laziness is hard work