Parsing mail(mail::message)

by cdlaforc (Novice)
on Dec 05, 2012 at 22:23 UTC
cdlaforc has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I'm trying to create a Perl job that reads a mail message from standard input and parses out a few details: the From header, the Subject, and the body. I think I will be okay with From and Subject, but I need a little help with the body. I am currently using the Mail::Message module just to output the body of the message.
    #!/usr/bin/perl -w
    use strict;
    use Mail::Message;

    my $msg = Mail::Message->read(\*STDIN);
    print $msg->body;
If I send an email from my gmail account the message looks pretty simple, and I don't believe I would have any trouble parsing it.

Example gmail email:
    Yes

    On 12/5/12, xxxxxxxxxxxx@gundluth.org <xxxxxxxxxxxx@gundluth.org> wrote:
    > Test
But if I send an email from my Outlook account, it appears to send two versions of the message: one with a content type of text/plain and one with a content type of text/html. I'm wondering what the best way to handle these differences is.

Example outlook email:
--_000_4AADDBF32F49F74ABBE5DFCBCC0B0193121F0F73ONEXMB06xxxxxxx_
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

yesno

Thanks.
--------------------------------------------------
Chris

From: xxxxxxxxxxxx@xxxxxxxx.org [mailto:xxxxxxxxxxxx@xxxxxxxx.org]
Sent: Friday, November 30, 2012 7:56 AM
To: yyyyyyy, Chris D
Subject: hello

Test

--_000_4AADDBF32F49F74ABBE5DFCBCC0B0193121F0F73ONEXMB06xxxxxxx_
Content-Type: text/html; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

<html xmlns:v=3D"urn:schemas-microsoft-com:vml" xmlns:o=3D"urn:schemas-micr=
osoft-com:office:office" xmlns:w=3D"urn:schemas-microsoft-com:office:word" =
xmlns:x=3D"urn:schemas-microsoft-com:office:excel" xmlns:m=3D"http://schema=
s.microsoft.com/office/2004/12/omml" xmlns=3D"http://www.w3.org/TR/REC-html=
40">
<head>
<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Dus-ascii"=
>
<meta name=3D"Generator" content=3D"Microsoft Word 14 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face {font-family:Calibri; panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face {font-family:Tahoma; panose-1:2 11 6 4 3 5 4 4 2 4;}
@font-face {font-family:"Bookman Old Style"; panose-1:2 5 6 4 5 5 5 2 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal {margin:0in; margin-bottom:.0001pt; font-size:12.0pt; font-family:"Times New Roman","serif";}
a:link, span.MsoHyperlink {mso-style-priority:99; color:blue; text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed {mso-style-priority:99; color:purple; text-decoration:underline;}
span.EmailStyle17 {mso-style-type:personal-reply; color:black;}
.MsoChpDefault {mso-style-type:export-only; font-family:"Calibri","sans-serif";}
@page WordSection1 {size:8.5in 11.0in; margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1 {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext=3D"edit" spidmax=3D"1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext=3D"edit">
<o:idmap v:ext=3D"edit" data=3D"1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang=3D"EN-US" link=3D"blue" vlink=3D"purple">
<div class=3D"WordSection1">
<p class=3D"MsoNormal"><span style=3D"color:black">yesno<o:p></o:p></span><=
/p>
<p class=3D"MsoNormal"><span style=3D"color:black"><o:p>&nbsp;</o:p></span>=
</p>
<p class=3D"MsoNormal"><b><span style=3D"font-size:11.0pt;font-family:&quot=
;Calibri&quot;,&quot;sans-serif&quot;;color:#404040"><o:p>&nbsp;</o:p></spa=
n></b></p>
<p class=3D"MsoNormal"><b><span style=3D"font-size:11.0pt;font-family:&quot=
;Calibri&quot;,&quot;sans-serif&quot;;color:#404040">Thanks.<o:p></o:p></sp=
an></b></p>
<p class=3D"MsoNormal"><b><span style=3D"font-size:11.0pt;font-family:&quot=
;Calibri&quot;,&quot;sans-serif&quot;;color:#404040">----------------------=
----------------------------<o:p></o:p></span></b></p>
<p class=3D"MsoNormal"><b><span style=3D"font-size:11.0pt;font-family:&quot=
;Calibri&quot;,&quot;sans-serif&quot;;color:#404040">Chris yyyyyyy<o:p></o:=
p></span></b></p>
<p class=3D"MsoNormal"><b><span style=3D"font-size:11.0pt;font-family:&quot=
;Calibri&quot;,&quot;sans-serif&quot;;color:#404040"> =
<o:p></o:p></span></b></p>
<p class=3D"MsoNormal" style=3D"line-height:12.0pt;text-autospace:none"><sp=
an style=3D"font-size:11.0pt;font-family:&quot;Bookman Old Style&quot;,&quo=
t;serif&quot;;color:#4F6228">xxxxxxxxx</span><span style=3D"font-size:11.0p=
t;font-family:&quot;Bookman Old Style&quot;,&quot;serif&quot;;color:#76923C=
"> </span><span style=3D"font-size:11.0pt;font-family:&quot;Bookman Old Style&=
quot;,&quot;serif&quot;;color:#365F91"> </span><span style=3D"font-s=
ize:11.0pt;font-family:&quot;Bookman Old Style&quot;,&quot;serif&quot;;colo=
r:black"> <o:p></o:p></span></p>
<p class=3D"MsoNormal"><b><span style=3D"font-size:11.0pt;font-family:&quot=
;Calibri&quot;,&quot;sans-serif&quot;;color:#404040"> ; Mailstop=
: NCA2-02</span></b><b><span style=3D"font-size:11.0pt;font-family:&quot;Ca=
libri&quot;,&quot;sans-serif&quot;;color:#404040"><o:p></o:p></span></b></p=
>
<p class=3D"MsoNormal"><b><span style=3D"font-size:11.0pt;font-family:&quot=
;Calibri&quot;,&quot;sans-serif&quot;;color:#404040"><a href=3D"mailto:yyyy=
yyyy@yyyyyyyy.org"><span style=3D"color:blue">yyyyyyyy@yyyyyyyy.org</span><=
/a></span></b><span style=3D"font-size:11.0pt;font-family:&quot;Calibri&quo=
t;,&quot;sans-serif&quot;;color:black"><o:p></o:p></span></p>
<p class=3D"MsoNormal"><span style=3D"color:black"><o:p>&nbsp;</o:p></span>=
</p>
<p class=3D"MsoNormal"><b><span style=3D"font-size:10.0pt;font-family:&quot=
;Tahoma&quot;,&quot;sans-serif&quot;">From:</span></b><span style=3D"font-s=
ize:10.0pt;font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;"> xxxxxxxx=
xxxx@xxxxxxxx.org [mailto:xxxxxxxxxxxx@xxxxxxxx.org] <br>
<b>Sent:</b> Friday, November 30, 2012 7:56 AM<br>
<b>To:</b> yyyyyyy, Chris D<br>
<b>Subject:</b> hello<o:p></o:p></span></p>
<p class=3D"MsoNormal"><o:p>&nbsp;</o:p></p>
<p class=3D"MsoNormal"><b><span style=3D"font-size:36.0pt">T</span>est</b><=
o:p></o:p></p>
</div>
</body>
</html>

--_000_4AADDBF32F49F74ABBE5DFCBCC0B0193121F0F73ONEXMB06xxxxxxx_--
Please let me know if I can offer any more details. Any examples would help tremendously. Thanks, Chris.

Re: Parsing mail(mail::message)
by GrandFather (Sage) on Dec 05, 2012 at 23:20 UTC

    Outlook (like most email clients) is sending your HTML/RTF email body along with a plain-text rendition of the same message. I'm actually surprised that Gmail isn't; you must have it configured to send plain text only.
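To make the two-part layout concrete, here is a minimal sketch using only core Perl. It splits a body on its boundary lines and decodes quoted-printable parts with MIME::QuotedPrint. The boundary string, the sample text, and the helper name `parts_by_type` are invented for illustration; it ignores nested multiparts, other transfer encodings, and charsets, so it shows what a MIME parser does for you rather than replacing one.

```perl
use strict;
use warnings;
use 5.010;
use MIME::QuotedPrint qw(decode_qp);

# Hypothetical two-part body modelled on the Outlook example above.
# The boundary and the contents are made up for illustration.
my $boundary = 'BOUNDARY_1';
my $body = join "\n",
    "--$boundary",
    'Content-Type: text/plain; charset="us-ascii"',
    'Content-Transfer-Encoding: quoted-printable',
    '',
    'yesno =',
    'Thanks.',
    "--$boundary",
    'Content-Type: text/html; charset="us-ascii"',
    'Content-Transfer-Encoding: quoted-printable',
    '',
    '<p class=3D"MsoNormal">yesno</p>',
    "--$boundary--",
    '';

# Split the body on boundary lines and index each part by content type.
# The first part seen for a given type wins.
sub parts_by_type {
    my ($text, $boundary) = @_;
    my %by_type;
    for my $chunk (split /^--\Q$boundary\E(?:--)?\n/m, $text) {
        next unless $chunk =~ /\S/;

        # Part headers and part content are separated by one blank line.
        my ($head, $content) = split /\n\n/, $chunk, 2;
        $content //= '';

        my ($type) = $head =~ /^Content-Type:\s*([^;\s]+)/mi;
        next unless defined $type;

        # Undo "=3D" escapes and "=" soft line breaks.
        $content = decode_qp($content)
            if $head =~ /^Content-Transfer-Encoding:\s*quoted-printable/mi;

        $by_type{ lc $type } //= $content;
    }
    return %by_type;
}

my %part = parts_by_type($body, $boundary);
print $part{'text/plain'};
```

For real mail, let a proper MIME module find the boundary and handle nesting; the point here is only that both renditions carry the same message and the plain-text one is the easier to work with.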

    I use the following code in a system that parses commands sent by email to an automated build and test system:

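    use strict;
    use warnings;
    use MIME::Parser;
    use HTML::TreeBuilder;

    # Parse a raw email string into a hash of header fields plus a text body.
    sub ParseEmail {
        my ($emailStr) = @_;
        my $parser = MIME::Parser->new;
        my %fields;

        # Keep everything in memory instead of writing temp files to disk.
        $parser->tmp_to_core(1);
        $parser->output_to_core(1);

        my $entity = $parser->parse_data($emailStr);
        my @parts  = $entity->parts();
        my $head   = $entity->head();

        $fields{subject} = $head->get('subject') // '';
        $fields{subject} =~ s/^\s*(re:\s*)+//i;    # strip leading "Re:" prefixes
        $fields{from} = $head->get('from') // '';
        $fields{from} =~ s/^"([^"]+)"/$1/;         # drop quotes around the display name
        $fields{ccList} = $head->get('Cc') // '';
        $fields{to}     = $head->get('To') // '';
        $fields{date}   = $head->get('Date') // '';

        if (!@parts) {
            # Single-part message: the entity itself holds the body.
            $fields{body} = $entity->bodyhandle()->as_string();
        }
        else {
            $fields{body} = _parseParts(@parts);
        }

        return %fields;
    }

    # Prefer a text/plain part; otherwise fall back to text pulled from HTML.
    sub _parseParts {
        my $savedText = '';

        for my $part (@_) {
            my $type = $part->effective_type();

            if (-1 < index $type, 'multipart') {
                # Nested multipart: recurse into its sub-parts.
                my @subParts = $part->parts();
                $savedText = _parseParts(@subParts);
            }
            elsif ($type eq 'text/plain') {
                # bodyhandle gives the decoded content; stringify_body would
                # return the still-encoded (e.g. quoted-printable) body.
                return $part->bodyhandle()->as_string();
            }
            elsif ($type eq 'text/html') {
                my $str  = $part->bodyhandle()->as_string();
                my $tree = HTML::TreeBuilder->new_from_content($str);
                $savedText = $tree->as_text();
                $tree->delete();    # free the parse tree
            }
        }

        return $savedText;
    }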

    Note that the heavy lifting is done by MIME::Parser and HTML::TreeBuilder. _parseParts returns the first plain-text part it finds at a given level; if there is none, it falls back to the text extracted from an HTML part.
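Since the question started from Mail::Message, the same "prefer text/plain" idea can be sketched with that module instead. This is an untested outline, assuming Mail::Message (from the Mail::Box distribution) is installed; the helper name `plain_text_body` and the sample message are made up for illustration, and it does not fall back to HTML the way the code above does.

```perl
use strict;
use warnings;
use 5.010;
use Mail::Message;

# Return the decoded text of the first text/plain part, or of the whole
# message when it is not multipart. Returns '' if no plain part exists.
sub plain_text_body {
    my ($msg) = @_;
    return $msg->decoded->string unless $msg->isMultipart;

    # 'RECURSE' walks nested multiparts and yields the leaf parts.
    for my $part ($msg->parts('RECURSE')) {
        next unless $part->contentType eq 'text/plain';
        return $part->decoded->string;
    }
    return '';
}

# Hypothetical sample message; addresses and boundary are invented.
my $raw = join "\n",
    'From: sender@example.org',
    'Subject: hello',
    'MIME-Version: 1.0',
    'Content-Type: multipart/alternative; boundary="BOUNDARY_1"',
    '',
    '--BOUNDARY_1',
    'Content-Type: text/plain; charset="us-ascii"',
    'Content-Transfer-Encoding: quoted-printable',
    '',
    'yesno',
    '--BOUNDARY_1',
    'Content-Type: text/html; charset="us-ascii"',
    'Content-Transfer-Encoding: quoted-printable',
    '',
    '<p class=3D"MsoNormal">yesno</p>',
    '--BOUNDARY_1--',
    '';

my $msg = Mail::Message->read($raw);
print 'From: ',    ($msg->get('From') // ''), "\n";
print 'Subject: ', $msg->subject, "\n";
print plain_text_body($msg);
```

To read a piped message as in the original script, pass `\*STDIN` to `Mail::Message->read` instead of the sample string.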

    True laziness is hard work
