Reading POP email that may be plain text, HTML, or both

pdxperl has asked for the wisdom of the Perl Monks concerning the following question:

I'm reading email from a POP server and logging it into a tracking system. Since the email is coming from users, it can be plain text, or HTML, or a mix. All I'd like to do is extract the plain text.

If it is a MIME-encoded message, just stripping HTML means that I can wind up with two copies of the message (plain text section + filtered HTML section) plus some MIME-boundary stuff, so that doesn't seem like the best way to go. I'm guessing that I need to look at the email, decide if it's MIME-encoded, and then see if there is a plain text section. So there are three cases (?) of emails:

1) Not MIME, so no decoding needed, just read the message body
2) MIME-encoded with a plain text section -> extract the plain text section
3) MIME-encoded, no plain text section, just HTML -> decode the HTML

Seems like a fair amount of effort, so I wondered if someone else has solved this in a better way. I didn't see anything on CPAN that would be a complete solution (i.e., nothing entitled Mail::ReadAnythingAndExtractPlainText).

Re: Reading POP email that may be plain text, HTML, or both by almut (Canon) on Jun 19, 2010 at 09:34 UTC
You could use Email::MIME, but I think you'll somehow have to handle the separate cases... As multipart messages can in theory be arbitrarily nested, a recursive approach is appropriate. The following example returns the first acceptable (i.e. plain text or HTML) part found:

    use Email::MIME;

    sub handle_parts {
        my $part = shift;

        # A message without a Content-Type header (plain, non-MIME mail)
        # is treated as text/plain, the RFC 2045 default.
        my $content_type = $part->content_type || 'text/plain';
        #print "Content-Type: $content_type\n"; # debug
        my $body = $part->body;

        # MIME types are case-insensitive, hence the /i
        if ($content_type =~ m#text/plain#i) {
            return $body;
        }
        elsif ($content_type =~ m#text/html#i) {
            return html2text($body);
        }
        elsif ($content_type =~ m#multipart/#i) {
            # descend into nested parts, depth first
            for my $subpart ($part->parts) {
                my $text = handle_parts($subpart);
                return $text if defined $text;
            }
        }
        return;
    }

    sub html2text {
        my $html = shift;
        # my $text = ... (left as an exercise)
        # return $text;
        return $html;
    }

    my $message = ...
    my $parsed = Email::MIME->new($message);
    my $text = handle_parts($parsed);

You might need some additional checks to figure out if the first content part is the one you want...
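For the html2text step left as an exercise above, one possible sketch follows. It assumes the CPAN modules HTML::TreeBuilder and HTML::FormatText are installed; neither is named in this thread, so treat the choice as an assumption rather than the poster's intended solution. The margin settings are arbitrary.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Convert an HTML string to wrapped plain text.
# Sketch only: module choice and margins are assumptions, not from the thread.
sub html2text {
    my $html = shift;

    # Parse the markup into a tree of HTML::Element objects
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Render the tree as plain text, wrapped at column 72
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
    my $text = $formatter->format($tree);

    $tree->delete;    # release the parse tree explicitly
    return $text;
}

print html2text('<p>First</p><p>Second <i>para</i></p>');
```

This drops straight into the handle_parts example in place of the stub. Note that it receives whatever $part->body returns, so character-set decoding is still up to the caller.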
Re^2: Reading POP email that may be plain text, HTML, or both by pdxperl (Sexton) on Jun 20, 2010 at 03:08 UTC
Thanks! This approach worked quite well. It seems like it would be a useful function to add to one of the many email modules on CPAN.
Re: Reading POP email that may be plain text, HTML, or both by Krambambuli (Curate) on Jun 19, 2010 at 11:45 UTC
Have a look at MIME::Tools and MIME::Entity. Parse the message, then check the parts to see if they are text/plain, and pick up the parts you need. For HTML-only messages you'll probably have to decide if you want to ignore them or transform HTML to text.
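A minimal sketch of this suggestion, assuming MIME::Parser from the MIME-tools distribution is installed. The helper name first_plain_text and the sample message are hypothetical, made up for illustration; HTML-only messages simply yield undef here, leaving the ignore-or-convert decision to the caller.

```perl
use strict;
use warnings;
use MIME::Parser;

# Return the decoded body of the first text/plain part, or undef if none.
# (Hypothetical helper name, not part of MIME-tools.)
sub first_plain_text {
    my $entity = shift;

    # Container entities: search the child parts recursively
    if ($entity->parts) {
        for my $part ($entity->parts) {
            my $text = first_plain_text($part);
            return $text if defined $text;
        }
        return;
    }

    # Leaf entities: accept only text/plain
    return unless $entity->effective_type eq 'text/plain';
    my $body = $entity->bodyhandle or return;
    return $body->as_string;
}

# Made-up sample message with a plain text and an HTML alternative
my $raw = <<'EOM';
From: user@example.com
Subject: test
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="b1"

--b1
Content-Type: text/plain; charset=us-ascii

Hello in plain text.
--b1
Content-Type: text/html; charset=us-ascii

<p>Hello in <b>HTML</b>.</p>
--b1--
EOM

my $parser = MIME::Parser->new;
$parser->output_to_core(1);    # keep decoded bodies in memory, not in files

my $entity = $parser->parse_data($raw);
my $text   = first_plain_text($entity);
print defined $text ? $text : "(no text/plain part found)\n";
```

Because effective_type falls back to text/plain when there is no Content-Type header, a non-MIME message should go through the same leaf branch as an explicit text/plain part.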
