http://www.perlmonks.org?node_id=845510

pdxperl has asked for the wisdom of the Perl Monks concerning the following question:

I'm reading email from a POP server and logging it into a tracking system. Since the email comes from users, it can be plain text, HTML, or a mix. All I'd like to do is extract the plain text. If it is a MIME-encoded message, just stripping HTML means that I can wind up with two copies of the message (plain text section + filtered HTML section) plus some MIME-boundary stuff, so that doesn't seem like the best way to go. I'm guessing that I need to look at the email, decide if it's MIME-encoded, and then see if there is a plain text section. So there are three cases (?) of emails:

1) Not MIME, so no decoding needed, just read the message body
2) MIME-encoded with a plain text section -> extract the plain text section
3) MIME-encoded, no plain text section, just HTML -> decode the HTML

Seems like a fair amount of effort, so I wondered if someone else has solved this in a better way. I didn't see anything on CPAN that would be a complete solution (i.e., nothing entitled Mail::ReadAnythingAndExtractPlainText).

Re: Reading POP email that may be plain text, HTML, or both
by almut (Canon) on Jun 19, 2010 at 09:34 UTC

    You could use Email::MIME, but I think you'll still have to handle the separate cases yourself. Since multipart messages can in theory be nested arbitrarily deep, a recursive approach is appropriate. The following example returns the first acceptable (i.e. plain text or HTML) part found:

    use Email::MIME;

    # Return the text of the first acceptable (text/plain or text/html) part,
    # descending recursively into multipart containers.
    sub handle_parts {
        my $part = shift;

        # A message without a Content-Type header is plain text by default.
        my $content_type = $part->content_type // 'text/plain';
        #print "Content-Type: $content_type\n"; # debug
        my $body = $part->body;

        if ($content_type =~ m#text/plain#i) {
            return $body;
        }
        elsif ($content_type =~ m#text/html#i) {
            return html2text($body);
        }
        elsif ($content_type =~ m#multipart/#i) {
            for my $subpart ($part->parts) {
                my $text = handle_parts($subpart);
                return $text if defined $text;
            }
        }
        return;
    }

    sub html2text {
        my $html = shift;
        # my $text = ... (left as an exercise)
        # return $text;
        return $html;
    }

    my $message = ...
    my $parsed = Email::MIME->new($message);
    my $text = handle_parts($parsed);
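The html2text stub above is left as an exercise. One possible way to fill it in, assuming the CPAN module HTML::FormatText (from the HTML-Format distribution) is installed, is sketched below. The margin values are arbitrary choices for illustration, and this is only one of several HTML-to-text options on CPAN, not necessarily what the original poster had in mind.

```perl
use strict;
use warnings;
use HTML::FormatText;    # CPAN module; assumed to be installed

# Render an HTML string as plain text.
# The margins are illustrative; adjust them to taste.
sub html2text {
    my $html = shift;
    return HTML::FormatText->format_string(
        $html,
        leftmargin  => 0,
        rightmargin => 72,
    );
}

print html2text('<p>Hello <b>world</b></p>');
```

Because the sub has the same name and signature as the stub, it should drop into handle_parts unchanged.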

    You might need some additional checks to figure out if the first content part is the one you want...
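One such additional check might be to prefer a text/plain part wherever it sits in the message, and fall back to HTML only if no plain part exists. The sketch below assumes a reasonably recent Email::MIME that provides walk_parts and body_str; preferred_text and the sample message are hypothetical names and data made up for illustration. The fallback returns the HTML as-is, so it would still need an HTML-to-text step.

```perl
use strict;
use warnings;
use Email::MIME;

# Return the first text/plain body found anywhere in the message,
# else the first text/html body (still containing markup), else undef.
sub preferred_text {
    my $email = shift;
    my ($plain, $html);

    $email->walk_parts(sub {
        my $part = shift;
        return if $part->subparts;    # skip containers, look at leaves only

        # No Content-Type header means plain text by default.
        my $type = $part->content_type // 'text/plain';
        if ($type =~ m{^\s*text/plain}i) {
            $plain //= $part->body_str;
        }
        elsif ($type =~ m{^\s*text/html}i) {
            $html //= $part->body_str;
        }
    });

    return $plain // $html;
}

# Hypothetical sample message with the HTML part listed first.
my $raw = join "\n",
    'From: user@example.com',
    'MIME-Version: 1.0',
    'Content-Type: multipart/alternative; boundary="b1"',
    '',
    '--b1',
    'Content-Type: text/html; charset=us-ascii',
    '',
    '<p>Hello from <b>HTML</b></p>',
    '--b1',
    'Content-Type: text/plain; charset=us-ascii',
    '',
    'Hello from plain text',
    '--b1--',
    '';

print preferred_text(Email::MIME->new($raw)), "\n";
```

Unlike a first-match search, this visits every leaf part before deciding, so the order of the alternatives in the message does not matter.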

      Thanks! This approach worked quite well. It seems like it would be a useful function to add to one of the many email modules on CPAN.
Re: Reading POP email that may be plain text, HTML, or both
by Krambambuli (Curate) on Jun 19, 2010 at 11:45 UTC
    Have a look at MIME::Tools and MIME::Entity. Parse the message, check whether each part is text/plain, and pick out the parts you need. For HTML-only messages you'll probably have to decide whether to ignore them or transform the HTML to text.
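A minimal sketch of that suggestion, assuming the MIME-tools distribution (MIME::Parser and MIME::Entity) is installed, might look like the following. The helper name first_part_of_type and the sample message are hypothetical, and the caller decides what to do when only HTML is present.

```perl
use strict;
use warnings;
use MIME::Parser;

# Depth-first search for the first leaf part of the wanted MIME type.
# Returns its decoded body, or undef if there is no such part.
sub first_part_of_type {
    my ($entity, $wanted) = @_;

    if (my @parts = $entity->parts) {
        for my $sub (@parts) {
            my $text = first_part_of_type($sub, $wanted);
            return $text if defined $text;
        }
        return;
    }

    return unless lc($entity->effective_type) eq $wanted;
    my $bh = $entity->bodyhandle or return;
    return $bh->as_string;
}

my $parser = MIME::Parser->new;
$parser->output_to_core(1);    # keep bodies in memory, write no temp files

# Hypothetical sample message for illustration.
my $raw = join "\n",
    'From: user@example.com',
    'MIME-Version: 1.0',
    'Content-Type: multipart/alternative; boundary="b1"',
    '',
    '--b1',
    'Content-Type: text/plain; charset=us-ascii',
    '',
    'Hello from plain text',
    '--b1',
    'Content-Type: text/html; charset=us-ascii',
    '',
    '<p>Hello from <b>HTML</b></p>',
    '--b1--',
    '';

my $entity = $parser->parse_data($raw);
my $plain  = first_part_of_type($entity, 'text/plain');
my $html   = first_part_of_type($entity, 'text/html');

print defined $plain ? $plain : "no plain part; HTML-only message\n";
```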