http://www.perlmonks.org?node_id=803989

bajangerry has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have a Perl script that checks a mailbox for messages and saves the message portion to a MySQL database. My problem is that sometimes the message is in HTML format, which causes a bunch of HTML markup to be saved as well. So what I need to figure out is: how do I check whether a message is in HTML format and, if so, how do I convert it to plain text?

Thanks for any input.

Gerry
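
For the first half of the question (detecting HTML), one rough approach is to look at the message's top-level Content-Type header. The sketch below uses only core Perl. It assumes the script has the complete raw message (headers, blank line, body) in a string; `is_html_message` is a hypothetical helper name, not something from the original script. It does not look inside multipart messages, where the HTML usually sits in one of several parts; a MIME parser such as Email::MIME is the safer route for those.

```perl
use strict;
use warnings;

# Return 1 if the top-level Content-Type header declares text/html, else 0.
# ASSUMPTION: $raw holds the whole raw message as fetched from the mailbox.
sub is_html_message {
    my ($raw) = @_;

    # The header block ends at the first blank line.
    my ($headers) = split /\r?\n\r?\n/, $raw, 2;
    return 0 unless defined $headers;

    # Unfold continuation lines (a header may wrap onto indented lines).
    $headers =~ s/\r?\n[ \t]+/ /g;

    return $headers =~ m{^Content-Type:\s*text/html\b}mi ? 1 : 0;
}
```

A message with no Content-Type header at all is treated as not HTML here, which matches the default of text/plain.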

  • Comment on Convert HTML Email message to plain text

Re: Convert HTML Email message to plain text
by bichonfrise74 (Vicar) on Oct 29, 2009 at 19:36 UTC
    Take a look at HTML::Strip; it looks like it will remove the HTML markup from your text.
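
A minimal sketch of how HTML::Strip is typically used, assuming the module is installed from CPAN (it is not part of core Perl). The `$html` string is a made-up stand-in for a message body; the whitespace cleanup at the end is an extra step added here, not something the module requires.

```perl
use strict;
use warnings;
use HTML::Strip;    # CPAN module

# ASSUMPTION: $html stands in for the message body read from the mailbox.
my $html = '<html><body><p>Hello <b>world</b></p></body></html>';

my $hs   = HTML::Strip->new();
my $text = $hs->parse($html);
$hs->eof;    # reset the parser state before reusing it for another message

# Collapse the whitespace left behind where tags were removed.
$text =~ s/\s+/ /g;
$text =~ s/^\s+|\s+$//g;

print "$text\n";
```

Calling `eof` between messages matters when the same object handles many emails in a loop, since the parser otherwise carries state over from the previous document.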
Re: Convert HTML Email message to plain text
by zwon (Abbot) on Oct 29, 2009 at 19:34 UTC
Re: Convert HTML Email message to plain text
by Anonymous Monk on Oct 29, 2009 at 17:18 UTC
      OK, that is more than a little bit confusing for me, as there seem to be 50 ways to do this.
        There's at least 50 ways to do it because there's no "right" answer. You're losing semantic information when you go from HTML to plain text, so you have to be the judge of how lossy you want the transfer to be, and what proxies you want to have in the text form for things that cannot be represented.

        -- Randal L. Schwartz, Perl hacker

        The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
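
To make the point about lossiness and proxies concrete, here is a rough core-Perl sketch in which each step is an explicit decision about what to keep. `html_to_text` and the proxy formats (link target in parentheses, `[image: ...]` for alt text) are choices made up for this illustration, and the regexes are deliberately naive: they only handle double-quoted attributes and will trip over unusual markup, so a real parser is the better tool for production mail.

```perl
use strict;
use warnings;

# Naive, regex-based HTML-to-text conversion (illustration only).
sub html_to_text {
    my ($html) = @_;
    my $text = $html;

    # Scripts and styles have no textual meaning: drop them entirely.
    $text =~ s{<(script|style)\b[^>]*>.*?</\1\s*>}{}gis;

    # Proxy for links: keep the label, append the target in parentheses.
    $text =~ s{<a\b[^>]*\bhref\s*=\s*"([^"]*)"[^>]*>(.*?)</a\s*>}{$2 ($1)}gis;

    # Proxy for images: keep the alt text if present, otherwise nothing.
    $text =~ s{<img\b[^>]*\balt\s*=\s*"([^"]*)"[^>]*>}{[image: $1]}gi;
    $text =~ s{<img\b[^>]*>}{}gi;

    # Proxy for structure: breaks and block-level closers become newlines.
    $text =~ s{<br\s*/?>}{\n}gi;
    $text =~ s{</(?:p|div|li|tr|h[1-6])\s*>}{\n}gi;

    # Every remaining tag (bold, fonts, tables, ...) is simply discarded.
    $text =~ s{<[^>]+>}{}g;

    # Decode a handful of common entities.
    my %entity = (nbsp => ' ', lt => '<', gt => '>', quot => '"', amp => '&');
    $text =~ s{&(nbsp|lt|gt|quot|amp);}{$entity{$1}}g;

    # Tidy up whitespace.
    $text =~ s/[ \t]+/ /g;
    $text =~ s/ ?\n ?/\n/g;
    $text =~ s/\n{3,}/\n\n/g;
    $text =~ s/^\s+|\s+$//g;

    return $text;
}
```

Changing any one of these substitutions (for example, dropping link targets instead of keeping them) gives a different but equally defensible plain-text version, which is why so many modules exist for the job.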

Re: Convert HTML Email message to plain text
by salva (Canon) on Oct 30, 2009 at 12:16 UTC
    You can also use a text browser (such as w3m) to convert the HTML to text.

    For instance, this renders PerlMonks as text:

    w3m -dump http://perlmonks.org
      Just thought of something else as well... I need to remove things like signatures and any images that may be included in the email message, not just HTML.
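
For signatures, one common heuristic is to cut the text at the conventional signature delimiter, a line containing only `-- ` (dash, dash, space). The sketch below is core Perl and works on text that has already been converted to plain text; `strip_signature` is a hypothetical helper name. Many senders never add the delimiter, so this will miss plenty of signatures. Images attached to a message normally travel as separate MIME parts, so keeping only the text part of the message drops them without extra work; inline `<img>` tags disappear with the rest of the HTML markup.

```perl
use strict;
use warnings;

# Remove everything from the first signature delimiter line onward.
# The delimiter is a line holding exactly "-- "; the trailing space is
# made optional here because some mail clients strip it.
sub strip_signature {
    my ($text) = @_;
    $text =~ s/^-- ?\r?\n.*\z//ms;
    return $text;
}
```

Text without a delimiter line is returned unchanged, so the function is safe to apply to every message.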