Re: Distinguish between HTML and Plain text

by ikegami (Pope)
on Sep 26, 2011 at 23:11 UTC (#927973)

in reply to Distinguish between HTML and Plain text

Impossible. At best, you can take a guess. But you can guess very reliably because HTML must have an HTML element.

If you don't know whether it's text or HTML, then you're surely dealing with bytes, so you need to handle UTF-16le, UTF-16be, UCS-2le, UCS-2be, UCS-4le, UCS-4be:
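A rough sketch of that byte-level guess, written in Python rather than Perl and not the code from the original post. It assumes the guess is "does an `<html` start tag appear anywhere, in any of the listed encodings". The names `ENCODINGS` and `looks_like_html_bytes` are made up for illustration; UCS-2 and UCS-4 are covered by Python's `utf-16-*` and `utf-32-*` codecs, and the `ascii` entry stands in for any ASCII-compatible encoding such as UTF-8.

```python
import re

# Encodings to probe. "ascii" covers every ASCII-compatible encoding
# (UTF-8, Latin-1, ...), since "<html" is encoded the same way in all of them.
ENCODINGS = ["ascii", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]

def _pattern_for(encoding):
    """Build a case-insensitive byte pattern for "<html" in one encoding."""
    parts = []
    for ch in "<html":
        lower = re.escape(ch.lower().encode(encoding))
        upper = re.escape(ch.upper().encode(encoding))
        parts.append(b"(?:" + lower + b"|" + upper + b")")
    return re.compile(b"".join(parts))

PATTERNS = [_pattern_for(enc) for enc in ENCODINGS]

def looks_like_html_bytes(data: bytes) -> bool:
    """Guess: True if an "<html" start tag appears in any probed encoding."""
    return any(p.search(data) for p in PATTERNS)
```

This only guesses. It says nothing about HTML fragments that lack an `<html` tag, and it will also match plain text that merely mentions the tag.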


If you're somehow dealing with decoded text:
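For already-decoded text the same guess reduces to a single case-insensitive search. This is a minimal Python sketch, not the code from the original post; `looks_like_html_text` is a hypothetical name.

```python
import re

# Match "<html" followed by a word boundary, so "<html>" and "<HTML lang=...>"
# match but "<htmlfoo" does not.
HTML_TAG_RE = re.compile(r"<html\b", re.IGNORECASE)

def looks_like_html_text(text: str) -> bool:
    """Guess: True if the decoded text contains an "<html" start tag."""
    return HTML_TAG_RE.search(text) is not None
```

Note that `looks_like_html_text('Start the file with "<html>".')` is also true, so plain text that talks about HTML is misclassified.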


Update: No, that's still not good enough. A text version of this very post would fail, for example.

Replies are listed 'Best First'.
Re^2: Distinguish between HTML and Plain text
by vit (Pilgrim) on Sep 26, 2011 at 23:26 UTC
    But you can guess very reliably because HTML must have an HTML element
    I forgot to mention that the input may be just a fragment of HTML, so assuming the presence of an "<html" tag will not work.

      This is HTML:

      Please use <code>...</code> tags around your code.

      This is text:

      Please use <code>use strict;</code> in your code.

      How can one possibly correctly identify them programmatically?
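To make the ambiguity concrete, here is a small Python sketch, assuming a naive "does it contain something shaped like a tag" check (`contains_tags` is a hypothetical helper, not something any RSS client is known to use). Both strings above get the same verdict, so the check cannot tell which one was meant as HTML and which as text.

```python
import re

# Anything that looks like an opening or closing tag: <name ...> or </name>.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")

def contains_tags(s: str) -> bool:
    """Naive heuristic: True if the string contains tag-shaped markup."""
    return TAG_RE.search(s) is not None

sample_a = "Please use <code>...</code> tags around your code."
sample_b = "Please use <code>use strict;</code> in your code."

# Identical verdicts for both samples: the heuristic cannot separate them.
verdicts = (contains_tags(sample_a), contains_tags(sample_b))
```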

      PS - This is the reason Atom is better than RSS. RSS doesn't provide a means of specifying the content type, so it can't distinguish between text and HTML content. Clients have to guess. You could take a peek at how RSS clients do it, but I suspect they might work with less ambiguous content than you.
