Re: Distinguish between HTML and Plain text

in reply to Distinguish between HTML and Plain text

Impossible. At best, you can take a guess. But you can guess very reliably because HTML must have an HTML element.

If you don't know whether it's text or HTML, then you're surely dealing with bytes, so besides ASCII-compatible encodings you need to handle UTF-16le, UTF-16be, UCS-2le, UCS-2be, UCS-4le and UCS-4be:

/<HTML|<\0H\0T\0M\0L|<\0\0\0H\0\0\0T\0\0\0M\0\0\0L/i
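A minimal sketch of how the byte-level pattern might be used. The helper name bytes_look_like_html is made up for illustration, and the /i flag is there so that lowercase "<html" also matches. Encode's UTF-32 encodings stand in for UCS-4 here, and the sample document is an assumption.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: true if the raw bytes contain "<html" in an
# ASCII-compatible, UTF-16/UCS-2 or UCS-4 encoding, in either byte order.
sub bytes_look_like_html {
    my ($bytes) = @_;
    return $bytes =~ /<HTML|<\0H\0T\0M\0L|<\0\0\0H\0\0\0T\0\0\0M\0\0\0L/i ? 1 : 0;
}

# Encode the same sample document several ways and run the check on the bytes.
for my $enc (qw(UTF-8 UTF-16LE UTF-16BE UTF-32LE UTF-32BE)) {
    my $bytes = encode($enc, "<html><body>hi</body></html>");
    printf "%-9s %s\n", $enc, bytes_look_like_html($bytes) ? "html" : "text";
}
```

The big-endian forms match as well because the pattern is not anchored: the NUL bytes that precede each character in those encodings simply fall before the start of the match.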

If you're somehow dealing with decoded text:

/<HTML/i

Update: No, that's still not good enough. A text version of this very post would fail, for example.
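To make that failure concrete, here is a small sketch. The sample sentence is a made-up stand-in for a plain-text message that merely talks about HTML, and the /i flag is an assumption so that either case matches.

```perl
use strict;
use warnings;

# Hypothetical plain-text message that mentions the pattern itself.
my $plain = 'You can guess by looking for /<HTML/ in the input.';

# Apply the decoded-text check to it.
my $guess = $plain =~ /<HTML/i ? 'html' : 'text';

# The check reports "html" even though the message is plain text.
print "$guess\n";
```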


Re^2: Distinguish between HTML and Plain text by vit (Friar) on Sep 26, 2011 at 23:26 UTC
"But you can guess very reliably because HTML must have an HTML element"

I forgot to mention that the HTML entered may be just a fragment of a document, so assuming the presence of an "<html" tag will not work.
Re^3: Distinguish between HTML and Plain text by ikegami (Patriarch) on Sep 26, 2011 at 23:36 UTC
This is HTML:

`Please use <code>...</code> tags around your code.`

This is text:

`Please use <code>use strict;</code> in your code.`

How can one possibly identify them correctly programmatically?

PS - This is the reason Atom is better than RSS. RSS doesn't provide a means of specifying the content type, so it can't distinguish between text and HTML content. Clients have to guess. You could take a peek at how RSS clients do it, but I suspect they might work with less ambiguous content than you.
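A rough sketch of why guessing breaks down on fragments. The guess_type helper and its tag list are invented for illustration and are not how any particular RSS client works. It gives the same answer for both of the strings above, so whichever way it guesses, it is wrong for one of them.

```perl
use strict;
use warnings;

# Hypothetical heuristic: guess "html" if the string contains anything
# shaped like a known tag. The tag list is an arbitrary illustration.
sub guess_type {
    my ($s) = @_;
    return $s =~ m{</?(?:code|p|b|i|a|br|div|span)\b[^>]*>}i ? 'html' : 'text';
}

# The two samples from the post, named after how the post labels them.
my $labelled_html = 'Please use <code>...</code> tags around your code.';
my $labelled_text = 'Please use <code>use strict;</code> in your code.';

# Both lines print the same guess, so the heuristic cannot tell them apart.
print guess_type($labelled_html), "\n";
print guess_type($labelled_text), "\n";
```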
