PerlMonks  

Re: Distinguish between HTML and Plain text

by ikegami (Pope)
on Sep 26, 2011 at 23:11 UTC (#927973)


in reply to Distinguish between HTML and Plain text

Impossible. At best, you can take a guess. But you can guess very reliably because HTML must have an HTML element.

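If you don't know whether it's text or HTML, then you're surely dealing with bytes, so you need to handle UTF-16le, UTF-16be, UCS-2le, UCS-2be, UCS-4le and UCS-4be as well as single-byte encodings. Each wide alternative below matches both byte orders, because it starts at the "<" byte and not at its NUL padding: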

/<HTML|<\0H\0T\0M\0L|<\0\0\0H\0\0\0T\0\0\0M\0\0\0L/i
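A minimal sketch of how that pattern might be wrapped in a sub. The name looks_like_html is made up for illustration, and matching is case-insensitive on the assumption that "<html" is at least as common as "<HTML". It is still only a guess about the input, not a proof.

```perl
use strict;
use warnings;

# Hypothetical helper: guess whether a byte string contains an "<html" tag.
# The three alternatives cover 1-byte, 2-byte and 4-byte code units.
# Each wide alternative starts at the "<" byte, so it matches the
# little-endian and the big-endian layout alike.
sub looks_like_html {
    my ($bytes) = @_;
    return $bytes =~ /<HTML|<\0H\0T\0M\0L|<\0\0\0H\0\0\0T\0\0\0M\0\0\0L/i ? 1 : 0;
}
```

For example, with the core Encode module, looks_like_html(encode('UTF-16LE', $doc)) and looks_like_html(encode('UTF-16BE', $doc)) are both true when $doc holds '<html><body>hi</body></html>'.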

If you're somehow dealing with decoded text:

/<HTML/i

Update: No, that's still not good enough. A text version of this very post would fail, for example.


Re^2: Distinguish between HTML and Plain text
by vit (Pilgrim) on Sep 26, 2011 at 23:26 UTC
    But you can guess very reliably because HTML must have an HTML element
    I forgot to mention that the input may be only a fragment of HTML, so assuming the presence of an "<html" tag will not work.

      This is HTML:

      Please use <code>...</code> tags around your code.

      This is text:

      Please use <code>use strict;</code> in your code.

      How can one possibly identify them correctly programmatically?
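To make the ambiguity concrete, here is a small sketch. The sub has_tag_like_token is a hypothetical heuristic invented for this illustration, not something any real client is known to use. It flags anything that contains a tag-shaped token, and it cannot tell the two strings above apart.

```perl
use strict;
use warnings;

# Hypothetical heuristic: treat anything containing a tag-like token as HTML.
sub has_tag_like_token {
    my ($input) = @_;
    return $input =~ /<\/?[A-Za-z][^<>]*>/ ? 1 : 0;
}

# The two example strings from this post.
my $first  = 'Please use <code>...</code> tags around your code.';
my $second = 'Please use <code>use strict;</code> in your code.';

# Both contain <code>...</code>, so the heuristic answers the same for each.
printf "%d %d\n", has_tag_like_token($first), has_tag_like_token($second);
```

Any check that looks only at the characters will have the same problem, because the difference between the two strings lies in what the author meant and not in the bytes.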

      PS - This is the reason Atom is better than RSS. RSS doesn't provide a means of specifying the content type, so it can't distinguish between text and HTML content. Clients have to guess. You could take a peek at how RSS clients do it, but I suspect they might work with less ambiguous content than you.
