PerlMonks  

Distinguish between HTML and Plain text

by vit (Pilgrim)
on Sep 26, 2011 at 22:57 UTC (#927970=perlquestion)
vit has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I know there are many ways to handle this, but I would prefer to use a popular and robust solution if one exists.
I simply need to determine whether the entered text is HTML or plain text.

Re: Distinguish between HTML and Plain text
by ikegami (Pope) on Sep 26, 2011 at 23:11 UTC

    Impossible. At best, you can take a guess. But you can guess very reliably because HTML must have an HTML element.

    If you don't know if it's text or HTML, then you're surely dealing with bytes, so you need to handle UTF-16le, UTF-16be, UCS-2le, UCS-2be, UCS-4le, UCS-4be:

    /<HTML|<\0H\0T\0M\0L|<\0\0\0H\0\0\0T\0\0\0M\0\0\0L/

    If you're somehow dealing with decoded text:

    /<HTML/

    Update: No, that's still not good enough. A text version of this very post would fail, for example.
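
    A minimal sketch of the byte-level check above, run against a few encodings of the same page. The helper name looks_like_html_document and the /i flag are assumptions added for illustration (pages often write the tag in lowercase); the pattern itself is the one from the post. The last line shows the false positive described in the update.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not from the thread): guess whether a byte string
# contains an "<html" start tag encoded with 1, 2 or 4 bytes per character.
# The big-endian forms match too, because the pattern can start at the "<" byte.
sub looks_like_html_document {
    my ($bytes) = @_;
    return $bytes =~ /<HTML|<\0H\0T\0M\0L|<\0\0\0H\0\0\0T\0\0\0M\0\0\0L/i ? 1 : 0;
}

my $page = "<html><body>Hello</body></html>";
for my $encoding (qw(UTF-8 UTF-16LE UTF-16BE UTF-32LE UTF-32BE)) {
    my $bytes = encode($encoding, $page);
    printf "%-8s %s\n", $encoding,
        looks_like_html_document($bytes) ? "probably HTML" : "probably text";
}

# Plain text that merely mentions the tag is still reported as HTML.
print looks_like_html_document('Search the input for "<html"')
    ? "probably HTML\n" : "probably text\n";
```

    This remains a guess: it only finds complete documents, and any text that quotes the tag is misclassified.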

      But you can guess very reliably because HTML must have an HTML element
      I forgot to mention that the entered HTML may be just a fragment, so assuming the presence of an "<html" tag will not work.

        This is HTML:

        Please use <code>...</code> tags around your code.

        This is text:

        Please use <code>use strict;</code> in your code.

        How can one possibly correctly identify them programmatically?

        PS - This is the reason Atom is better than RSS. RSS doesn't provide a means of specifying the content type, so it can't distinguish between text and HTML content. Clients have to guess. You could take a peek at how RSS clients do it, but I suspect they might work with less ambiguous content than you.

Re: Distinguish between HTML and Plain text
by JavaFan (Canon) on Sep 27, 2011 at 01:10 UTC
    You cannot. Remember that the content of P elements can consist of just PCDATA. Which can just be "plain text". And even if you have a piece of data that validates against an HTML DTD, you still cannot know whether the author intended it as HTML, or as plain text.

    If you need to know, you either have to use some heuristics (for instance, it "validates", either against a DTD or the more usual "my browser doesn't barf on it"), or ask the user.
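
    One possible heuristic of the kind mentioned above, sketched with core Perl only. It is not a validator: the function name guess_content_type, the tag list, and the "one hit is enough" threshold are all assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical, deliberately short list of element names; extend as needed.
my %KNOWN_TAG = map { $_ => 1 } qw(
    a b body br code div em h1 h2 h3 head html i img li ol p pre span
    strong table td th title tr ul
);

# Guess "html" when the input contains at least one tag with a known
# element name, or a character reference such as &amp; or &#169;.
sub guess_content_type {
    my ($text) = @_;
    my $hits = 0;
    while ($text =~ m{</?([A-Za-z][A-Za-z0-9]*)\b[^<>]*>}g) {
        $hits++ if $KNOWN_TAG{ lc $1 };
    }
    $hits++ while $text =~ /&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);/g;
    return $hits ? 'html' : 'text';
}

print guess_content_type('<TITLE>Page 7</TITLE>'), "\n";
print guess_content_type('if ($a < $b and $c > $d) { ... }'), "\n";
print guess_content_type('Please use <code>use strict;</code> in your code.'), "\n";
```

    The last call reports "html" even though that sentence was meant as plain text earlier in the thread, which is exactly the ambiguity a heuristic cannot resolve. When the answer matters, asking the user stays the reliable option.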

Re: Distinguish between HTML and Plain text
by Khen1950fx (Canon) on Sep 27, 2011 at 02:24 UTC
    It can be done. Is it precise? You be the judge:

    #!/usr/bin/perl -l
    use strict;
    use warnings;
    use Text::FromAny;

    my $log = '/root/Desktop/text.log';
    open STDOUT, '>', $log;
    my $entries = "<TITLE>Page 7</TITLE>";
    print $entries;
    my $tFromAny = Text::FromAny->new(file => $log);
    print $tFromAny->detectedType;
    close STDOUT;

    Instead of thinking "can't", think "don't do that". It works, but it's not best practice.
