monkdiscuss
tye
<h4>History</h4>
<p>
For a long time, we've wanted to catch
mis-nested HTML tags so that mistakes (or
malice) in one node can't interfere with the
display of elements that appear after the
node.
</p><readmore><p>
The first attempt was to move from our own
hand-rolled regex to a standard CPAN module
that could parse and clean up HTML. This was
pretty well shot down by being 10 times slower than our regex (plus not being as well suited
for dealing with hand-written HTML).
</p><p>
Later, libXML showed up and looked promising.
Unfortunately, testing showed that, although it can be configured to be forgiving of broken
HTML and try to correct it, even in that mode
it is quite easy to get it to [die] (with
mis-nested HTML).
</p><p>
After a few years of thinking about the
problem very infrequently, the subject was
brought up again and I suddenly felt that I
might have a handle on how to do a pretty good job of solving the problem rather simply.
</p><p>
I threw some code I came up with (but didn't
even try to compile) on [pad://tye] and knew
I'd come back to it at some point in the
distant future.
</p><p>
The other day, I couldn't access the code I
wanted to work on so I entertained myself with this old code instead.
</p></readmore>
<h4>Now testing</h4>
<p>
After several rounds of testing, moving closer and closer to the PerlMonks "production"
environment, I've now made the code available
to be used on PerlMonks (as of Monday
afternoon, California time).
</p><p>
I encourage you to go to <del>user</del> [display settings] and
turn on the 'enforce proper nesting of HTML'
option. This option will go away (becoming
mandatory) when the feature has been tested
enough.
</p><p>
In addition, you can append ;htmlnest=1 to
any PerlMonks URL to enable the feature
temporarily. Note that doing so (if you
haven't also enabled the option in <del>user</del> [display settings]) will make visible the
earlier, recently added (ugly) hack to prevent unclosed tags
in the chatterbox from running amok. This
side-effect is partly due to laziness and
partly to provide an easy-to-find example
where you can see the feature have an effect.
</p><p>
If you find a problem, reply in this thread.
</p><p>
If you have HTML nesting enforced (by either
of the above methods), then you can add
;htmlerror=1 to a PerlMonks URL to have missing closing tags displayed in grey (with span
class="htmlerror"). This will probably be
enabled when previewing (with an option to turn it off after the first preview).
</p>
<h4>Details</h4>
<p>
The proper nesting of HTML tags is enforced
via the following rules and the exceptions
that follow them.
</p><readmore><p>
The HTML you type in is scanned from beginning to end. When an opening tag (or 'empty' tag
like BR, HR, or IMG) is encountered, if it
isn't on the list of approved tags
([id://29281] or [id://243116]), then it is
encoded into HTML entities so that it will get displayed literally (this part isn't new --
see [More HTML escaping]).
</p><p>
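As a rough illustration of that first step (this is not the actual PerlMonks code, just a small Python sketch), escaping unapproved opening tags could look like the following. The <tt>APPROVED</tt> set is a made-up subset of the real lists linked above, and <tt>escape_unapproved</tt> is a name I picked for the example.
</p>

```python
import html
import re

# Hypothetical subset of the approved tags; the real lists are in the linked nodes.
APPROVED = {"p", "b", "i", "br", "hr", "a", "ul", "li"}

# An opening or 'empty' tag: "<", a tag name, then anything up to the next ">".
# Closing tags ("</...>") are deliberately not matched by this pattern.
OPEN_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*>")


def escape_unapproved(text):
    """Entity-encode every opening/empty tag whose name is not approved."""
    def fix(match):
        if match.group(1).lower() in APPROVED:
            return match.group(0)          # approved: leave it alone for now
        return html.escape(match.group(0), quote=False)
    return OPEN_TAG.sub(fix, text)


print(escape_unapproved("<b>hi</b> <script>x</script>"))
# -> <b>hi</b> &lt;script&gt;x</script>
```

<p>
Note that the stray closing tag in that output is left alone by this sketch; closing tags are dealt with by the nesting rules further down.
</p><p>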
If the opening/empty tag is approved, then any attributes present are filtered: unapproved
attributes are silently thrown away, unquoted
attribute values have quotes added, any square brackets are converted to HTML entities, a
trailing " /" is added (if missing) for empty
tags, spacing is normalized, duplicate
attributes are removed, and the tag name and
attribute names are all converted to lowercase. Note that, regardless of any HTML standards,
PerlMonks does not let you include a literal
&lt; nor &gt; inside of HTML tags. (Most of this isn't new.)
</p><p>
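To make the attribute clean-up more concrete, here is a small Python sketch of it (again, not the real PerlMonks code). <tt>APPROVED_ATTRS</tt> is an invented table, <tt>normalize_tag</tt> is a name chosen for the example, and keeping the <i>first</i> of two duplicate attributes is my assumption, since the rules above only say duplicates are removed.
</p>

```python
import re

# Hypothetical approved attributes per tag; not the real PerlMonks lists.
APPROVED_ATTRS = {"a": {"href", "title"}, "img": {"src", "alt"}, "br": set()}
EMPTY_TAGS = {"br", "hr", "img"}

# An attribute name, optionally followed by "=" and a double-quoted,
# single-quoted, or unquoted value.
ATTR = re.compile(r"""([A-Za-z][\w-]*)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+))?""")


def normalize_tag(tag):
    """Normalize one already-approved opening/empty tag, e.g. '<A HREF=x>'."""
    inner = tag.strip()[1:-1].strip()
    parts = inner.split(None, 1)
    name = parts[0].rstrip("/").lower()
    rest = parts[1] if len(parts) > 1 else ""
    if name in EMPTY_TAGS:
        rest = rest.rstrip().rstrip("/")       # drop an existing trailing slash
    out, seen = [name], set()
    for m in ATTR.finditer(rest):
        attr = m.group(1).lower()
        if attr in seen or attr not in APPROVED_ATTRS.get(name, set()):
            continue                           # silently drop duplicates/unapproved
        seen.add(attr)
        value = m.group(2)
        if value is None:                      # attribute given without a value
            out.append(attr)
            continue
        if value[0] in "\"'":
            value = value[1:-1]                # strip the original quotes
        value = (value.replace("[", "&#91;").replace("]", "&#93;")
                      .replace('"', "&quot;"))
        out.append(f'{attr}="{value}"')
    result = " ".join(out)                     # single spaces between parts
    if name in EMPTY_TAGS:
        result += " /"
    return f"<{result}>"


print(normalize_tag('<IMG SRC=foo.png ALT="a [b]" onclick=evil() src=other>'))
# -> <img src="foo.png" alt="a &#91;b&#93;" />
```

<p>
The sketch handles one tag at a time and assumes the tag name has already passed the approved-tag check.
</p><p>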
Opening (non-empty) tags are tracked to ensure they get closed in the reverse of the order in which they were opened.
</p><p>
When a closing tag is found, if that tag has
never been opened, then the tag is converted
to HTML entities so that it will appear
literally. If it is not a block-level tag and was opened in a previous block (not in the
current block) then it is also escaped so it
will appear literally (a misplaced
non-block-level closing tag won't force any
blocks to be closed).
</p><p>
[ Block-level (or block-like) tags (versus in-line or character-level tags) are defined by the HTML standard. For PerlMonks HTML
filtering, the block-level tags are: H1..H6,
DL, UL, OL, PRE, P, DIV, BLOCKQUOTE, FORM,
and TABLE. ]
</p><p>
Otherwise, the closing tag is kept but is
preceded by whatever closing tags are needed
to close any tags that were opened after this
one.
</p><p>
A few tags are designated as non-nesting. If
you open one of these tags twice inside the
same block, then instead of nesting, the first
tag is closed (along with any nested tags)
before the second tag is opened. For PerlMonks HTML filtering, the non-nesting tags are: LI,
TR, TH, TD, and P. Note that you can nest
these tags by enclosing the inner one inside
of a block tag.
</p><p>
When we reach the end of your typing, we close any tags you left open.
</p><p>
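The nesting rules so far (tracking open tags, closing tags, non-nesting tags, and closing whatever is left at the end) can be sketched in Python roughly as below. This is only my illustration of the rules, not the code running on PerlMonks. It assumes the input has already been through the approved-tag filtering, treats "the current block" as everything opened since the innermost open block-level tag, and treats "never been opened" as "not currently open". The function names are invented for the example.
</p>

```python
import html
import re

BLOCK = {"h1", "h2", "h3", "h4", "h5", "h6", "dl", "ul", "ol", "pre",
         "p", "div", "blockquote", "form", "table"}
NON_NESTING = {"li", "tr", "th", "td", "p"}
EMPTY = {"br", "hr", "img"}
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>")


def find_open(stack, name, whole_stack):
    """Index of the innermost open `name`, or None if it is not found.

    Unless whole_stack is true, the search stops at the innermost open
    block-level tag, so only the current block is examined.
    """
    for i in range(len(stack) - 1, -1, -1):
        if stack[i] == name:
            return i
        if not whole_stack and stack[i] in BLOCK:
            return None
    return None


def close_down_to(stack, index, out):
    """Pop stack[index:] and emit the closing tags, innermost first."""
    while len(stack) > index:
        out.append(f"</{stack.pop()}>")


def enforce_nesting(text):
    out, stack, pos = [], [], 0
    for m in TAG.finditer(text):
        out.append(text[pos:m.start()])        # plain text before this tag
        pos = m.end()
        closing, name = m.group(1) == "/", m.group(2).lower()
        if closing:
            # Block-level closers may reach outside the current block.
            i = find_open(stack, name, whole_stack=name in BLOCK)
            if i is None:
                out.append(html.escape(m.group(0), quote=False))
            else:
                close_down_to(stack, i, out)   # later tags first, then this one
            continue
        if name in NON_NESTING:
            i = find_open(stack, name, whole_stack=False)
            if i is not None:
                close_down_to(stack, i, out)   # close the earlier sibling
        out.append(m.group(0))
        if name not in EMPTY:
            stack.append(name)
    out.append(text[pos:])
    close_down_to(stack, 0, out)               # end of input: close leftovers
    return "".join(out)


print(enforce_nesting("<b><i>x</b>y"))         # -> <b><i>x</i></b>y
print(enforce_nesting("<ul><li>a<li>b</ul>"))  # -> <ul><li>a</li><li>b</li></ul>
print(enforce_nesting("<b>x<p>y</b>z"))        # -> <b>x<p>y&lt;/b&gt;z</p></b>
```

<p>
The last example shows the "previous block" rule: the B was opened before the P, so the closing B inside the P is escaped rather than being allowed to force the P closed.
</p><p>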
Any closing tags that had to be inserted will
also be displayed if ;htmlerror=1 was present
in the PerlMonks URL.
</p><p>
One other way that PerlMonks intentionally
departs from standard HTML is how it handles
comments. PerlMonks HTML comments simply
start with &lt;!-- and simply end with --&gt;.
Any occurrences of "--" inside the comment
get changed to "- -" so that the result is
always a standards-compliant HTML comment.
Using an HTML comment like
<code><! -- foo -- ></code> will cause the &lt; to be displayed
literally, since it isn't part of a PerlMonks
approved HTML tag.
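</p><p>
The comment rule is simple enough to sketch in a few lines of Python (an illustration only, with function names invented for the example, not the real implementation):
</p>

```python
import re

# A PerlMonks-style comment: "<!--", anything, then the first "-->".
COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)


def clean_comment_body(body):
    """Break up every "--" inside a comment body into "- -"."""
    # Repeat until none remain, so that "---" ends up as "- - -".
    while "--" in body:
        body = body.replace("--", "- -")
    return body


def normalize_comments(text):
    """Rewrite each comment so its body contains no "--".

    A body that ends in a single "-" is left as-is by this sketch.
    """
    return COMMENT.sub(
        lambda m: "<!--" + clean_comment_body(m.group(1)) + "-->", text)


print(normalize_comments("a <!-- x -- y --> b"))
# -> a <!-- x - - y --> b
```

<p>
The non-greedy match means a comment ends at the first --&gt; after it starts, which is what "simply end with" amounts to here.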
</p></readmore><p>
I'll include some examples in a reply (inside
a READMORE so they won't be obnoxious to monks
who don't have 'htmlnest' enabled).
</p>
<div class="pmsig"><div class="pmsig-22609"><p align="right">
- [tye]<tt> </tt>
</p></div></div>