For a long time, we've wanted to catch mis-nested HTML tags so that mistakes (or malice) in one node can't interfere with the display of elements that appear after the node.

The first attempt was to move from our own hand-rolled regex to a standard CPAN module that could parse and clean up HTML. That idea was shot down when the module turned out to be 10 times slower than our regex (and not as well suited to dealing with hand-written HTML).

Later, libXML showed up and looked promising. Unfortunately, testing showed that, although it can be configured to be forgiving of broken HTML and try to correct it, even in that mode it is quite easy to get it to die (with mis-nested HTML).

After a few years of thinking about the problem very infrequently, the subject was brought up again and I suddenly felt that I might have a handle on how to do a pretty good job of solving the problem rather simply.

I threw some code I came up with (but didn't even try to compile) on tye's scratchpad and knew I'd come back to it at some point in the distant future.

The other day, I couldn't access the code I wanted to work on so I entertained myself with this old code instead.

Now testing

After several rounds of testing, moving closer and closer to the PerlMonks' "production" environment, I've now made the code available to be used on PerlMonks (as of Monday afternoon, California time).

I encourage you to go to user display settings and turn on the 'enforce proper nesting of HTML' option. This option will go away (becoming mandatory) when the feature has been tested enough.

In addition, you can append ;htmlnest=1 to any PerlMonks URL to enable the feature temporarily. Note that doing so (if you haven't also enabled the option in user display settings) will make visible the recent (ugly) hack that prevents unclosed tags in the chatterbox from running amok. This side effect is partly due to laziness and partly there to provide an easy-to-find example where you can see the feature have an effect.

If you find a problem, reply in this thread.

If you have HTML nesting enforced (by either of the above methods), then you can add ;htmlerror=1 to a PerlMonks URL to have missing closing tags displayed in grey (with span class="htmlerror"). This will probably be enabled when previewing (with an option to turn it off after the first preview).


The proper nesting of HTML tags is enforced via the following rules and the exceptions that follow them.

The HTML you type in is scanned from beginning to end. When an opening tag (or 'empty' tag like BR, HR, or IMG) is encountered, if it isn't on the list of approved tags (Perl Monks Approved HTML tags or PerlMonks Approved Chatter HTML Tags), then it is encoded into HTML entities so that it will get displayed literally (this part isn't new -- see More HTML escaping).
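A minimal sketch of the "escape what isn't approved" step, written in Python rather than the site's Perl. The APPROVED set, the regex, and the function name are illustrative assumptions; the real lists are the approved-tag settings named above. Closing tags are deliberately left alone here, since they are dealt with by the nesting rules described further down.

```python
import re
from html import escape

# Hypothetical subset of the approved list, for illustration only.
APPROVED = {"b", "i", "p", "br", "code"}

# An opening or empty tag: "<", a tag name, then anything up to ">".
OPEN_TAG_RE = re.compile(r"<([A-Za-z][A-Za-z0-9]*)([^<>]*)>")

def escape_unapproved(html_text):
    """Turn unapproved opening/empty tags into entities so they show literally."""
    def replace(match):
        if match.group(1).lower() in APPROVED:
            return match.group(0)                   # approved: keep as typed
        return escape(match.group(0), quote=False)  # "<" and ">" become entities
    return OPEN_TAG_RE.sub(replace, html_text)

print(escape_unapproved("<b>hi</b> <script>x"))
```

With the hypothetical list above, the approved B tag is kept and the script tag is displayed literally.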

If the opening/empty tag is approved, then any attributes present are filtered: unapproved attributes are silently thrown away, unquoted attribute values have quotes added, any square brackets are converted to HTML entities, a trailing " /" is added (if missing) for empty tags, spacing is normalized, duplicate attributes are removed, and the tag name and attribute names are all converted to lowercase. Note that, regardless of any HTML standards, PerlMonks does not let you include a literal < nor > inside of HTML tags. (Most of this isn't new.)
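The attribute clean-up can be sketched roughly as follows (Python, not the site's Perl). The attribute whitelist, the choice to keep the first of two duplicate attributes, and the use of numeric entities for square brackets are all assumptions made for the example; the description above doesn't specify them.

```python
import re

# Hypothetical whitelist; the real per-tag lists live in the site's settings.
APPROVED_ATTRS = {"img": {"src", "alt"}, "a": {"href", "name"}, "br": set()}
EMPTY_TAGS = {"br", "hr", "img"}

# name, optionally followed by = and a quoted or unquoted value
ATTR_RE = re.compile(r"""([A-Za-z][\w-]*)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s"']+))?""")

def normalize_tag(name, attr_text):
    """Rebuild an approved opening/empty tag in normalized form."""
    name = name.lower()
    allowed = APPROVED_ATTRS.get(name, set())
    seen = set()
    parts = [name]
    for match in ATTR_RE.finditer(attr_text):
        attr, value = match.group(1).lower(), match.group(2)
        if attr not in allowed or attr in seen:
            continue                      # drop unapproved and duplicate attributes
        seen.add(attr)
        if value is None:
            parts.append(attr)            # valueless attribute, kept bare
            continue
        if value[0] in "\"'":
            value = value[1:-1]           # strip the original quotes
        value = (value.replace('"', "&quot;")
                      .replace("[", "&#91;")
                      .replace("]", "&#93;"))
        parts.append(f'{attr}="{value}"')
    tag = "<" + " ".join(parts)           # single spaces: spacing is normalized
    if name in EMPTY_TAGS:
        tag += " /"
    return tag + ">"

print(normalize_tag("IMG", ' SRC=foo.png  Alt="a[1]" onclick=evil() src=other.png'))
```

For that input, the sketch keeps the first src, quotes it, converts the brackets in alt, drops onclick, and adds the trailing " /".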

Opening (non-empty) tags are tracked to ensure they get closed in the reverse order.

When a closing tag is found, if that tag has never been opened, then the tag is converted to HTML entities so that it will appear literally. If it is not a block-level tag and was opened in a previous block (not in the current block) then it is also escaped so it will appear literally (a misplaced non-block-level closing tag won't force any blocks to be closed).

[ Block-level (or block-like) tags (versus in-line or character-level tags) are defined by the HTML standard. For PerlMonks HTML filtering, the block-level tags are: H1..H6, DL, UL, OL, PRE, P, DIV, BLOCKQUOTE, FORM, and TABLE. ]

Otherwise, the closing tag is kept but is preceded by whatever closing tags are needed to close any tags that were opened after this one.

A few tags are designated as non-nesting. If you open one of these tags twice inside the same block, then instead of nesting, the first tag is closed (along with any nested tags) before the second tag is opened. For PerlMonks HTML filtering, the non-nesting tags are: LI, TR, TH, TD, and P. Note that you can nest these tags by enclosing the inner one inside of a block tag.

When we reach the end of your typing, we close any tags you left open.

Any closing tags that had to be inserted will also be displayed if ;htmlerror=1 was present in the PerlMonks URL.
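The nesting rules above can be sketched with a simple stack. This is a Python illustration of the rules as described, not the site's Perl code: the function names and the tag regex are assumptions, every tag is treated as approved, attributes are passed through untouched, and the ;htmlerror=1 display is left out.

```python
import re

BLOCK = {"h1", "h2", "h3", "h4", "h5", "h6", "dl", "ul", "ol", "pre",
         "p", "div", "blockquote", "form", "table"}
NON_NESTING = {"li", "tr", "th", "td", "p"}
EMPTY = {"br", "hr", "img"}
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>")

def find_open(stack, name, stop_at_blocks):
    """Index of the nearest open `name`, searching from the top of the stack.
    With stop_at_blocks, give up on reaching some other block-level tag."""
    for i in range(len(stack) - 1, -1, -1):
        if stack[i] == name:
            return i
        if stop_at_blocks and stack[i] in BLOCK:
            return None
    return None

def close_down_to(stack, index):
    """Pop stack[index:] and return the closing tags, innermost first."""
    closers = "".join(f"</{tag}>" for tag in reversed(stack[index:]))
    del stack[index:]
    return closers

def enforce_nesting(html_text):
    out, stack, pos = [], [], 0
    for m in TAG_RE.finditer(html_text):
        out.append(html_text[pos:m.start()])
        pos = m.end()
        closing, name, rest = m.group(1), m.group(2).lower(), m.group(3)
        if closing:
            # A non-block closer may not reach back past an open block.
            i = find_open(stack, name, stop_at_blocks=name not in BLOCK)
            if i is None:
                out.append(m.group(0).replace("<", "&lt;").replace(">", "&gt;"))
            else:
                out.append(close_down_to(stack, i))
        else:
            if name in NON_NESTING:
                # Second LI/TR/TH/TD/P in the same block closes the first.
                i = find_open(stack, name, stop_at_blocks=True)
                if i is not None:
                    out.append(close_down_to(stack, i))
            out.append(f"<{name}{rest}>")
            if name not in EMPTY:
                stack.append(name)
    out.append(html_text[pos:])
    out.append(close_down_to(stack, 0))   # close whatever was left open
    return "".join(out)

print(enforce_nesting("<i>a<b>b</i>c"))        # <i>a<b>b</b></i>c
print(enforce_nesting("<ul><li>a<li>b</ul>"))  # <ul><li>a</li><li>b</li></ul>
```

The design choice worth noting is that a single search routine serves both rules: a closing tag and a repeated non-nesting tag each look down the stack for a match, and a non-block tag stops looking as soon as it hits a block boundary.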

One other way that PerlMonks intentionally departs from standard HTML is how it handles comments. PerlMonks HTML comments simply start with <!-- and simply end with -->. Any occurrences of "--" inside the comment get changed to "- -" so that the result is always a standards-compliant HTML comment. Using an HTML comment like <! -- foo -- > will cause the < to be displayed literally, since it isn't part of a PerlMonks approved HTML tag.
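A minimal sketch of that comment rule, in Python rather than the site's Perl. The function names are made up for the example, and the extra space added when a comment body ends in "-" is my own assumption about how to keep the result valid; the description above doesn't cover that case.

```python
import re

# A comment starts at "<!--" and ends at the first "-->" that follows.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)

def fix_comment(match):
    # Put a space after every "-" that is directly followed by another "-".
    body = re.sub(r"-(?=-)", "- ", match.group(1))
    if body.endswith("-"):
        body += " "          # assumption: avoid "--" forming against the closer
    return "<!--" + body + "-->"

def normalize_comments(html_text):
    return COMMENT_RE.sub(fix_comment, html_text)

print(normalize_comments("<!-- a -- b -->"))   # <!-- a - - b -->
```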

I'll include some examples in a reply (inside a READMORE so they won't be obnoxious to monks who don't have 'htmlnest' enabled).

- tye        

Re: Proper nesting of HTML to be enforced
by theorbtwo (Prior) on Feb 03, 2004 at 12:58 UTC

    Additional problems thus far, found but not yet reported:

    • chatterbox faq dies. I'm currently looking into that, but nothing so far.
    • /me yadda </i>foo<i> bar. does not work. It seems to escape the first </i>, then insert a missing </i> at the end. I haven't looked into this yet, but suspect a greedy/nongreedy problem.
    • A missing tag in a context where font or span is not allowed, combined with htmlerror=1 mode, will insert the font and span tags, then escape them. I haven't looked into this, and this will likely be unfixable, or at least, this requires a deep knowledge of the regex -- I'll leave it to tye.
    • <font color=""><span class=""> is used instead of <font color="" class="">. Noted here, but not fixed, because it requires changes to the .pm file, and "might as well" be batched with the other required changes.

    Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).

      chatterbox faq dies

      Interesting. It appears that ($q->param('htmlnest'))[-1] is sometimes complaining of "Can't use an undefined value as an ARRAY reference", which is strange since that code shouldn't be trying to do what it is complaining about not being able to do (but I don't see any other candidates for the complaint near line 205). I'll look at it some more.

      Update: Fixed. The line number reported for the error was way off (probably because it was inside of a multi-line s{}{}gsex block).

      /me [broken]

      That is now fixed thanks to theorbtwo. Note that ;htmlerror=1 will show greyed </i>s on the end of any /me chatter lines unless the author included their own </i>. This is by design as otherwise you'd see a visible </i> if someone chats "/me reads </i>The Times", whether htmlerror is enabled or not.

      A missing tag in a context where font or span is not allowed, combined with htmlerror=1 mode, will insert the font and span tags, then escape them.

      Uh, no, it won't. If you've seen this happen, then point it out to me. It could happen if content gets filtered twice, but I think that was only happening for chatter and the outer filter already allowed FONT and SPAN and I think you've removed that outer filter now (thanks), so I don't think this is a problem anywhere.

      [use] <font color="" class="">

      Works for me. I'll change that the next time I touch the code.

      Update: Fixed.


      - tye        

Re: Proper nesting of HTML to be enforced
by allolex (Curate) on Feb 03, 2004 at 10:18 UTC

    Am I guessing right in thinking that this is a step toward making a (planned) future site redesign easier? I was really impressed with the redesign of Slashdot (although PM is not the sloppy hack that Slashcode seems to be), and I don't spend nearly as much time there as I do here. Maybe we could make even Abigail happy about the site design ;)

    I would be really interested to read your thoughts on this.


      Redesigning the site is always planned and often gets worked on already (and quite a few changes are already in production). But, no, this change really has nothing to do with that.

      I was thinking of creating a Quest for HTML design for the site, concentrating on how a thread with nested replies should be displayed. I mostly find PerlMonks' display of such to be easier to understand than most other forums I've looked at, but there are several specific places I think could be improved (space wasted by "[reply]" links, move votes to the bottom of nodes, more support for customizing via CSS, vertical lines to make nesting easier to trace, ...).

      ar0n started a redesign, but it was more toward just CSS issues and, as I recall, wouldn't support non-CSS browsers as well as our current layout (and the more I learn about CSS, the more I think we shouldn't degrade the presentation for non-CSS browsers just because we are making customization via CSS easier).

      - tye        

      Maybe we could make even Abigail happy about the site design ;)
      Hmmm? You mean posting in POD or plain text?


        I suppose it would be fairly easy to support posting in plain text. Maybe adding something like <code> tags, but for text. Of course it's really nice to be able to download the code bits and leave the description. I have noticed that some people just post commented code on occasion.

        Now supporting POD would be really interesting. It's been brought up before (by petdance), but no one seems to have followed up on it.


      I would think not. I haven't heard of any plans and really can't see the point. Abigail just prefers his news reader to any web interface.
Problem with italics after actions in the CB
by allolex (Curate) on Feb 03, 2004 at 10:58 UTC
    [rob_au]: Interesting ... I have enabled the closing HTML setting which tye has been playing with and it looks like the CB has got a problem with closing </i> from the /me ... the entire CB after theorbtwo's /me is in italics for me
    [allolex]: Yeah, I was just remarking on that. Annoying, innit?
    [allolex] thinks that will need to be fixed.
    [rob_au]: Nah, doesn't bother me too much
    [ambrus]: strange...
    [theorbtwo]: Reply on the thread there, and tye will look at it.


      OK, I got to it before giving him a chance. It was just a silly problem: the closing </i> was missing from all /me's because it was only being supplied when the new code was off. Fixed. The fact that the new code let the error propagate past that single line of chatter and into the rest of the chatterbox (but not the rest of the page) isn't the fault of the new code, but rather of how often we "screen" the HTML: once per rendering of the nodelet, rather than once per line.
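      To illustrate why the screening granularity matters, here is a toy Python sketch (not the site's code). close_open_tags is a hypothetical stand-in for the real filter that only appends missing closing tags, and the two chat lines are made up.

```python
import re

TAG_RE = re.compile(r"<(/?)([a-z]+)>")

def close_open_tags(html_text):
    """Toy filter: append closing tags for anything still open at the end."""
    stack = []
    for closing, name in TAG_RE.findall(html_text):
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
    return html_text + "".join(f"</{tag}>" for tag in reversed(stack))

lines = ["<i>theorbtwo waves", "[allolex]: hello"]

# Screened once for the whole nodelet: the italics run on into the next line.
per_nodelet = close_open_tags("\n".join(lines))

# Screened once per line: the damage stays inside the line that caused it.
per_line = "\n".join(close_open_tags(line) for line in lines)

print(per_nodelet)
print(per_line)
```

      In the first result the inserted </i> lands after the second speaker's line; in the second it lands at the end of the first line.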


Re: Proper nesting of HTML to be enforced (the code)
by tye (Sage) on Feb 03, 2004 at 20:49 UTC

    Here is the code that does this. $q is an Everything::CGI object <which is just like a CGI object except that escapeHTML() also turns ] and [ into entities>. Otherwise, there is very little Everything-specific code so it should be easy to adapt for use in your favorite chatterbox client.

    Updated: Bug fix applied ($block++ used to be in an elsif instead of just if).

    - tye        

Re: Proper nesting of HTML to be enforced
by ambrus (Abbot) on Feb 03, 2004 at 10:00 UTC

    Then we won't be able to type /me</i> text to the CB...

Re: Proper nesting of HTML to be enforced (bug!)
by tye (Sage) on Feb 03, 2004 at 21:52 UTC
Re: Proper nesting of HTML to be enforced
by footpad (Abbot) on Feb 03, 2004 at 18:28 UTC


    I am also very pleased to see that <BR> tags are correctly rendered now (<BR />), though that may have been done in a different tweak.

    I think this will save the janitors a lot of time. (It will save me a lot of time, as those two edits were among my most frequent clean-up jobs.)

    Now all we have to do is incorporate a good spell-checker. :-)

    Well done!

Re: Proper nesting of HTML to be enforced
by ysth (Canon) on Feb 05, 2004 at 18:47 UTC
    The qandaeditors' wiki has:
    <B>Above these lines...<p>...top.</B>
    that seems to be mishandled. It escapes the /B tag, but isn't adding a </b> before the <p>. (Assuming that the original is bad html, that's what I'm guessing it should be doing.)

      That's a design issue for the filter. The B is mis-nested, ending in a different block than it starts. Closing a non-block tag won't force a block tag to be closed. Opening a new block doesn't close non-block tags either; that is to prevent (huge?) thrash in existing 'sloppy' HTML that works in most browsers, regardless of what the standard says.

      They aren't perfect heuristics but they seem to work pretty well and are less restrictive than an HTML validator. Of course, it is quite simple and so is not nearly as DWIM as most browsers.

      You can take:

      <B>Above these lines...<p>...top.</B>
      which renders (with your current settings) as
      Above these lines...


      and gets filtered to
      <b>Above these lines...<p>...top.&lt;/B&gt;</p></b>
      and make it valid by changing it to
      <B>Above these lines...</b><p><b>...top.</B>
                             ^^^^   ^^^
      Above these lines...


      or you can just make the PM filter happy with
      <B>Above these lines...<p>...top.</p></B>
                                       ^^^^
      Above these lines...


      (which shows the whole thing in bold in my browser, though the HTML validator complains about it)

      If you wanted to see the PM filter close the B tag because of the P tag, you'd need to write

      <p><B>Above these lines...<p>...top.</B>
      ^^^

      Above these lines...


      which would get filtered into
      <p><b>Above these lines...</b></p><p>...top.&lt;/B&gt;</p>

      Above these lines...


      If you have a suggestion for a better heuristic, my ears are open. It'd be interesting to see how many existing nodes would be changed by having a new block close any non-block tags. I'd expect quite a few, but I don't have evidence of that.

      Updated several times right after creation to add more detail in hopes of making things as clear as possible.

      - tye