
Site HTML filtering, Phase II

by tye (Sage)
on Feb 11, 2004 at 03:32 UTC (#328127=monkdiscuss)

I've made several minor improvements to the HTML nesting enforcement code based on the testing done so far. Thanks go to the few who I saw "complain," especially ysth. (:

First, note that P is no longer a block tag as far as PerlMonks html nesting is concerned (because people often don't close a P tag and that shouldn't prevent earlier tags from being closed).

The other improvements mostly have to do with how corrections are displayed. Below is my attempt at documenting the htmlerror reporting levels. With luck, the SiteDocClan will improve these into a PM doc node (along with the previous announcement).

So there are two new settings in User Settings. And pretty soon the htmlnest option will go away (it will become the default behavior that can only be disabled by adding ;htmlnest=0 to a URL).

I also took the opportunity to move all of the Nodelet Settings out of the over-full User Settings and clean it up slightly.

HTML error reporting levels

Reporting levels summary

  • Level 0 shows invalid/unapproved HTML tags as plain text.
  • Level 1 also shows non-trivial closing HTML tags that had to be inserted (as grey text).
  • Level 2 also shows invalid/unapproved attributes of approved HTML tags (enclosed in the approved tag; all as grey text).
  • Level 3 also shows non-</p> closing HTML tags that were ignored (as grey text with a line drawn through it).
  • Level 4 also shows trivial closing HTML tags that were ignored or inserted (as grey text; ignored tags have a line drawn through them). Trivial tags are </p> and non-nesting tags inserted other than at the end of the HTML being filtered.
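
As a rough illustration of how the levels in the summary above build on one another, here is a minimal Python sketch. The names SHOWN_AT_LEVEL and shown_corrections are made up for this example and are not part of the PerlMonks code; the descriptions are paraphrased from the summary.

```python
# Hypothetical names for illustration; not taken from the PerlMonks source.
# Each level shows everything the lower levels show, plus one more kind of correction.
SHOWN_AT_LEVEL = {
    1: "non-trivial closing tags that were inserted",
    2: "unapproved attributes of approved tags",
    3: "ignored closing tags other than </p>",
    4: "trivial closing tags that were ignored or inserted",
}


def shown_corrections(level):
    """Return the kinds of corrections made visible at an htmlerror level."""
    if not 0 <= level <= 4:
        raise ValueError("htmlerror level must be between 0 and 4")
    # Unapproved tags are shown as plain text at every level, including 0.
    shown = ["unapproved tags as plain text"]
    shown += [SHOWN_AT_LEVEL[n] for n in range(1, level + 1)]
    return shown
```

Calling shown_corrections(2), for instance, lists the level 0, 1, and 2 items in that order.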

Reporting levels details

No matter what htmlerror reporting is set to, any unrecognized, invalid, or unapproved tags simply have their opening < changed to &lt; so that the tag becomes visible as text (see the More HTML escaping announcement).
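
The escaping step described above can be sketched in a few lines of Python. This is only an illustration: APPROVED_TAGS is a made-up set (the real list of approved tags is not given here), and the regular expression only looks at the start of each tag.

```python
import re

# Assumed set for illustration only; the real PerlMonks list is not shown in this node.
APPROVED_TAGS = {"p", "b", "i", "code", "img", "hr"}

# Matches the start of an opening or closing tag and captures the tag name.
TAG_START = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)")


def escape_unapproved(html):
    """Change the leading '<' of any unapproved tag to '&lt;' so it shows as text."""
    def fix(match):
        if match.group(1).lower() in APPROVED_TAGS:
            return match.group(0)             # approved: leave untouched
        return "&lt;" + match.group(0)[1:]    # unapproved: escape only the '<'
    return TAG_START.sub(fix, html)
```

With the assumed set, escape_unapproved("<b>hi</b><blink>x</blink>") leaves the B tags alone and turns the BLINK tags into visible text.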

When you have htmlerror reporting set to 4 (the maximum), nearly all other corrections made to the HTML will also be made visible, but in a grey font:

  • Any approved closing tags that get ignored in order to enforce proper nesting of tags will be made visible inside of <font color="#808080" class="htmlignored"> tags.

    The default PerlMonks CSS includes font.htmlignored { text-decoration: line-through; } which means the grey text will have a line drawn through it as if <strike> tags had been used (unless your browser does not support CSS). The strike-out allows you to distinguish them from inserted closing tags (and you can use CSS to customize their appearance).

  • Any closing tags that have to be inserted in order to enforce proper nesting of tags will also be displayed, but inside of <font color="#808080" class="htmlinserted"> tags.
  • Any unrecognized, invalid, or unapproved attributes in an approved tag will be displayed inside angle brackets with the tag name. All of this will be inside of <font color="#808080" class="htmlattrib"> tags.

    For example, if IMG is an approved tag with approved attributes of ALT, HEIGHT, and WIDTH, then HTML of <Img ALT=purdy align="top" oops> will be changed to <img alt='purdy' /><font color="#808080" class="htmlattrib">&lt;img align="top" oops></font> so you'll see "<img align="top" oops>" displayed after the image. These are the only non-closing tags that will be displayed in grey.
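
To make the IMG example concrete, here is a small Python sketch that splits a tag's attributes into approved and unapproved ones and builds the same output as above. It is not the site's implementation: the function name, the regular expressions, and the APPROVED_ATTRS table are assumptions, and it only handles empty tags such as IMG.

```python
import re

# Taken from the example in the text: IMG allows ALT, HEIGHT, and WIDTH.
APPROVED_ATTRS = {"img": {"alt", "height", "width"}}

# An attribute name, optionally followed by a quoted or bare value.
ATTR_RE = re.compile(r"""([A-Za-z][-A-Za-z0-9]*)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?""")


def filter_img_like(tag_text):
    """Keep approved attributes; echo the rest in grey after the tag."""
    m = re.fullmatch(r"<([A-Za-z]+)\s*(.*?)\s*/?>", tag_text, re.S)
    if m is None:
        raise ValueError("not a single opening tag: %r" % tag_text)
    name = m.group(1).lower()
    allowed = APPROVED_ATTRS.get(name, set())
    kept, rejected = [], []
    for attr in ATTR_RE.finditer(m.group(2)):
        attr_name, value = attr.group(1).lower(), attr.group(2)
        if attr_name in allowed and value is not None:
            if value[0] in "\"'":
                value = value[1:-1]           # drop the original quotes
            kept.append("%s='%s'" % (attr_name, value))
        else:
            rejected.append(attr.group(0))    # keep the author's original text
    clean = "<%s %s />" % (name, " ".join(kept)) if kept else "<%s />" % name
    if not rejected:
        return clean
    return clean + '<font color="#808080" class="htmlattrib">&lt;%s %s></font>' % (
        name, " ".join(rejected))
```

Feeding it <Img ALT=purdy align="top" oops> gives the cleaned tag followed by the grey htmlattrib text shown in the example.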

This level (4) of htmlerror reporting is rather obnoxious and is reserved for when you are composing your own nodes or temporarily request it by adding ;htmlerror=4 to a PerlMonks URL.

If you lower the htmlerror reporting level to 3, then inserted and ignored </p> tags are not displayed. Neither are closing tags that were inserted to close a non-nesting tag other than at the end of the HTML being filtered.

For example, if you have an HTML table that is missing all of its </tr>, </th>, and </td> tags, then these will be inserted but not be displayed (as long as the </table> is not missing).
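
The following Python sketch shows one way the trivial/non-trivial distinction could be modelled. It is a simplification under stated assumptions, not the actual filter: NON_NESTING and VOID are guessed sets, the function names are invented, and the sketch only records which closing tags would be inserted or ignored (it does not rewrite the HTML or place implied closing tags where a real filter would).

```python
import re

NON_NESTING = {"tr", "th", "td", "li"}   # assumed set of non-nesting tags
VOID = {"br", "hr", "img"}               # assumed set of tags with no closing tag
TOKEN_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")


def find_corrections(html):
    """Return (kind, tag, trivial) tuples for inserted and ignored closing tags."""
    stack, corrections = [], []
    for m in TOKEN_RE.finditer(html):
        name = m.group(1).lower()
        if not m.group(0).startswith("</"):
            if name not in VOID:
                stack.append(name)
        elif name in stack:
            # Close everything opened since the matching opening tag.
            while True:
                top = stack.pop()
                if top == name:
                    break
                corrections.append(("inserted", top, top == "p" or top in NON_NESTING))
        else:
            corrections.append(("ignored", name, name == "p"))
    # Tags still open at the end: only </p> counts as trivial here.
    while stack:
        top = stack.pop()
        corrections.append(("inserted", top, top == "p"))
    return corrections


def visible(corrections, level):
    """Filter corrections down to those displayed at an htmlerror level."""
    shown = []
    for kind, tag, trivial in corrections:
        needed = 4 if trivial else (1 if kind == "inserted" else 3)
        if level >= needed:
            shown.append((kind, tag))
    return shown
```

For a table such as <table><tr><th>x<td>y</table>, visible(..., 3) is empty while visible(..., 4) lists the three inserted closing tags. If the </table> is also missing, the insertions happen at the end of the HTML and show up from level 1.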

Level 3 omits showing these most common lapses (that are harmless unless you consider strict compliance to newer HTML standards as a goal in itself) but shows nearly all other mistakes.

Level 2 omits showing ignored closing tags. So it shows non-trivial inserted closing tags (described in the next paragraph) and ignored attributes.

Level 1 omits showing ignored attributes. This means that it only shows when tags had to be inserted to close an unclosed or misnested tag (but never shows non-nesting tags unless they were inserted at the end of the filtered HTML, and never shows </p>).

Level 0 (the default) just fixes nesting errors but doesn't display any of them.

User settings

In user settings, you can select between htmlerror reporting levels of 0, 1, 2, or 3 to be used when you view nodes at PerlMonks. You can temporarily select any reporting level (including 4) by appending ;htmlerror=4 (for example) to any PerlMonks URL.[1]
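
A tiny Python helper makes the URL trick concrete. The function name is invented, the example URL is only illustrative, and using "?" when the URL has no query string yet is an assumption (the text only describes appending ;htmlerror=N).

```python
def with_htmlerror(url, level):
    """Return url with a temporary htmlerror reporting level appended."""
    if not 0 <= level <= 4:
        raise ValueError("htmlerror level must be between 0 and 4")
    # Assumption: start a query string with '?' if the URL does not have one yet.
    separator = ";" if "?" in url else "?"
    return "%s%shtmlerror=%d" % (url, separator, level)


# Illustrative URL only.
print(with_htmlerror("https://www.perlmonks.org/?node_id=328127", 4))
```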

Note that using an error reporting level of 3 will show you harmless "errors" so you shouldn't select this unless you can deal with seeing a lot of "mistakes" without becoming obnoxious in pointing them out to others.

When you start composing a new node, for the first preview you can select between htmlerror reporting levels 3 and 4 (the default choice is also controlled in user settings and defaults to 3). For previews after the first, you can pick any reporting level via a form element on the preview page.

[ The patches to Preview are a bit complicated and haven't been finished. At the time of this writing, the first preview uses your 'preview' level of error reporting and there is no form element for adjusting the level while previewing. ]

[1] You can't select 4 as your default error reporting level (except for when previewing your own nodes) because it reports harmless "errors" that we expect to be made often by many members and we don't want to hear complaints about such.

- tye        

Replies are listed 'Best First'.
Re: Site HTML filtering, Phase II
by Abigail-II (Bishop) on Feb 11, 2004 at 10:11 UTC
    My god, how utterly complicated. Can't we just have a setting that puts an implicit <code> and </code> around our postings? I mean, I know plain text, I know POD, I know LaTeX, I know HTML, but in the almost 2 years I've been posting here, I still haven't quite figured out what kind of language I need to speak here, and now it's changing again.

    When will O'Reilly publish a book about the Perlmonk markup language?


      Actually, the Perlmonks stuff is pretty simple. The only hard part is remembering which of the less common but harmless and useful HTML tags don't work. (ISTR that cite doesn't work, but I could be misremembering; maybe it was q that doesn't work. I'm not sure. I often just use them anyway, because when they're what you intend, there's nothing else with the right semantics.) That, and remembering the entity for escaping the left square bracket. (I usually just put code tags around it. Easier to remember.) If you want to see some needlessly complicated and gratuitously different site markup, have a look at Wikipedia sometime. I am continually thankful that Perlmonks markup is mostly just HTML.

      Can't we just have a setting that puts an implicit <code> and </code> around our postings?

      Well, you could always change your node template to that in protest. Such a protest would have about as much impact on the rest of us as Coruscate's XP/reputation/voting protest, but we'd all know where you stand on the issue.

      My first reaction when I read the description of these new changes is that the error checking is quite lenient. I suppose that's a good thing. If I had written the checker, it would probably just reject or escape anything that's not well-formed (in addition to anything that smacks of javascript), which would probably be a major annoyance to people who still write legacy HTML, of whom there are still quite a few out there I suspect, the number of years since XHTML was put forward notwithstanding. So, be happy that tye wrote it, because he did a pretty good job IMO of making the checker as lenient as could be reasonably hoped for. (There are people who would want no checker at all, but I think you understand why that would cause problems in practice.)

      update3: Hmmm... What I *thought* I saw was that it actually got stripped. What I *actually* discovered is that View Selection Source in Mozilla does not give exactly the same source as View->Page Source does. The former shows <hr> and the latter shows <hr />. Weird.

      $;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/
        The only hard part is remembering which of the less common but harmless and useful HTML tags don't work.
        No, no, no. The hard part is finding out which elements are named the same in both HTML and Perlmonks, but act differently. <code> for instance means something else in HTML than in Perlmonks. But I still haven't figured out how the <a> element is working on Perlmonks. Sometimes, it creates a link. Sometimes it appears as is.

        That, and remembering the entity for escaping the left square bracket. (I usually just put code tags around it. Easier to remember.)
        Easier to remember, but not easier to type. Having to type 13 extra characters to be able to type a common character in Perl isn't what I call "easy". At least in POD, you only need three extra characters: C<[>. And in POD, you don't even have to put any markup around a function() or a $variable. POD knows.
        If you want to see some needlessly complicated and gratuitously different site markup, have a look at Wikipedia sometime.
        Actually, I've contributed some bits to Wikipedia the last week. I vastly prefer the [[link]] syntax over [link] as it means one can use unescaped left brackets if they aren't followed by another left bracket. [..] is common when discussing perl. [[..]] is a rare appearance in Perl code. I also prefer mechanisms like ''foo'' or *bar* to make something emphasized/italics or strong/bold, like Wikis or news/mail readers do.


        The XML-style closing / gets stripped out too

        What? Yes, </hr> gets stripped now and didn't used to. But for some time now, <hr> has been changed to the XMLish <hr />.

        Oh, I see. There is a bug in that <hr /> can *report* (if you have error reporting set high enough) that the / was stripped when in fact it wasn't. I'll fix that soon.


        - tye        

        If you give me a list of tags, and where you think they should be allowed, I'll look at them. Can't promise more, I'm rather busy at present.

        Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).

      BTW, the node you are replying to isn't discussing any changes to how you mark up nodes at PerlMonks. Feel free to ignore it.

      The previous related node involved a fairly minor change: Instead of just expecting contributors to get their HTML elements properly nested, we are now checking for it and trying to fix any errors we find (trying to balance DWIM with code complexity/performance). We wouldn't be doing this except such errors can and do impact the contributions of others.

      This node is discussing (in quite a bit of detail) how much feedback you can choose to see from this process. If you find it too complicated for you to understand (or it just taxes your patience), then you should probably stop reading after the short summary (or just ignore it completely and keep the default settings or even just try different settings when you get bored).

      Implicit <code> tags would make for a rather ugly presentation (and a much less flexible one). I and others discuss POD elsewhere. With LaTeX, would we deliver the results as PDF or just big PNGs? (Sorry, I haven't used LaTeX in many years so I don't know how nice any LaTeX-to-HTML engines are -- but I suspect they'd take a lot more load than the current PerlMonks HTML production process.) Plain HTML would make posting Perl code difficult without using a program to help produce the HTML.

      I didn't have anything to do with the development of the "near-HTML-subset plus square bracket" syntax. I don't find it particularly hard to understand (and this was back when the documentation was much worse). And I appreciate the short cuts it provides (and realize it isn't a perfect choice for Perl, a language that makes fairly heavy use of nearly every printable ASCII character).

      If you simply want text, then the requirements are very simple:

      1. Put <p> where you want a blank line.
      2. Put <code> tags around any code (or other uses of &, <, >, [, and ] or text you need displayed in a fixed-width font, such as ASCII drawings). Try not to use this when you don't need it.

      You later complain about producing links. Plain text doesn't have links, so you need to decide whether you want plain text or not. If you want links, then please stop asking why you can't have plain text. (:

      - tye        

Re: Site HTML filtering, Phase II
by Anonymous Monk on Feb 11, 2004 at 15:18 UTC

    Testing. Try the following code:

    <h3>this is a broken title</h4> <!-- But it displays correctly --> <h3>this is a broken title spanning beyond the end of my post</h> <!-- It will "infect" all the page, till the end -->

      The code that formats replies needs to clean the HTML of each reply separately rather than format the whole list of replies and then filter the result. This is on my to-do list.

      - tye        

Re: Site HTML filtering, Phase II
by ysth (Canon) on Feb 16, 2004 at 04:16 UTC
    How would you feel about a ;htmlerror= level to show the source (as it would be shown in an Update window)? It would make it easier to see what's going on when other people's nodes come out strange (for instance, seeing what's up with the readmore tags on 329062). Obviously this level wouldn't be an option in user settings.

      There is already XML view. The whitespace gets compressed by the browser but you can 'view source' to see that if you need to.

      - tye        

        Thanks for the reminder, tye; does the XML view bypass the html correction? (Update: of course it does; should it?)

Node Type: monkdiscuss [id://328127]
Approved by Roger
Front-paged by gmax