http://www.perlmonks.org?node_id=177869

I've finished testing some changes to some lower-level aspects of the site and I'll be rolling those into production pretty soon.

These changes close some holes, such as being able to include unapproved tags (including javascript) in nodes or even chatter. They also help inexperienced users (and others) by escaping unapproved tags rather than removing them.

So, for example, if I type "No, not me <wink />" into the chatterbox, prior to my changes the <wink /> would simply be stripped when my message was displayed on the HTML page, probably causing confusion. After my change, the <wink /> is HTML-escaped to &lt;wink /&gt; before going to the chatterbox nodelet, so it appears as "<wink />". (Actually, the closing > isn't escaped right now; this doesn't matter in most cases and might change later.)

Also, unmatched < are escaped so chatting "<!-- hee hee" just shows exactly that rather than starting an HTML comment and hoping something else will eventually close it.
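A minimal sketch of the escape-rather-than-strip idea, written in Python rather than taken from the site's code. The APPROVED set, the TAG pattern, and the name filter_html are all hypothetical. Like the current behavior described above, it escapes only the opening < of an unapproved tag. It does not attempt to filter the attributes of approved tags.

```python
import re

# Hypothetical whitelist; the site's real list of approved tags is not given here.
APPROVED = {"a", "b", "i", "br", "code", "p"}

# A tag-like construct: "<", optional "/", a name, then anything up to ">".
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>")

def filter_html(text):
    out = []
    pos = 0
    for m in TAG.finditer(text):
        # Text between tags: any "<" left here is unmatched, so escape it.
        out.append(text[pos:m.start()].replace("<", "&lt;"))
        if m.group(2).lower() in APPROVED:
            out.append(m.group(0))               # approved tag passes through
        else:
            out.append("&lt;" + m.group(0)[1:])  # escape only the opening "<"
        pos = m.end()
    # Trailing text after the last tag gets the same treatment.
    out.append(text[pos:].replace("<", "&lt;"))
    return "".join(out)
```

With this sketch, filter_html("No, not me <wink />") gives "No, not me &lt;wink />", and filter_html("<!-- hee hee") gives "&lt;!-- hee hee", so neither the made-up tag nor the unclosed comment reaches the browser as markup.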

Before my change, [id://28] would produce something like <a href="/?node_id=28"></a>, which is pretty useless (in most browsers you can't see it and in many you can't "click" it even if you guess where it is). After the change it will produce <a href="/?node_id=28">&#91;no such node, ID 28&#93;</a> ([no such node, ID 28]) so that everyone can more easily figure out what "broke".

Similarly, linking to nodes with no title produced the same mostly useless links. After the change, "[untitled node, ID 58043]" will be used for the link.
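The fallback link text can be sketched as below. This is an illustration in Python, not the site's code: the NODES dictionary is a hypothetical stand-in for the node database, and link_for is a made-up name. The square brackets are written as numeric entities, as in the output shown above, on the assumption that this keeps them from being parsed again as link syntax.

```python
import html

# Hypothetical stand-in for the node database: node ID -> title.
NODES = {1: "A titled node", 58043: ""}

def link_for(node_id):
    if node_id not in NODES:
        label = f"[no such node, ID {node_id}]"
    elif not NODES[node_id]:
        label = f"[untitled node, ID {node_id}]"
    else:
        label = NODES[node_id]
    # Escape the label as text, then encode the brackets as entities.
    label = html.escape(label).replace("[", "&#91;").replace("]", "&#93;")
    return f'<a href="/?node_id={node_id}">{label}</a>'
```

Either way the link always has visible text, so there is something to see and click.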

Another part of the change is that node titles (which include monk names) are officially being designated as text strings and not HTML strings. My searching of node (and user) titles shows that this has long been the prevalent assumption. Unfortunately, the PM code base almost always displayed node titles as HTML strings.

Several times in the past this inconsistency was pointed out and each time things were changed to prevent HTML from being used in node titles. With one exception...

Although we have several cases of nodes (and users) that include & in a title with the clear intent of having a literal & result, we have recently had a few cases of people using & to indicate an HTML entity. A few of these cases were to overcome the previous patches that made all of < > [ ] illegal in node titles. But most of them can best be described as "mischief". For example, the users &nbsp;&nbsp; and tye&nbsp;.

There is a lot of room for abuse of HTML entities. For example, I could register a user stefan&nbsp;k and make trouble for stefan k. I could even register a user &#118;room (vroom) and /msg ppl that they need to send me their password so I can fix something. And they make searching more complicated ("Should I search for résumé as résumé or as r&eacute;sum&eacute;?").

If you want interesting characters in a node title, then you'll have to restrict yourself to those in the Latin-1 character set and you'll have to enter them directly (not as HTML entities). After we find more of the places that still need HTML escaping, we'll remove the restriction against putting any of < > [ ] in node titles and you'll be able to write a node about "Why is [3,2,1]->[2] < 2 ?" by just typing that title in.

Now, the reason this was the one exception is that most browsers handle invalid HTML entities by displaying them literally. So, entering "XML & Perl" for a node title would result in something that looked fine. So uses of & to mean "&" were silently succeeding due to browsers being forgiving and there was no pressure to "fix" anything.
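Treating a title as text means escaping it every time it is written into a page. A minimal sketch, using Python's standard html.escape; the function name show_title is hypothetical and this is not the site's code:

```python
import html

def show_title(title):
    # The stored title is plain text, so "&", "<" and ">" are all escaped
    # on output; nothing in the title can act as markup or as an entity.
    return html.escape(title, quote=False)
```

With this, "XML & Perl" is emitted as "XML &amp; Perl" and no longer relies on browsers being forgiving, while a title of "&#118;room" is emitted as "&amp;#118;room" and so displays as those literal characters instead of as "vroom".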

And the big exception in this case is one user using an HTML entity on all of their node titles. However, this same person said "something as obnoxious as •" when describing their own actions. So I'm not inclined to complicate a lot of aspects of the site in order to accommodate the self-described obnoxious habits of a single user. Note that a lot of visitors have been seeing &bull; in those node titles for a long time (because their browser doesn't know this HTML entity) so the change will just make this true for more visitors.

Now, the &bull; will be more obnoxious than • so we'll give each user an opportunity to batch-update their node titles to remove &bull;. We'll probably even allow you to change the &bull; to · (which is the closest thing to • in Latin-1).

We will replace other HTML entities in existing node titles with their unescaped characters when the changes are put into production.
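How that conversion will be done isn't spelled out here. One possible shape for it, as a hedged Python sketch: decode the entities with the standard html.unescape, then check that the result still fits in Latin-1. The function name convert_title and the choice to return None for titles that don't fit are assumptions made for the illustration.

```python
import html

def convert_title(title):
    # Turn entities such as "&eacute;" into the characters they stand for.
    text = html.unescape(title)
    try:
        # Titles are limited to Latin-1, so test whether the result fits.
        text.encode("latin-1")
    except UnicodeEncodeError:
        return None  # e.g. a bullet; left for the owner to fix by hand
    return text
```

Under these assumptions "r&eacute;sum&eacute;" becomes "résumé", while a title containing &bull; yields None, because • is not in Latin-1.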

The user-HTML parsing has also been improved so that it recognizes more near-HTML constructs (like "< / a >" which the previous HTML parsing almost recognized) and produces results more like XHTML (transforming <BR> into <br /> for example).
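A rough sketch of that kind of normalization, in Python and not the site's parser. The NEAR_TAG pattern, the VOID set, and the name normalize are hypothetical, and attributes are ignored to keep the example short.

```python
import re

# A tag-like construct with optional stray whitespace: "< / a >", "<BR>", "<br/>".
NEAR_TAG = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)\s*(/?)\s*>")

# Assumed set of elements that never take a closing tag.
VOID = {"br", "hr"}

def normalize(text):
    def fix(m):
        closing, name = m.group(1), m.group(2).lower()
        if name in VOID:
            return f"<{name} />"     # XHTML-style self-closing form
        return f"<{closing}{name}>"  # lowercase the name, drop stray spaces
    return NEAR_TAG.sub(fix, text)
```

Here normalize("< / a >") gives "</a>" and normalize("<BR>") gives "<br />". A pattern this loose would also treat "a < b > c" as containing a tag, so a real parser needs to be more careful than this.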

These changes should help in future efforts to extend [...] linking to be able to do many things that people often think it should be able to do but can't ([id://108949&user=tye], [user://grep], ...).

        - tye (but my friends call me "Tye")