More HTML escaping

I've finished testing some changes to some lower-level aspects of the site and I'll be rolling those into production pretty soon.

These changes close some holes such as being able to include unapproved tags (including javascript) into nodes or even chatter. They also help inexperienced users (and others) by escaping unapproved tags rather than removing them.

So, for example, if I type "No, not me <wink />" into the chatterbox, prior to my changes the <wink /> would simply be stripped when my message is displayed on the HTML page, probably causing confusion. After my change, the <wink /> would be HTML escaped to &lt;wink /> (actually, the closing > isn't escaped right now, though this doesn't matter in most cases and might change later) before going to the chatterbox nodelet so it would appear as "<wink />".

Also, unmatched < are escaped so chatting "<!-- hee hee" just shows exactly that rather than starting an HTML comment and hoping something else will eventually close it.
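
A rough sketch of the escape-instead-of-strip idea, in Python rather than the site's own Perl. This is not the actual PM code: the `APPROVED_TAGS` list, the regex, and the function name are all assumptions made for illustration, and the sketch does not vet attributes on approved tags.

```python
import re

# Hypothetical whitelist; the real list of approved tags is not given in this post.
APPROVED_TAGS = {"a", "b", "i", "br", "p", "code"}

# Something shaped like a tag: "<", optional "/", a name, then anything up to ">".
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>")

def escape_unapproved(text):
    """Keep approved tags; escape the leading '<' of everything else."""
    out = []
    pos = 0
    for m in TAG_RE.finditer(text):
        # Any '<' between tag-shaped matches is unmatched, so escape it.
        out.append(text[pos:m.start()].replace("<", "&lt;"))
        if m.group(2).lower() in APPROVED_TAGS:
            out.append(m.group(0))
        else:
            # Like the change described above, only the opening '<' is escaped.
            out.append("&lt;" + m.group(0)[1:])
        pos = m.end()
    out.append(text[pos:].replace("<", "&lt;"))
    return "".join(out)
```

With this sketch, `escape_unapproved("No, not me <wink />")` gives `No, not me &lt;wink />`, and an unmatched `<!-- hee hee` comes out as `&lt;!-- hee hee`.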

Before my change, [id://28] would produce something like <a href="/?node_id=28"></a> which is pretty useless (in most browsers you can't see it and in many you can't "click" it even if you guess where it is). After the change it will produce <a href="/?node_id=28">[no such node, ID 28]</a> ([no such node, ID 28]) so that everyone can more easily figure out what "broke".

Similarly, linking to nodes with no title produced the same mostly useless links. After the change, "[untitled node, ID 58043]" will be used for the link.
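
The fallback link text could look something like the following Python sketch. The `nodes` dict and the `node_link` name are hypothetical stand-ins for a real node lookup; only the two bracketed labels and the URL shape come from the post.

```python
from html import escape

def node_link(node_id, nodes):
    """Build an <a> element for a node ID, never with empty link text.

    `nodes` is a hypothetical dict mapping node ID -> title
    (None or "" when the node has no title).
    """
    if node_id not in nodes:
        label = f"[no such node, ID {node_id}]"
    elif not nodes[node_id]:
        label = f"[untitled node, ID {node_id}]"
    else:
        label = nodes[node_id]
    # Titles are treated as plain text, so they are escaped on output.
    return f'<a href="/?node_id={node_id}">{escape(label)}</a>'
```

Here `node_link(28, {})` yields `<a href="/?node_id=28">[no such node, ID 28]</a>`, and a title such as "XML & Perl" is emitted as `XML &amp; Perl` inside the link.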

Another part of the change is that node titles (which include monk names) are officially being designated as text strings and not HTML strings. My searching of node (and user) titles shows that this has long been the prevalent assumption. Unfortunately, the PM code base almost always displayed node titles as HTML strings.

Several times in the past this inconsistency was pointed out and each time things were changed to prevent HTML from being used in node titles. With one exception...

Although we have several cases of nodes (and users) that include & in a title with the clear intent of having a literal & result, we have recently had a few cases of people using & to indicate an HTML entity. A few of these cases were to overcome the previous patches that made all of < > [ ] illegal in node titles. But most of them can best be described as "mischief". For example, the users    and tye .

There is lots of room for abuse of HTML entities. For example, I could register a user stefan k and make trouble for stefan k. I could even register a user vroom (vroom) and /msg ppl that they need to send me their password so I can fix something. And they make searching more complicated ("Should I search for résumé as résumé or as résumé?").

If you want interesting characters in a node title, then you'll have to restrict yourself to those in the Latin-1 character set and you'll have to enter them directly (not as HTML entities). After we find more of the places that still need HTML escaping, we'll remove the restriction against putting any of < > [ ] in node titles and you'll be able to write a node about "Why is [3,2,1]->[2] < 2 ?" by just typing that title in.

Now, the reason this was the one exception is that most browsers handle invalid HTML entities by displaying them literally. So, entering "XML & Perl" for a node title would result in something that looked fine. So uses of & to mean "&amp;" were silently succeeding due to browsers being forgiving and there was no pressure to "fix" anything.

And the big exception in this case is one user using an HTML entity on all of their node titles. However, this same person said "something as obnoxious as •" when describing their own actions. So I'm not inclined to complicate a lot of aspects of the site in order to accommodate the self-described obnoxious habits of a single user. Note that a lot of visitors have been seeing &bull; in those node titles for a long time (because their browser doesn't know this HTML entity) so the change will just make this true for more visitors.

Now, the &bull; will be more obnoxious than • so we'll give each user an opportunity to batch-update their node titles to remove &bull;. We'll probably even allow you to change the &bull; to · (which is the closest thing to • in Latin-1).

We will replace other HTML entities in existing node titles with their unescaped characters when the changes are put into production.
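
As an illustration only, here is one way that kind of title cleanup might be sketched in Python. The function name and the choice to leave non-Latin-1 entities (such as &bull;) untouched are assumptions; the post only says those are handled by a separate batch update.

```python
import re
from html import unescape

# Named, decimal, or hexadecimal entity references ending in ";".
ENTITY_RE = re.compile(r"&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")

def decode_title(title):
    """Replace entities with their characters when the result fits in Latin-1."""
    def repl(m):
        ch = unescape(m.group(0))
        # Leave the text alone if the entity is unknown or outside Latin-1.
        if ch == m.group(0) or len(ch) != 1 or ord(ch) > 0xFF:
            return m.group(0)
        return ch
    return ENTITY_RE.sub(repl, title)
```

A lone & (as in "XML & Perl") is not an entity reference and passes through unchanged.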

The user-HTML parsing has also been improved so that it recognizes more near-HTML constructs (like "< / a >" which the previous HTML parsing almost recognized) and produces results more like xHTML (transforming <BR> into <br /> for example).
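
A minimal Python sketch of that kind of normalization, handling only attribute-less tags. The regex, the `VOID_TAGS` set, and the function names are assumptions for illustration, not the real parser; note that a pattern this loose would also rewrite text like "a < b > c", which a real implementation would need to guard against.

```python
import re

# Hypothetical set of tags that get the xHTML-style " />" ending.
VOID_TAGS = {"br", "hr", "img"}

# A tag with optional stray whitespace around the slash and the name.
NEAR_TAG_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)\s*(/?)\s*>")

def normalize_tag(m):
    closing, name, self_closing = m.group(1), m.group(2).lower(), m.group(3)
    if closing:
        return f"</{name}>"
    if name in VOID_TAGS or self_closing:
        return f"<{name} />"
    return f"<{name}>"

def normalize(text):
    """Tidy near-HTML tags into lowercase, xHTML-style form."""
    return NEAR_TAG_RE.sub(normalize_tag, text)
```

For example, `normalize("line<BR>break")` gives `line<br />break` and `normalize("< / a >")` gives `</a>`.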

These changes should help in future efforts to extend [...] linking to be able to do many things that people often think it should be able to do but can't ([id://108949&user=tye], [user://grep], ...).

- tye (but my friends call me "Tye")

Replies are listed 'Best First'.
Re: More HTML escaping by Aristotle (Chancellor) on Jun 27, 2002 at 22:30 UTC
Looking forward to the changes. I know you've been putting quite a bit of work into the site recently, and wanted to express my thanks once again. Makeshifts last the longest.
(FoxUni) Re(2): More HTML escaping by FoxtrotUniform (Prior) on Jun 27, 2002 at 22:42 UTC
Strongly seconded. tye++ `-- The hell with paco, vote for Erudil! :wq`
•Re: (FoxUni) Re(2): More HTML escaping by merlyn (Sage) on Jun 27, 2002 at 22:43 UTC
A perfect example of VANITY TAGGING that should be forbidden the moment my `&bull;` also stops working. {sigh} -- Randal L. Schwartz, Perl hacker
Re: •Re: (FoxUni) Re(2): More HTML escaping by ignatz (Vicar) on Jun 28, 2002 at 13:49 UTC
Re: More HTML escaping by Abigail-II (Bishop) on Jul 04, 2002 at 14:19 UTC
Now, the reason this was the one exception is that most browsers handle invalid HTML entities by displaying them literally. So, entering "XML & Perl" for a node title would result in something that looked fine. So uses of & to mean "&amp;" were silently succeeding due to browsers being forgiving and there was no pressure to "fix" anything. I would like to point out that browsers displaying "XML & Perl" when encountering "XML & Perl" in an HTML document are actually doing the right thing. Let's not forget that HTML is an SGML application, and not an XML one. And SGML is to XML as Perl is to Python: it doesn't impose artificial rules on the author (programmer) just to make parsing easier. A lone & is just fine in HTML. From RFC1866: An ampersand is only recognized as markup when it is followed by a letter or a `#' and a digit. (Yes, I know the status of RFC1866 is "Historic", but this refers to SGML (ISO 8879:1986), a standard that hasn't been withdrawn). Abigail
•Re: More HTML escaping by merlyn (Sage) on Jun 27, 2002 at 22:41 UTC
And the big exception in this case is one user using an HTML entity on all of their node titles. However, this same person said "something as obnoxious as •" when describing their own actions. OK, I'm all for stopping this, but not until ALL VANITY TAGS are forbidden. After all, that's really the only reason I've persisted in doing this for the past few months. Is this now the new Monastery policy? If so, I'll support it. If not, I will ask that `&bull;` continue to work, so I can continue my obvious protest. -- Randal L. Schwartz, Perl hacker
(tye)Re: More HTML escaping by tye (Sage) on Jul 03, 2002 at 15:18 UTC
[ Note that I am replying to this node after so long a delay because I had hoped that no reply would be necessary but recently received a request that I felt was best replied to here. ] First, let me repeat that `&bull;` was not a motivating factor behind this change. The main connection between this change and `&bull;` was that `&bull;` delayed this change. In the end, I decided to make this change despite it disabling `&bull;` because I felt that `&bull;` was considered obnoxious by many, including you. I don't see any palatable technical way to outlaw personalized reply indicators. I see a policy-only ban as being doomed to failure after a lot of time spent by editors and/or complaining and finger pointing. And I don't see a consensus supporting such a ban. "Vanity tags" is a cute term for it, but (just to be clear) I don't include "(tye)" in the titles of my replies out of vanity. If that were the reason, then this thread would have started at "(tye) More HTML escaping". I've seen that a few people have stopped similar practices now that author information is reported in Search Results. But making the author easier to determine was never even a big part of my motivation. I just find duplicate node titles to be a pain. And the easiest way for me to ensure that my replies have a unique title is to usually include "(tye)" because that is a short, simple string that I can be pretty sure will not be used in other node titles. Then I can just keep track of how many times I reply in a thread (w/o modifying the title for other reasons) and include a sequence number in replies other than the first one. There are lots of cases where duplicate titles cause me problems. Recently I went back and added "(tye)" to (tye)Re: A question of style because I often refer to that node when questions come up about what difference `&` and `()` make when calling Perl subroutines. Depending on how I'm surfing, looking up the node ID for that item can be pretty easy or nearly impossible.
But now I can link to it easily without having to memorize a node ID (I have a hard time remembering the node ID of tye and I actually end up using that quite a bit). Now, I have other ideas of how to make node titles unique more often. For example, I kinda like the idea of prefixing reply titles with "Re2.1.3: " to indicate that this is the 3rd reply to the 1st reply to the 2nd reply to the root of this thread. It is a bit shorter than the current "Re: Re: Re: ". Or even "Re213: " with the "." only used around multi-digit parts. But I strongly suspect that such a change would be deemed "ugly" by a great many of our members. After all, I've heard other proposals that I find pretty ugly. (: And even if we implemented a unique reply indicator, I'd probably still end up using "(tye)Re: " on at least some of my replies because I wouldn't want to have to remember whether my node was Re4: A question of style or Re6: A question of style. Some related thoughts (and ones that I've mostly just repeated or refined above) can be found in (tye)Re: Dingbats in node titles: What's your opinion. I'm not sure why you've continued this "protest" as long as you have. I haven't seen anyone express any support for your idea of banning all personalized reply indicators. Sure, I've seen people support the idea of banning "dingbat characters" in node titles and I've seen people express a dislike for "(tye)" in node titles. But no one saying we should try to ban them. Perhaps you have received support in private? I would think that such a strong lack of public displays of support for your proposal would lead you to abandon your protest aimed at getting such a ban imposed. I think Macphisto was accurate in calling this "tilting at windmills". I don't foresee such a ban ever happening (because I don't see any practical way to implement it, even if there were support for it) and so I don't feel bad about having forced you to change your method of (futile) protest.
Feel free to switch to · or stay with `&bull;` if you wish to continue your protest. - tye (but my friends call me "Tye")
•PLEASE STOP VANITY TAGGING by merlyn (Sage) on Jul 03, 2002 at 15:41 UTC
I just find duplicate node titles to be a pain. And that's why `[id://nnnn]` was invented. Use it. Stop tagging your articles. Tell others to stop tagging theirs, now that we have both authors in search results and unique-ID'ed references. And I'll stop tagging mine. But you have to make it "a request from the gods", or it won't be followed, knowing how things get done around here. But don't break my style of tagging just because it's different. That's what I object to: I seem to be being singled out, while others continue to be permitted to tag without hesitation. Unfair. Would you rather I tag each of my responses with (PLEASE STOP VANITY TAGGING)? I might just start doing that to get the point across. Tagging was useful a year ago. It's persisted only because nobody realizes why it started. Let's stop it now! -- Randal L. Schwartz, Perl hacker
Re: PLEASE STOP TILTING AT WINDMILLS by Macphisto (Hermit) on Jul 03, 2002 at 16:38 UTC
Re: •PLEASE STOP VANITY TAGGING by shotgunefx (Parson) on Jul 03, 2002 at 17:28 UTC
•Re: Re: •PLEASE STOP VANITY TAGGING by merlyn (Sage) on Jul 03, 2002 at 17:35 UTC
Re^2: More HTML escaping by Aristotle (Chancellor) on Jul 04, 2002 at 16:04 UTC
Recently I went back and added "(tye)" to (tye)Re: A question of style because I often refer to that node when questions about what difference & and () make when calling Perl subroutines. [..] now I can link to it easily without having to memorize a node ID (I have a hard time remembering the node ID of tye and I actually end up using that quite a bit). I have a few of those also - scant, and not referred to quite as often, but the point stands. I would go so far as to say that such nodes deserve an edited, unique title that identifies them by content, rather than author. As far as looking them up is concerned, I found the Personal Nodelet a good place for the really frequently used ones, while the homenode does a sufficient job for the occasionally useful ones. As an admittedly more clunky alternative, one might also refer people to an <a name="faqchest">'ed section on one's homenode rather than directly to the node in question (and who knows, they might linger and read the other interesting nodes as well). I'm not going to start a crusade against vanity tagging, since although I do occasionally find it annoying, I don't feel it is enough so to warrant more than a silent protest by cleaning others' tags from the subjects of my posts. My point is that I do agree with merlyn in that I don't see how tags offer anything that cannot be achieved otherwise with no more or very little more effort. For the record, Re(1.2.1.3.1) style subjects would certainly be less ugly than Re: •Re: (tye)Re: (FoxUni)Re: Re: foo bar. Makeshifts last the longest.
(tye)Re: More HTML escaping by tye (Sage) on Jul 04, 2002 at 18:27 UTC
Re: More HTML escaping by Macphisto (Hermit) on Jun 28, 2002 at 03:24 UTC
Just to remove the mystery: I downvoted on this because I think it should never have come to pmdev even needing to consider the removal of your • - you should have stopped it on your own. Stop tilting at windmills and you'll get some ++'s from me. Everyone has their demons...you just happen to mine.

	PerlMonks