Golf: Fix de facto HTML comments
by tye (Sage) on Jul 18, 2004 at 18:58 UTC
Your challenge is to 'golf' some Perl code (produce code that requires the fewest [key] strokes -- fewest characters) that mostly just does s/--/¬-/g, but with some simple restrictions. I was surprised that I implemented this simple task over a dozen times before I finally got it right. I golfed mine down to 80 characters, so I wanted to see what y'all can come up with. Getting a correct solution may be a bigger challenge than golfing the solution.
A 'de facto HTML comment' is started by "<!--" and ended by "-->" and can contain anything between those two delimiters except, of course, "-->". This is such a nice, simple, easy-to-parse definition that it has advantages over a standard HTML comment.
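To make that definition concrete, here is a minimal, ungolfed sketch that pulls out the body of each de facto comment. The name `defacto_comment_bodies` is made up for illustration, and the sketch assumes the closing "-->" may not overlap the opening "<!--" (so "<!--->" alone is not a complete comment). It only finds comments; it does not attempt the golf task.

```perl
use strict;
use warnings;

# Hypothetical helper: return the bodies of all de facto HTML comments.
# A de facto comment starts at "<!--" and runs to the first "-->" after it.
sub defacto_comment_bodies {
    my ($html) = @_;
    my @bodies;
    # .*? is non-greedy, so each match stops at the first "-->";
    # /s lets "." match newlines, since a comment may span lines.
    while ( $html =~ /<!--(.*?)-->/sg ) {
        push @bodies, $1;
    }
    return @bodies;
}

my @bodies = defacto_comment_bodies("a <!-- one --> b <!-- two -- three --> c");
print scalar(@bodies), " comments found\n";
```

The non-greedy match is what encodes "anything between those two delimiters except -->".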
Some (notorious but still very popular) browsers only handle de facto HTML comments. Many browsers only handle standard HTML comments.[1]
Your task is to golf some code that will adjust de facto HTML comments so that they are also standard HTML comments. I'll let those who are curious about the details of standard HTML comments visit Google. The only detail we need to worry about for the golf is that "--" inside of a de facto HTML comment is the problem.
Although "<!-- foo -- -- bar -->" is a valid HTML comment according to both the standard and de facto definitions, I'll make the task much easier by just requiring that all occurrences of "--" be replaced inside of the de facto comments. But we want to change as few pixels as possible so we'll transform the above comment to something like "<!-- foo ¬- -¬ bar -->".
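When trying candidate solutions, it can help to check mechanically whether any "--" is still left inside a comment. The sketch below is only such a checker, not a solution. The name `has_double_dash_in_comment` is hypothetical, and it assumes "inside" means the text strictly between the opening "<!--" and the first following "-->". Passing it is a necessary condition at best; it does not validate the full standard comment syntax.

```perl
use strict;
use warnings;

# Hypothetical checker: true if any de facto comment body in $html
# still contains "--". Text outside comments is ignored.
sub has_double_dash_in_comment {
    my ($html) = @_;
    while ( $html =~ /<!--(.*?)-->/sg ) {
        return 1 if index( $1, '--' ) >= 0;
    }
    return 0;
}

for my $sample ( '<!-- foo -- -- bar -->', '<!-- foo bar -->', 'a -- b <!-- c -->' ) {
    my $verdict = has_double_dash_in_comment($sample) ? 'needs fixing' : 'ok';
    printf "%-26s %s\n", $sample, $verdict;
}
```

Running a candidate's output through a check like this catches cases where a substitution leaves, or creates, an adjacent pair of dashes.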
If you can code a solution that changes even fewer characters but still makes sure each de facto comment ends up also being a standard comment, then you'll get bonus points (in the tradition of Whose Line Is It Anyway).
I chose "¬" (the "not" symbol, "\xAC", &not; = &#172;) because it looks a lot like "-" in most fonts and is still in Latin-1. The soft hyphen (&shy; = &#173;) looks even closer to "-" but shouldn't be displayed at all in most cases, so I rejected it. The en dash is "–" (&ndash;, &#8211;) and is "\x96" in Windows-1252 (Microsoft's extension to Latin-1, which is nearly the de facto interpretation of "Latin-1"), and it looks even more like "-". But some browsers are still standards-compliant enough that they won't display that. How does your browser display it (–)?
Later I'll post my solution and some test code that covers some of the rules. For now, I don't want to hint at techniques to try.
Here is some test data (but don't assume this is the only data you need to handle):
[1] Some browsers don't manage to get either definition right. I have a copy of Opera that appears to require < and > to be balanced inside of HTML comments. Opera impresses me both with its nice features and how it manages to have bugs that are just so, well, stupid. (: