http://www.perlmonks.org?node_id=274716


in reply to Perl Monks hypocrisy

Can't anyone who maintains this board figure out how to add an auto-line-break feature?

Just like most people, you use <p> tags. So you know why line-breaks are bad. Textareas may or may not wrap text. When a textarea does not wrap text, the user is likely to hit the return key in places where a <br> is not wanted.

(note the missing semicolon)

Quoting the HTML 4.0 specification:

Note. In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.
As you can see, the semicolon is recommended, not required.

Well, thanks to my regex HTML parser, I discovered ...

Your parser being regex based is not relevant here. Regular expressions CAN be used to parse HTML. But a set of regexes that parses any HTML document correctly is much less efficient than something based on HTML::Parser. And it is very unlikely that your parser handles every feature that HTML offers.

I guess you could also infer from this post that I pay no mind to my reputation here.

In other words: you're a troll. Please troll elsewhere lest more people feed you.

it did find 283 other errors in http://perlmonks.com/index.pl.

I'm sure your patches are more than welcome. But for now: it works, so let's not break it while trying to fix a problem that isn't there in the first place.

Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

Re^2: Perl Monks hypocrisy (HTML parsing speed)
by tye (Sage) on Jul 16, 2003 at 15:33 UTC
    But a set of regexes that parses any HTML document correctly is much less efficient than something based on HTML::Parser.

    As I recall, we tried a module based on HTML::Parser but had to drop it because it was way too slow (10-times slower, IIRC). PM uses a single regex to split the HTML into tokens and another regex to deal with filtering attributes in those tokens.
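    The two-regex approach described above can be sketched roughly as follows. This is only an illustration in Python, not the site's actual code: the patterns, the function names and the ALLOWED_ATTRS whitelist are all assumptions made up for the example.

```python
import re

# One regex splits markup into tokens: comments, tags, runs of text,
# or a stray '<' that never closes. (Illustrative pattern only.)
TOKEN_RE = re.compile(r"<!--.*?-->|<[^<>]*>|[^<]+|<", re.DOTALL)

# A second regex pulls name="value" pairs out of a single tag token.
ATTR_RE = re.compile(r'\s+([A-Za-z][\w-]*)\s*=\s*"([^"]*)"')

ALLOWED_ATTRS = {"href", "title"}  # hypothetical whitelist


def tokenize(html):
    """Split html into tokens; joining the tokens gives the input back."""
    return TOKEN_RE.findall(html)


def filter_attrs(token):
    """Rebuild a start tag with only the whitelisted attributes."""
    m = re.match(r"<\s*([A-Za-z][\w-]*)", token)
    if not m:
        return token  # text, end tags, comments and stray '<' pass through
    kept = [
        f' {name}="{value}"'
        for name, value in ATTR_RE.findall(token)
        if name.lower() in ALLOWED_ATTRS
    ]
    return "<" + m.group(1) + "".join(kept) + ">"


tokens = tokenize('<a href="x" onclick="evil()">hi</a>')
cleaned = "".join(filter_attrs(t) for t in tokens)
print(cleaned)  # <a href="x">hi</a>
```

    A real filter would also have to decide what to do with unquoted attribute values, '>' inside attribute values and disallowed tags; the sketch ignores all of that.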

    There are two main reasons that I'd advise someone to not "parse HTML with (a) regex(es)". Performance is not one of them.

    The first is that you probably shouldn't use something like /<td>(.*?)<\/td>/ because there is no way to make that ignore HTML comments that contain similar HTML. The other is that this approach can look easy but end up being very hard, so it is often less work in the long run to use a decent module from the start, even though that often looks like the more difficult approach.
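    A minimal sketch of the comment problem, using Python's re and html.parser as stand-ins for a Perl regex and HTML::Parser. The PAGE string and the CellCollector class are invented for the example.

```python
import re
from html.parser import HTMLParser

# Made-up input: one commented-out cell followed by one real cell.
PAGE = "<table><tr><!-- <td>old</td> --><td>new</td></tr></table>"

# The naive pattern cannot tell that the first <td> sits inside a comment.
naive = re.findall(r"<td>(.*?)</td>", PAGE)


class CellCollector(HTMLParser):
    """Collect the text of each real <td>; comments never reach these hooks."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data


collector = CellCollector()
collector.feed(PAGE)
collector.close()

print(naive)            # ['old', 'new']
print(collector.cells)  # ['new']
```

    The regex reports the commented-out cell as if it were live markup, while the parser hands the comment to a separate callback and only sees the real cell.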

    Update: The "HTML" that we parse is stuff typed in by our users "by hand". So our HTML parser (the regex) intentionally deals with certain border cases in specific ways. No, it does not strictly follow any one of the many HTML standards we have to choose from.

                    - tye

      As I recall, we tried a module based on HTML::Parser but had to drop it because it was way too slow (10-times slower, IIRC).

      The speed has everything to do with the complexity of your parser. If you don't need to follow specifics, and don't need to implement the usual browser quirks, a single regex is often a lot more efficient. It's up to the end user to benchmark it. Unfortunately, most novices don't know how to write the regex, don't know how to write an HTML::Parser based script and don't know how to benchmark.
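      For what it's worth, a benchmark of this kind does not need to be elaborate. The sketch below uses Python's timeit, with re and html.parser standing in for a Perl regex and HTML::Parser. The SAMPLE input and both counting functions are invented for the example, and the timings will depend heavily on the input and on how much work the parser callbacks do.

```python
import re
import timeit
from html.parser import HTMLParser

# Made-up input; benchmark against real pages in practice.
SAMPLE = "<p>Hello <b>world</b></p>\n" * 200

TAG_RE = re.compile(r"<[^<>]*>")


def count_tags_regex(html):
    """Count tags with a single regex (no comment or quirk handling)."""
    return len(TAG_RE.findall(html))


class TagCounter(HTMLParser):
    """Count start and end tags via parser callbacks."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        self.count += 1

    def handle_endtag(self, tag):
        self.count += 1


def count_tags_parser(html):
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter.count


if __name__ == "__main__":
    # Both approaches must agree before their speed is worth comparing.
    assert count_tags_regex(SAMPLE) == count_tags_parser(SAMPLE)
    for fn in (count_tags_regex, count_tags_parser):
        seconds = timeit.timeit(lambda: fn(SAMPLE), number=50)
        print(f"{fn.__name__}: {seconds:.3f}s for 50 runs")
```

      Checking that both versions give the same answer first matters more than the numbers: a fast regex that silently miscounts is not a fair comparison.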

      Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

Re:x2 Perl Monks hypocrisy (&semi;)
by grinder (Bishop) on Jul 16, 2003 at 17:54 UTC
    Note. In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.
    As you can see, the semicolon is recommended, not required.

    Sorry, I have to side with Wassercrats on this one. Just because you can, sometimes, doesn't mean you should. It is easy to get that semi-colon in there. I've known a number of browsers over the years that never correctly rendered an entity lacking a semi-colon. Either they let it go through textually, or ate the remaining characters up to the end of the line.

    Even Mozilla had this problem up until a year or so ago. If you can count on a semi-colon being required you simplify the parsing greatly. That SGML merely recommends the semi-colon, rather than requiring it, is not a good basis for choosing to leave it out. SGML has all sorts of markup minimisation short cuts available, because at the time people were paid to key stuff in, paid by the keystroke, and there were no fancy GUI editors around. Plus, it's just more comfortable to be able to omit needless stuff.
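    To make the parsing point concrete, here is a rough sketch in Python of a strict decoder (semi-colon required) next to a lenient one (semi-colon optional). The ENTITIES table is a tiny illustrative subset, and both patterns and the decode helper are assumptions for the example, not any browser's actual algorithm.

```python
import re

# Small illustrative subset of the named entities.
ENTITIES = {"amp": "&", "lt": "<", "gt": ">", "quot": '"'}

# Strict (XML-style): a reference only counts when it ends in ';'.
STRICT_RE = re.compile(r"&([A-Za-z]+);")

# Lenient (SGML-style): the ';' is optional, so the decoder has to guess
# where the entity name ends.
LENIENT_RE = re.compile(r"&([A-Za-z]+);?")


def decode(text, pattern):
    """Replace known entity references; leave unknown names untouched."""
    def repl(match):
        return ENTITIES.get(match.group(1), match.group(0))
    return pattern.sub(repl, text)


print(decode("a &amp b", STRICT_RE))    # a &amp b
print(decode("a &amp b", LENIENT_RE))   # a & b
print(decode("R&D", LENIENT_RE))        # R&D
print(decode("?a=1&lt=2", STRICT_RE))   # ?a=1&lt=2
print(decode("?a=1&lt=2", LENIENT_RE))  # ?a=1<=2
```

    The last line is the kind of surprise the optional semi-colon invites: a query string parameter that happens to be named like an entity gets rewritten. With the strict pattern there is nothing to guess.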

    This made the job of writing an SGML parser a Herculean undertaking. James Clark is about the only person who really pulled it off.

    A much more reasonable comparison would be to consider XML. There, the trailing semi-colon is mandatory. This is because Tim Bray and the team that created XML wanted something that was easy to parse. Easier than full SGML in any case, and in comparison to that they succeeded admirably.

    I realise that the problem is difficult for Perlmonks. It would be feasible to make sure that any HTML generated directly by Everything is well-formed, but this does not take into account what passes for HTML typed in by the site's population.

    Argh, just thinking about &, &amp, &amp; and R&D and what Everything makes of them makes my brain hurt :)

    _____________________________________________
    Come to YAPC::Europe 2003 in Paris, 23-25 July 2003.
