Perl Monks hypocrisy

http://www.perlmonks.org?node_id=274711

This node falls below the community's threshold of quality. You may see it by logging in.

Replies are listed 'Best First'.

Re: Perl Monks hypocrisy
by Juerd (Abbot) on Jul 16, 2003 at 07:35 UTC

Can't anyone who maintains this board figure out how to add an auto-line-break feature?

Just like most people, you use <p> tags. So you know why line-breaks are bad. Textareas may or may not wrap text. When a textarea does not wrap text, the user is likely to hit the return key in places where a <br> is not wanted.

(note the missing semicolon)

Quoting the HTML 4.0 specification:

Note. In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.

Well, thanks to my regex HTML parser, I discovered ...

Your parser being regex based is not relevant here. Regular expressions CAN be used to parse HTML. But a set of regexes that parses any HTML document correctly is much less efficient than something based on HTML::Parser. And it is very unlikely that your parser handles every feature that HTML offers.

I guess you could also infer from this post that I pay no mind to my reputation here.

In other words: you're a troll. Please troll elsewhere lest more people feed you.

it did find 283 other errors in http://perlmonks.com/index.pl.

I'm sure your patches are more than welcome. But for now: it works, so let's not break it while trying to fix a problem that isn't there in the first place.

Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

Re^2: Perl Monks hypocrisy (HTML parsing speed)

by tye (Sage) on Jul 16, 2003 at 15:33 UTC

But a set of regexes that parses any HTML document correctly is much less efficient than something based on HTML::Parser.

As I recall, we tried a module based on HTML::Parser but had to drop it because it was way too slow (10-times slower, IIRC). PM uses a single regex to split the HTML into tokens and another regex to deal with filtering attributes in those tokens.
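A minimal sketch of the two-regex idea described above, not the actual PerlMonks code: one regex splits the input into comment, tag, and text tokens, and a second one walks the attributes of each tag and keeps only whitelisted ones. The names tokenize_html and filter_attrs and the %allowed_attr whitelist are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical whitelist; the site's real rules are not shown in this thread.
my %allowed_attr = map { $_ => 1 } qw(href title);

# One regex splits the input into tokens: comments, tags, runs of text,
# and (last alternative) a stray "<" that never gets closed.
sub tokenize_html {
    my ($html) = @_;
    return $html =~ /(<!--.*?-->|<[^>]*>|[^<]+|<)/gs;
}

# A second regex walks the attributes of one opening tag and drops
# everything that is not on the whitelist.
sub filter_attrs {
    my ($tag) = @_;
    return $tag unless $tag =~ /^<(\w+)(.*?)(\/?)>$/s;   # closing tags, comments: unchanged
    my ( $name, $rest, $slash ) = ( $1, $2, $3 );
    my $kept = '';
    while ( $rest =~ /(\w+)\s*=\s*"([^"]*)"/g ) {
        $kept .= qq{ $1="$2"} if $allowed_attr{ lc($1) };
    }
    return "<$name$kept$slash>";
}

my $input  = '<a href="/x" onclick="evil()">R&amp;D</a> notes';
my @tokens = tokenize_html($input);
my $clean  = join '', map { /^</ ? filter_attrs($_) : $_ } @tokens;
print "$clean\n";    # prints: <a href="/x">R&amp;D</a> notes
```

The sketch only understands double-quoted attribute values; handling single-quoted and unquoted values is exactly the kind of border case a hand-rolled tokenizer has to decide on explicitly.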

There are two main reasons that I'd advise someone to not "parse HTML with (a) regex(es)". Performance is not one of them.

The main point is that you probably shouldn't use something like /<td>(.*?)<\/td>/ because there is no way to make that ignore HTML comments that contain similar HTML. The other is that doing so can look easy but end up being very hard, so it is often less work in the long run to use a decent module from the start, even though that often looks like a more difficult approach.
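To make the comment problem concrete, here is a small sketch with invented sample markup. The naive pattern has no notion of comments, so it reports the commented-out cell along with the live one.

```perl
use strict;
use warnings;

# Invented sample: the first <td> is commented out and should be ignored.
my $html = <<'HTML';
<table><tr>
<!-- <td>old price</td> -->
<td>new price</td>
</tr></table>
HTML

# The naive pattern matches inside the comment as well.
my @cells = $html =~ m{<td>(.*?)</td>}g;
print join( ', ', @cells ), "\n";    # prints: old price, new price
```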

Update: The "HTML" that we parse is stuff typed in by our users "by hand". So our HTML parser (the regex) intentionally deals with certain border cases in specific ways. No, it does not strictly follow any one of the many HTML standards we have to choose from.

Re: Re^2: Perl Monks hypocrisy (HTML parsing speed)

by Juerd (Abbot) on Jul 16, 2003 at 19:47 UTC

As I recall, we tried a module based on HTML::Parser but had to drop it because it was way too slow (10-times slower, IIRC).

The speed has everything to do with the complexity of your parser. If you don't need to follow specifics, and don't need to implement the usual browser quirks, a single regex is often a lot more efficient. It's up to the end user to benchmark it. Unfortunately, most novices don't know how to write the regex, don't know how to write an HTML::Parser based script and don't know how to benchmark.
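For anyone who has not benchmarked before, a rough sketch using the core Benchmark module might look like the following. The sample input, the iteration count, and the choice of "count the tags" as the workload are all assumptions; a real comparison should use your own documents and your own parsing task. HTML::Parser is not a core module, so the sketch only includes it when it can be loaded.

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

# Invented sample input: a short fragment repeated to give the timers some work.
my $html = '<p>Some <b>bold</b> text with a <a href="/x">link</a>.</p>' x 50;

my %candidates = (
    # Count tags with a single regex pass.
    regex => sub {
        my $count = () = $html =~ /<[^>]*>/g;
        return $count;
    },
);

# Only benchmark HTML::Parser when it is installed.
if ( eval { require HTML::Parser; 1 } ) {
    $candidates{parser} = sub {
        my $count = 0;
        my $p = HTML::Parser->new(
            api_version => 3,
            start_h     => [ sub { $count++ }, 'tagname' ],
            end_h       => [ sub { $count++ }, 'tagname' ],
        );
        $p->parse($html);
        $p->eof;
        return $count;
    };
}

# Run each candidate 1000 times quietly, then print a comparison table.
my $results = timethese( 1000, \%candidates, 'none' );
cmpthese($results);
```

The numbers will vary with the machine, the Perl version, and above all the input, which is the point being made above.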

Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

Re:x2 Perl Monks hypocrisy (&semi;)

by grinder (Bishop) on Jul 16, 2003 at 17:54 UTC

Note. In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.
As you can see, the semicolon is recommended, not required.

Sorry, I have to side with Wassercrats on this one. Just because you can, sometimes, doesn't mean you should. It is easy to get that semi-colon in there. I've known a number of browsers over the years that never correctly rendered an entity lacking a semi-colon. Either they let it go through textually, or ate the remaining characters up to the end of the line.

Even Mozilla had this problem up until a year or so ago. If you can count on a semi-colon being required, you simplify the parsing greatly. That SGML merely recommends the semi-colon is not a good basis for choosing to omit it. SGML has all sorts of markup minimisation shortcuts available because, at the time, people were paid by the keystroke to key stuff in and there were no fancy GUI editors around. Plus, it's just more comfortable to be able to omit needless stuff.

This made the job of writing an SGML parser a Herculean undertaking. James Clark is about the only person who really pulled it off.

A much more reasonable comparison would be to consider XML. There, the trailing semi-colon is mandatory. This is because Tim Bray and the team that created XML wanted something that was easy to parse. Easier than full SGML in any case, and in comparison to that they succeeded admirably.

I realise that the problem is difficult for Perlmonks. It would be feasible to make sure that any HTML generated directly by Everything is well-formed, but this does not take into account what passes for HTML typed in by the site's population.

Argh, just thinking about &, &amp, &amp; and R&D and what Everything makes of them makes my brain hurt :)
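A small sketch of why a mandatory semi-colon simplifies things. The strict decoder needs one unambiguous pattern. The lenient decoder has to guess where the entity name ends, here by trying the known names as prefixes. That prefix guessing imitates what forgiving browsers tend to do; it is not how a conforming SGML parser works, and the tiny %entity table and both sub names are assumptions for illustration.

```perl
use strict;
use warnings;

# Tiny illustrative entity table; real HTML defines many more.
my %entity = ( amp => '&', lt => '<', gt => '>', quot => '"' );

# Strict (XML-style): a reference always ends in ";", so one pattern suffices.
sub decode_strict {
    my ($text) = @_;
    $text =~ s/&(\w+);/exists $entity{$1} ? $entity{$1} : "&$1;"/ge;
    return $text;
}

# Lenient (browser-style guess): the ";" may be missing, so try the known
# names as prefixes, longest first.
sub decode_lenient {
    my ($text) = @_;
    my $names = join '|', sort { length($b) <=> length($a) } keys %entity;
    $text =~ s/&($names);?/$entity{$1}/g;
    return $text;
}

my $sample = 'R&amp;D, R&ampD and AT&amplifier';
print decode_strict($sample),  "\n";    # prints: R&D, R&ampD and AT&amplifier
print decode_lenient($sample), "\n";    # prints: R&D, R&D and AT&lifier
```

The lenient version repairs the sloppy "R&ampD" but mangles "AT&amplifier", which is the sort of thing that makes one's brain hurt.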

_____________________________________________
Come to YAPC::Europe 2003 in Paris, 23-25 July 2003.

Re: Perl Monks hypocrisy
by PodMaster (Abbot) on Jul 16, 2003 at 07:39 UTC

One big reason for people liking Perl is that it's a quick, compact language. Why then is this the only of a gazillion message boards (that I know of) that requires the use of tags for something as simple as a line break? Can't anyone who maintains this board figure out how to add an auto-line-break feature?

... Well, thanks to my regex HTML parser ...

2. What couldn't you do with an existing HTML parser that you had to roll your own?
Reusing proven tools improves productivity. I fail to see why the perlmonks should help people debug regexes for html parsing any more than they should help someone roll their own CGI.pm. It's just a waste of time.

update: You want autobreaking, make like a good perlmonk and suggest the feature effectively. I for one would not like it one bit, cause I've been formatting my posts by hand for 2-3 years now, and I ain't gonna change any time soon.

MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
** The third rule of perl club is a statement of fact: pod is sexy.

Re: Perl Monks hypocrisy
by katgirl (Hermit) on Jul 16, 2003 at 13:02 UTC

Though the w3 validator didn't catch that one, it did find 283 other errors in http://perlmonks.com/index.pl.

:)

Re: Perl Monks hypocrisy
by crenz (Priest) on Jul 16, 2003 at 10:36 UTC

On a factual level: Yes, the people who maintain this board have enough brains to figure out an auto-line-break (or "insert <p> tags") feature. That's what I infer from my dealings with them. So my guess is that not having this feature is a deliberate choice, rather than dumbness. A lot of message boards resort to inventing their own tags to allow for formatting. For a site like this one that caters to programmers, I find it an excellent choice to just give them all the expressiveness they want by letting them use what they already know: HTML.

Regarding the validity of the HTML generated -- PM has grown over the years, and little bugs can be found here and there. You can help the site maintainers by giving appropriate feedback.

On a personal level: Looking at your post, I get the impression that you are trying to pretend that you want PM to improve, but actually you want to show that you know things better than those who are too dumb to figure out auto-line-break features. Or maybe you are retaliating against someone who told you not to parse HTML using regexes. Well, hopefully I'm wrong.

At PM, we tend to not take ourselves or Perl too seriously. That helps in maintaining good relationships and actually getting work done.

Re^2: Perl Monks hypocrisy (choice)

by tye (Sage) on Jul 16, 2003 at 15:07 UTC

So my guess is that not having this feature is a deliberate choice, rather than dumbness.

Please see Re: Convert Text to HTML Checkbox (POD).

Re: Perl Monks hypocrisy

by Abigail-II (Bishop) on May 28, 2004 at 13:59 UTC

For a site like this one that caters to programmers, I find it an excellent choice to just give them all the expressiveness they want by letting them use what they already know: HTML.

I don't think anyone knows the language being used on Perlmonks prior to arriving here.

Abigail

Re: Re: Perl Monks hypocrisy

by Sandy (Curate) on May 27, 2004 at 23:08 UTC

For a site like this one that caters to programmers, I find it an excellent choice to just give them all the expressiveness they want by letting them use what they already know: HTML.

Sometimes I feel a little out of place. Many monks are doing web-based stuff, but little-ole me has never (I repeat, never) done anything with a web except browse it, or sweep it away.

However, I did go and learn some rudimentary HTML, and javascript, just so I could be like the cool kids.

Once I got used to the <p> tags, I was ok with it. Don't know how to indent quotes from other people though (like the quote above). Help anyone?

Anyways, back to the main topic (??) Problem with wrap-around text is sometimes I insert a <CR> and sometimes not (my mail messages usually look stupid). At least this way I have to think about what I am doing.

UPDATE: Added the <blockquote>. Thanks diotalevi.

Re: Re: Re: Perl Monks hypocrisy

by diotalevi (Canon) on May 27, 2004 at 23:22 UTC

<blockquote>Don't know how to indent quotes from other people though (like the quote above). Help anyone?</blockquote>

Like that.

Re: Perl Monks hypocrisy
by chromatic (Archbishop) on Jul 16, 2003 at 16:26 UTC

Can't anyone who maintains this board figure out how to add an auto-line-break feature?

No, we're all too stupid.

Seriously. Round-tripping is hard. Do you convert everything into a canonical form in the database and render it into HTML for display and then back to the poster's preferred form for editing? How do you deal with quirks and mistakes? Is it important to guarantee that the post is preserved as originally typed? There's also backwards compatibility to deal with, some 270,000 nodes that are stored as HTML fragments already.

If you have an easy solution, I'm all ears. I've been doing this long enough that I don't believe in many easy solutions, though.

Re: auto-line-break

by simonm (Vicar) on Jul 17, 2003 at 00:03 UTC

For what it's worth, I think tye's post sets out a plausible road map for dealing with this issue:

The new functionality remains inactive if the text of a post contains anything that looks like tagged markup. (This could be a conservative test, such as /\<\w+/.)
The auto-markup process should be paragraph oriented rather than line oriented, except that indented lines should get code treatment. (Similar to POD.)
Rather than round-trip conversions, the function would be applied in one direction, at the time the message is being edited. On the preview page, if the post doesn't seem to contain any tags, show an auto-markup version on the bottom of the page, and give the user an "Auto-Markup" button that applies the markup and shows the preview again.

Am I sweeping too many details under the carpet, or might this only require a handful of lines of new code?

# In site init code somewhere
use HTML::FromText;
# Turn on paragraph, indented-code, URL and email handling
my %text2html_options = map { $_ => 1 } qw(paras blockcode urls email);
...
# In preview form command handler
if ( $op eq "auto-format" ) {    # string comparison, not assignment
  $doctext = text2html( $doctext, %text2html_options );
}
...
# At bottom of preview page
if ( $doctext !~ /\<\w+/ ) {     # nothing that looks like a tag
  my $markup = text2html( $doctext, %text2html_options );
  if ( $doctext ne $markup ) {
    # qq{} lets the double quotes in the HTML attributes pass unescaped
    print qq{<hr>If the automatic formatting below looks correct, you can }
        . qq{apply it with <input type="submit" name="op" value="auto-format" />}
        . qq{. <p>$markup};
  }
}
...
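For readers without HTML::FromText at hand, the road map above (conservative tag test, paragraph-oriented markup, indented blocks treated as code) can also be sketched with core Perl alone. This only illustrates the idea and makes no claim about what HTML::FromText does internally; the sub names escape_html and auto_markup are hypothetical.

```perl
use strict;
use warnings;

# Escape the characters that are special in HTML.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: leave the text alone if it already looks marked up,
# otherwise wrap paragraphs in <p> and indented blocks in <pre>.
sub auto_markup {
    my ($text) = @_;
    return $text if $text =~ /<\w+/;              # conservative "has tags" test

    my @out;
    for my $block ( split /\n\s*\n/, $text ) {    # blank lines separate blocks
        next unless $block =~ /\S/;
        if ( $block =~ /^[ \t]/ ) {               # indented: treat as code
            push @out, '<pre>' . escape_html($block) . '</pre>';
        }
        else {
            $block =~ s/\s+\z//;                  # drop trailing whitespace
            push @out, '<p>' . escape_html($block) . '</p>';
        }
    }
    return join "\n", @out;
}

my $post = "First paragraph,\nstill the first.\n\n    my \$x = 1 & 2;\n\nLast one.\n";
print auto_markup($post), "\n";
```

For the sample post this yields one <p> spanning two lines, a <pre> holding the indented statement with its ampersand escaped, and a final <p>.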

Re: Perl Monks hypocrisy
by dws (Chancellor) on Jul 16, 2003 at 16:58 UTC

Maybe hypocrisy is too strong a word, ...

When you find yourself questioning your choice of title, especially in your opening sentence, change the title. Otherwise, it looks like you're asking "did I really mean to do that?" after throwing a turd. Either use greater care when choosing your words, or mean what you say.

Re: Re: Perl Monks hypocrisy

by jonnyfolk (Vicar) on Jul 16, 2003 at 18:21 UTC

Well, his actual words were:

Maybe hypocisy is too strong a word

So maybe Wassercrats should write himself a spellcheck to go along with the "regex HTML parser" :)

Re: Re: Re: Perl Monks hypocrisy

by rob_au (Abbot) on Jul 16, 2003 at 23:31 UTC

So maybe Wassercrats should write himself a spellcheck to go along with the "regex HTML parser" :)

And write the spellchecker as a (very long) regular expression :-)

perl -le 'print+unpack"N",pack"B32","00000000000000000000001001110010"'

Re: Perl Monks hypocrisy
by talexb (Chancellor) on Jul 16, 2003 at 15:21 UTC

I guess you could also infer from this post that I pay no mind to my reputation here.

Again with the XP bit. This is sooo tiring. If the 283 errors on the front page are bugging you, sign up to become a PM developer and spend your time improving the universe, rather than whining and complaining. And your argument would have carried more water if your own site passed validation (as previously noted).

This site works very well. I find it's a fantastic resource. If you think it's stupid, badly formatted and poorly programmed, you are, of course, entitled to your own opinion, no matter how stupid. Does this mean you're on your way then?

Have a cheery day.

--t. alex

Life is short: get busy!

Re: Perl Monks hypocrisy
by cfreak (Chaplain) on Jul 16, 2003 at 13:41 UTC

I would like an auto-break feature as well. Maybe something like on /. where you can choose HTML formatted or plain text (the plain text autobreaks but still allows links and things like bold and italic)

However I'm not sure ranting about it is the right way to get such a feature implemented. I mean, if you want to complain about hypocrisy, complain about people who flame the newbies rather than trying to point them in the right direction; that's hypocrisy, since someone at some point probably helped them. A technical issue is not hypocrisy. In fact I'm willing to bet that most people find it to be a feature, and it probably also saves the server processing power.

As for the HTML parsing with regexes: It is a fact that a regex cannot catch all possible valid HTML; a true parser can, and there are parsers available. Just like every other suggestion for a module on this site, suggesting an HTML parser is done to save the user from re-inventing a wheel. This is consistent: use the right tool for the job.

Lobster Aliens Are attacking the world!

Re: Perl Monks hypocrisy
by Aristotle (Chancellor) on Jul 16, 2003 at 15:29 UTC

Makeshifts last the longest.

Re: Perl Monks hypocrisy
by chunlou (Curate) on Jul 16, 2003 at 18:47 UTC

If auto line break were to be implemented, the board would probably have to offer the input text box in two flavors: Basic (where you have the auto line break) and Advanced (where you do your own formatting).

It is the same reason some people use Frontpage, Dreamweaver, etc. to mark up a website; some use simply Notepad or something plain.

In reality, auto line break could be one of those problems that seem trivial to a person but are not necessarily trivial to a computer. If you mean "line break" as literally "\n" wherever it appears, that's easy (but it will break a lot of other HTML code, such as a table, unless someone enters the HTML table all on one line).

If you mean "line break" as "paragraph", that actually could be very hard. It might look "obvious" to the human eye what a paragraph is, but it's very hard for, say, an HTML parser to distinguish because it can only read data, not content.
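The literal "\n" case is easy to try out. In this sketch (the sample post is invented) the conversion is a one-line substitution, but it also drops <br> tags between the rows of a hand-written table.

```perl
use strict;
use warnings;

# Invented sample post: one line of prose followed by a multi-line table.
my $post = "Results so far:\n<table>\n<tr><td>a</td></tr>\n<tr><td>b</td></tr>\n</table>";

# Literal auto-line-break: every newline becomes <br>.
( my $naive = $post ) =~ s/\n/<br>\n/g;
print "$naive\n";

# Count the <br> tags that ended up inside the table markup.
my ($table) = $naive =~ m{(<table>.*</table>)}s;
my $stray = () = $table =~ /<br>/g;
print "stray <br> tags inside the table: $stray\n";    # prints 3
```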

As to the second "hypocrisy" or inconsistency, that's a good catch. But since the site has been up and running for a long while, not sure those errors or warning messages matter.

Eventually, the content of the site that people see is more important than the code behind the site that people don't see.

Re: Re: Perl Monks hypocrisy

by Anonymous Monk on May 27, 2004 at 16:26 UTC

Eventually, the content of the site that people see is more important than the code behind the site that people don't see.

Re: Perl Monks hypocrisy
by Jenda (Abbot) on Jul 16, 2003 at 21:39 UTC

I guess the closest to your auto-line-break that would have any chance of being implemented here would be to be able to write the nodes in POD. But the performance problem chromatic talks about is very valid. This would mean that the nodes written in POD would have to be converted to HTML each time they are to be displayed or they'd be converted to HTML when submitted and be presented as HTML to the user if he ever tries to modify it.

Of course the PerlMonks system could store both versions of such nodes, but I think that would be just a waste of space.

Jenda
Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
-- Rick Osborne

Edit by castaway: Closed small tag in signature

Re: Perl Monks hypocrisy
by vnpandey (Scribe) on Jul 18, 2003 at 22:51 UTC
