in reply to Scraping HTML: orthodoxy and reality

Although I agree that choosing for a regexp approach or a context free grammar approach depends on the problem at hand, I'd like to stress that halley made a very important point:

Rules are meant to be broken, but you've to understand them before you can break them... safely.

Although a lot of Monks will know the distiction between a regular language and a context free language (and I'm sure grinder and BrowserUK do), I'm rather sure that some don't. In the latter case, unfortunately those Monks simply don't know the rules and have lots of opportunity to mess up.

I'd like to paraphrase: "a little thinking is a dangerous thing" if the process is not supported by a proper amount background knowledge.

It is possible to approximate a context free grammar with a regular expression, a nice survey article about that has been written by Mark-Jan Nederhof. There are several good books about formal languages, but I'd particularly recommend Sipser's since it is well written and is nice to read.

Conclusion: even if you know the rules, but don't understand them, don't try and break them. More importantly: try and understand the rules you're following.

Just my 2 cents, -gjb-

  • Comment on Re: Scraping HTML: orthodoxy and reality