|Just another Perl shrine|
Re^5: Breaking The Rules II (regex HTML)by tye (Cardinal)
|on Jul 03, 2007 at 21:38 UTC||Need Help??|
Oh, it depends on the data. I mostly agree with you, but I also know that it can be quite appropriate to just grab stuff out of the middle with a simpler regex in many cases. Ironically, I find this particularly true of HTML and XML because you are unlikely to find <em>Price:</em> in a place where it doesn't mean the obvious thing. That is, for example, <a href="<em>Price:</em>"> is very unlikely, even if XML (stupidly) doesn't just outright forbid it (it should, IMIO, require the inner angle brackets to be escaped).
But, of course, there are HTML comments and XML CDATA that do allow for such confusing situations (and I find CDATA a pretty pointless feature). In practice, I find that it is often quite safe to discount this risk, however.
My personal theory is that this mantra against using regexes against HTML (etc.) probably originated from more narrow situations where there were repeated attempts by novice coders to do more systematic manipulation of HTML. For example, someone proposing to use
in a web spider is more deserving of being warned to use a proper parser than someone trying to pull one stock price from an internal web site for personal use.
So the warning has some value even in the latter case, but it is more likely to be overstated. I don't see a rash of hot ranting about such things. It is almost always mentioned, which probably annoys some people.
I rarely see people admit that a simpler regex against something like HTML can often be a reasonable solution, yet I've certainly had much practical success doing such over the years, so I consider myself qualified to state that it can work well. But designing such a solution well requires taking this risk into account, so it can be important to point out the risk. And the response to such comments often encourages restressing the point.
So I can see how someone would think that this point is often being made dogmatically. But I don't see any value in trying to lump them into some imaginary group or apply insulting language to them. (: