http://www.perlmonks.org?node_id=224559


in reply to Re: A few random questions from Learning Perl 3
in thread A few random questions from Learning Perl 3

It might be useful to read up a bit on the theory of formal languages. You'll see that there's a whole family of languages, each described by a certain mathematical formalism. Regular languages are an example, and as you can guess they're described by regular expressions. Unfortunately, HTML is not a regular language and hence can not be described by regular expressions since they're just not powerful enough.

By way of example, consider <em>hello beautiful HTML <em>world</em></em>: easy to write a regular expression to get the inner "world", isn't it? Now consider <em>hello <em>beautiful<em>HTML world</em></em></em>, if you want to match something, again you can write a regular expression... as long as you know the maximum number of times the <em>...</em> tags will be embedded.

HTML allows unbounded nesting of tags, so this means that you can't write a general regular expression that describes every possible nesting situation. Regular expression are simply not powerful enough, you'll need at least context free languages, hence a tool such as HTML::Parser or for general cases something like Parse::RecDescent.

Now you can argue:

  1. yeah right, but real world HTML is not that complicated, or
  2. you can fiddle with embedded code and cuts in regular expressions.
As to the first argument: you don't always know this in advance if you don't control the HTML generation yourself, people are bound to do weird things, mostly not even on purpose.
As to the second argument: true, but these are still experimental features (as the docs specify for 5.6.1) and they're not at all obvious to use, even up to the point that it is easier to use a more powerful tool than get the particular regular expression right. (Note from a formal language theory point of view: embedded code, cuts and the like increase Perl "regular expressions" beyond regular languages.)

Given this story, your claim that one can deal with all problems HTML by using regular expressions shows some unwarranted optimism on your part. Obviously there's no reason to believe me, so I'll suggest a number of references on the subject:

And who knows, maybe our own mstone will write a MOPT on the subject one of these days? (Hint, hint ;-)

Just my 2 cents, -gjb-

Update: Thanks TheHobbit for reiterating the points I actually mention in my text if you bother to read it carefully. (?{...}) and /e are called code embedding.