http://www.perlmonks.org?node_id=284298

As we all know, the canonical example of what not to do with regular expressions is to parse HTML. Unfortunately, this is one of the first things that programmers seem to want to do when they learn regexes. For example, consider the idiocy it took to write this:

$data =~ s{(<a\s(?:[^>](?!href))*href\s*)(&(&[^;]+;)?(?:.(?!\3))+(?:\3)?)([^>]+>)}
          {$1.decode_entities($2).$4}gsei;

I can call that idiocy because I'm the fool who wrote it.

You'll find tons of sites that say it's okay to parse HTML with regexes (even Microsoft, though the link to their site is now broken), but we know better, right? Unfortunately, some of the hacks we use to get various HTML parsers to match the ugly HTML can get pretty ugly. It's not just a matter of designers forgetting to quote attributes, but also of matching a variable number of tags, or maybe some documents close paragraph tags and others don't. Being able to use regular expression semantics on HTML would be pretty nice. Well, now you can (sort of).

When Perl was first released, regular expressions matched bytes. That means you could forget about Unicode. Later, Unicode support was added so you could properly match characters, but after reading Dominus' article about how to build a regular expression engine, it occurred to me that instead of matching bytes or characters, I could match tokens.

What follows is a very simple demonstration of the Token::Regex module that you can download from this link. I'm not putting it on CPAN (but perhaps in the future) because it's very alpha code and, in any event, it's just a hack on the Regex.pm module that Dominus provided, so I'm not sure that my uploading would be appropriate prior to asking for his consent (of course, this code is so fragile I'd be embarrassed to upload it).

First, I need to create appropriate tokens. In my scheme, tokens merely need to have an identifier() method to work. So let's create a small class that allows HTML tokens that are tags to produce identical identifiers if and only if they are the same type of token and have identical attribute names (regardless of order or value). In other words, the following tokens will be considered identical for the purpose of a regular expression match:

<p class="foo" name="bar"> <p name="bar" CLASS="ovid has no class"> <p NAME="bar" class="ovid has no class"> <p name="bar" class="ovid has no class">
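To make that equivalence concrete, here is a minimal sketch that does not depend on any HTML parser. The tag_identifier helper is hypothetical (it is not part of Token::Regex or HTML::TokeParser::Simple); it simply lowercases the names, sorts them and ignores the values, which is the rule described above.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only: build an identifier from a tag
# name and its attribute names, ignoring attribute order, case and values.
sub tag_identifier {
    my ($tag, @attribute_names) = @_;
    my @names = sort map { lc $_ } @attribute_names;
    return join ' ', lc($tag), @names;
}

# The attribute names of the four <p> tags shown above, in their original order.
my @tags = (
    [ 'p', 'class', 'name' ],
    [ 'p', 'name',  'CLASS' ],
    [ 'p', 'NAME',  'class' ],
    [ 'p', 'name',  'class' ],
);

for my $tag (@tags) {
    print tag_identifier(@$tag), "\n";    # prints "p class name" each time
}
```

All four lines of output are the same string, so a matcher that compares identifiers would treat the four tags as the same token.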

Here's one way we could code the package:

package HTML::Token;

sub new {
    my ($class,$token) = @_;
    my $self = bless { token => $token }, $class;
    my $identifier = $self->identifier;
    return $self;
}

sub identifier {
    my $self = shift;
    my $token = $self->{token};
    my $attributes = $token->return_attrseq;
    my $tag = $token->return_tag;
    if (ref $attributes eq 'ARRAY') {
        return sprintf "%s %s", $tag, join ' ', sort @$attributes;
    }
    else {
        return $tag;
    }
}

Now let's create a regular expression array. Every element of the array is either a token or a single regex metacharacter. See the POD for the list of allowed metacharacters.

use HTML::TokeParser::Simple;

my $html = <<END_HTML;
<h1>This is a test</h1>
<p class="foo" name="bar">so what???
END_HTML

my $parser = HTML::TokeParser::Simple->new(\$html);
my @tokens = ();
while (my $token = $parser->get_tag) {
    push @tokens => HTML::Token->new($token);
}
push @tokens => (qw[* . *]); # make the (p) tag zero or more, followed by anything

Note the last line. With this regex engine, all regexes are anchored to both the beginning and the end of the array. To "unanchor" them, you have to include a dot star at the beginning and end. Here only the end gets one, so the match still has to start at the first token.

The above regular expression says "match an opening H1 tag with no attributes, followed by a closing H1 tag, followed by zero or more P tags with attributes of 'class' and 'name', followed by zero or more of anything" (deep breath).

And to set it up:

use Token::Regex;

my $regex = Token::Regex->new('HTML::Token');
$regex->parse(\@tokens);

We're now ready to match some HTML. Just provide an array ref of tokens to the match() method:

my $tokens = html_tokens(<<END_HTML);
<h1>This is html</h1>
<p name="bar" CLASS="ovid has no class">so what???
<p NAME="bar" class="ovid has no class">so what???
<p name="bar" class="ovid has no class">so what???
<h2>and this is okay</h2>
END_HTML

if ($regex->match($tokens)) {
    print "Yes\n";
}
else {
    print "No\n";
}

$tokens = html_tokens(<<END_HTML);
<p name="bar" CLASS="ovid has no class">so what???
<p NAME="bar" class="ovid has no class">so what???
<p name="bar" class="ovid has no class">so what???
<h2>and this is okay</h2>
END_HTML

if ($regex->match($tokens)) {
    print "Yes\n";
}
else {
    print "No\n";
}

sub html_tokens {
    my $html = shift;
    my $parser = HTML::TokeParser::Simple->new(\$html);
    my @tokens;
    while (my $token = $parser->get_tag) {
        push @tokens => HTML::Token->new($token);
    }
    return \@tokens;
}

The first bit of HTML should match, but the second should fail (due to no H1 tag).
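For anyone who wants to play with these semantics without the module, here is a rough sketch using a swapped-in technique: each distinct identifier is mapped to one character, and Perl's own regex engine does the matching on the encoded string. This is not how Token::Regex works internally (that is a hack on Dominus' Regex.pm). The sketch ignores grouping and alternation, and the function names and identifier strings such as '/h1' are assumptions made for illustration.

```perl
use strict;
use warnings;

# Illustration only: one character per distinct identifier string.
my %char_for;

sub encode_token {
    my ($identifier) = @_;
    unless (exists $char_for{$identifier}) {
        my $next = 0x100 + scalar keys %char_for;
        $char_for{$identifier} = chr $next;
    }
    return $char_for{$identifier};
}

# Pattern elements are identifiers, or the metacharacters "." and "*".
sub compile_pattern {
    my @pattern = @_;
    my $source = '';
    for my $element (@pattern) {
        if ($element eq '*' or $element eq '.') {
            $source .= $element;
        }
        else {
            $source .= sprintf '\\x{%x}', ord encode_token($element);
        }
    }
    return qr/\A$source\z/s;    # anchored at both ends of the token list
}

sub matches {
    my ($regex, @identifiers) = @_;
    my $subject = join '', map { encode_token($_) } @identifiers;
    return $subject =~ $regex ? 1 : 0;
}

my $regex = compile_pattern('h1', '/h1', 'p class name', '*', '.', '*');

my $first  = matches($regex, 'h1', '/h1', ('p class name') x 3, 'h2', '/h2');
my $second = matches($regex, ('p class name') x 3, 'h2', '/h2');

print $first  ? "Yes\n" : "No\n";    # prints "Yes"
print $second ? "Yes\n" : "No\n";    # prints "No": no leading h1 token
```

Because the compiled pattern is wrapped in \A and \z, the second list fails for the same reason as the second HTML snippet: nothing before the P tags absorbs the missing H1.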

As stated previously, this is very fragile code, so using this in production is not a good idea. However, note that it can be used with any tokens that you care to create so long as you create an appropriate identifier() method. Further, backreferences cannot be added to this code and I don't know enough about regular expression engines to rewrite the code to support that. Parentheses are currently for grouping.
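As a sketch of what "any tokens" could look like, here is a hypothetical Log::Token class (not part of the module, and the log format is an assumption) whose identifier() is the severity of a log line. Two records count as the same token whenever their levels match, whatever the rest of the line says.

```perl
use strict;
use warnings;

# Hypothetical token class for illustration: the identifier is the log level.
package Log::Token;

sub new {
    my ($class, $line) = @_;
    return bless { line => $line }, $class;
}

sub identifier {
    my $self = shift;
    my ($level) = $self->{line} =~ /\[(\w+)\]/;
    return defined $level ? uc($level) : 'UNKNOWN';
}

package main;

my @tokens;
for my $line ('[info] server started', '[INFO] listening', '[error] disk full') {
    push @tokens, Log::Token->new($line);
}

print join(' ', map { $_->identifier } @tokens), "\n";    # prints "INFO INFO ERROR"
```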

Furthermore, the code is difficult to use and ugly as sin. It's also slow and I don't know that anyone would really have a use for it, but I thought it was a neat hack. Any suggestions for making it easier would be welcome.

Cheers,
Ovid

New address of my CGI Course.

Re: How to use Regular Expressions with HTML
by Anonymous Monk on Aug 16, 2003 at 12:50 UTC

    As we all know, the canonical example of what not to do with regular expressions is to parse HTML.

    It always bugs me when I see people say this. It's one of those self-defeating generalizations that just confuse things, because people observe that, taken literally, it often isn't true.

    If I have a static piece of HTML, especially one that is machine generated and/or simply structured, I can easily munge it and extract what I need with a regex or two and a bit of logic. This will take far less time than using HTML::Parser or HTML::TokeParser or HTML::TreeBuilder or your tokenizer here.

    On the other hand it is very difficult to parse any arbitrary page using the same approach. In fact it is usually trivial to reverse engineer a regex based parser to construct an HTML snippet that will break the parser.

    Anyway, my point is that parsing arbitrary HTML is hard to do with regexes. On occasion, however, a regex can be just the thing you need to rip the essential data out of some specific web page or HTML report. If you are only going to run the extractor once, then sometimes proper parsing is just too big a hammer to get out of the box. Accordingly, I'd prefer to see that line rephrased.

    :-)

      Here are a few other generalizations:

      • Use strict.
      • Don't reinvent the wheel.
      • Don't use goto.
      • Don't optimize up front.
      • OO modules shouldn't export anything.

      Those are all great ideas and Perl programmers would be better off if they lived by them. That being said, I've broken every one of those rules and will happily do so in the future, if need be. The important thing is that I understand the reasoning behind those things and try to live by them.

      From what I can see from your post, you have the same opinion about HTML that I do, but you spent a lot of time qualifying it. I have that sort of attitude regarding my above list of generalizations, but I'd never get a single post finished if I was forced to make all of those qualifications. I toss out the generalizations first and then list exceptions only if needed.

      In short, I'm not arguing with you, but for most situations that I encounter, whipping out regular expressions for HTML is a bad idea and encouraging programmers to follow that practice would be an even worse idea.

      Cheers,
      Ovid

        I finished reading Christopher Alexander's The Timeless Way of Building today. Alexander writes about architecture, but his ideas, such as patterns, have been adopted by software developers.

        Your list above (use strict, don't reinvent the wheel, etc.) is basically a list of Perl patterns, practices that should exist in well written programs.

        Although we learn good habits by following rules, we ultimately derive those rules from observing what we find good. Patterns, or best practice, summarise our experiences and allow us to share them with others.

        In his last chapter, Alexander notes that another place can be without the patterns which apply to it, and yet still be alive: we should follow the spirit of the rules we lay down, not the letter. So paradoxically you learn that you can only make a building live when you are free enough to reject even the very patterns which are helping you once you understand the patterns well.

        Tim Bray uses Perl's regular expressions to parse XML and you use regexps to parse HTML. I don't anticipate doing either any time soon, because the general problems I encounter fit the solution of using existing CPAN modules, and because I don't consider myself knowledgeable enough about such things to break the rules yet.

        One should only break the rules when one understands why the rule exists.

        Like Ovid, I have broken every one of those rules, and more. (Personally, I love playing with soft references in production code, but I'm masochistic.) But, I will follow those rules in 99.9% of my code. The point is that most programmers shouldn't parse HTML with regexes most of the time. Heck, most shouldn't do it all of the time. And, if you do it, it should be modularized, packaged, and then never messed with again. :-)

        ------
        We are the carpenters and bricklayers of the Information Age.

        The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

        Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.