
As we all know, the canonical example of what not to do with regular expressions is to parse HTML. Unfortunately, this is one of the first things that programmers seem to want to do when they learn regexes. For example, consider the idiocy it took to write this:

$data =~ s{(<a\s(?:[^>](?!href))*href\s*)(&(&[^;]+;)?(?:.(?!\3))+(?:\3)?)([^>]+>)}
 {$1.decode_entities($2).$4}gsei;

I can call that idiocy because I'm the fool who wrote it.

You'll find tons of sites that say it's okay to parse HTML with regexes (even Microsoft, though the link to their site is now broken), but we know better, right? Unfortunately, some of the hacks we use to get various HTML parsers to match the ugly HTML can get pretty ugly. It's not just a matter of designers forgetting to quote attributes, but also of matching a variable number of tags, or maybe some documents close paragraph tags and others don't. Being able to use regular expression semantics on HTML would be pretty nice. Well, now you can (sort of).

When Perl was first released, regular expressions matched bytes. That means you could forget about Unicode. Later, Unicode support was added so you could properly match characters, but after reading Dominus' article about how to build a regular expression engine, it occurred to me that instead of matching bytes or characters, I could match tokens.

What follows is a very simple demonstration of the Token::Regex module that you can download from this link. I'm not putting it on CPAN yet because it's very alpha code and, in any event, it's just a hack on the Regex.pm module that Dominus provided, so I'm not sure that uploading it would be appropriate before asking for his consent (of course, this code is so fragile I'd be embarrassed to upload it).

First, I need to create appropriate tokens. In my scheme, tokens merely need to have an identifier() method to work. So let's create a small class that allows HTML tokens that are tags to produce identical identifiers if and only if they are the same type of token and have identical attribute names (regardless of order or value). In other words, the following tokens will be considered identical for the purpose of a regular expression match:

<p class="foo" name="bar">
<p name="bar" CLASS="ovid has no class">
<p NAME="bar" class="ovid has no class">
<p name="bar" class="ovid has no class">

Here's one way we could code the package:

package HTML::Token;

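sub new {
    my ($class,$token) = @_;
    my $self = bless { token => $token }, $class;
    # Calling identifier() here means a token that lacks the expected
    # methods dies at construction time rather than during a match.
    $self->identifier;
    return $self;
}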

sub identifier {
    my $self  = shift;
    my $token = $self->{token};
    my $attributes = $token->return_attrseq;
    my $tag = $token->return_tag;
    if (ref $attributes eq 'ARRAY') {
        return sprintf "%s %s", $tag, join ' ', sort @$attributes;
    }
    else {
        return $tag;
    }
}
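
To see the identifier scheme in action without installing anything, here is a minimal sketch. StubToken is a hypothetical stand-in for a parser token (it only mimics return_tag and return_attrseq, and it assumes the attribute names arrive already lowercased), and token_identifier just repeats the logic of identifier() above as a plain function. Neither name is part of any real module.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a parser token: just a tag name plus the
# attribute names in the order they appeared (assumed already lowercased).
package StubToken;

sub new {
    my ($class, $tag, @attr_names) = @_;
    return bless { tag => $tag, attrseq => [@attr_names] }, $class;
}
sub return_tag     { $_[0]{tag} }
sub return_attrseq { $_[0]{attrseq} }

package main;

# Same logic as HTML::Token::identifier, written as a plain function.
sub token_identifier {
    my $token      = shift;
    my $attributes = $token->return_attrseq;
    my $tag        = $token->return_tag;
    return $tag unless ref $attributes eq 'ARRAY';
    return sprintf "%s %s", $tag, join ' ', sort @$attributes;
}

my $first  = token_identifier( StubToken->new( p => qw(class name) ) );
my $second = token_identifier( StubToken->new( p => qw(name class) ) );
my $third  = token_identifier( StubToken->new( p => qw(class) ) );

print "first:  '$first'\n";
print "second: '$second'\n";
print "third:  '$third'\n";
print $first eq $second ? "first and second match\n" : "first and second differ\n";
print $first eq $third  ? "first and third match\n"  : "first and third differ\n";
```

Sorting the attribute names is what makes the order irrelevant, and leaving the values out of the identifier is what makes them irrelevant too.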

Now let's create a regular expression array. Every element of the array is either a token or a single regex metacharacter. See the POD for the list of allowed metacharacters.

use HTML::TokeParser::Simple;

my $html = <<END_HTML;
<h1>This is a test</h1>
<p class="foo" name="bar">so what???
END_HTML
my $parser = HTML::TokeParser::Simple->new(\$html);
my @tokens = ();
while (my $token = $parser->get_tag) {
    push @tokens => HTML::Token->new($token);
}
push @tokens => (qw[* . *]); # make the (p) tag zero or more, followed by anything

Note the last line. With this regex engine, all regexes are bound to the beginning of the array and the end of the array. To "unbind" them, you have to include a dot star at the beginning and end.

The above regular expression says "match an opening H1 tag with no attributes followed by a closing H1 tag followed by zero or more P tags with attributes of 'class' and 'name' followed by zero or more of anything" (deep breath).
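
If the anchoring is hard to picture, the following sketch may help. It does not use Token::Regex and says nothing about how that module works internally. Instead it swaps in a different technique: flatten the identifiers into one string and let Perl's own regex engine do the matching, with \A and \z standing in for the implicit binding to both ends of the array. The identifier strings ('h1', '/h1' and so on) and the helper names flatten and atom are made up for illustration, and the trick assumes no identifier contains an angle bracket.

```perl
use strict;
use warnings;

# Wrap every identifier in angle brackets so that token boundaries stay
# unambiguous once the identifiers are joined into a single string.
sub flatten { return join '', map { "<$_>" } @_ }

# One quoted, non-capturing atom per token identifier.
sub atom { my $id = shift; return '(?:' . quotemeta("<$id>") . ')' }

# "Anything": exactly one token, whatever its identifier.
my $any = '(?:<[^<>]*>)';

# h1, /h1, zero or more "p class name", then zero or more of anything.
# \A and \z play the role of the implicit binding to both array ends.
my $pattern = join '',
    '\A', atom('h1'), atom('/h1'), atom('p class name'), '*', $any, '*', '\z';
my $string_regex = qr/$pattern/;

my @with_h1    = ( 'h1', '/h1', ('p class name') x 3, 'h2', '/h2' );
my @without_h1 = ( ('p class name') x 3, 'h2', '/h2' );

for my $case ( [ 'with h1' => \@with_h1 ], [ 'without h1' => \@without_h1 ] ) {
    my ( $label, $ids ) = @$case;
    printf "%-10s %s\n", $label,
        flatten(@$ids) =~ $string_regex ? 'Yes' : 'No';
}
```

Dropping the trailing "anything" star from the pattern would make the first list fail as well, because the h2 tokens would then be left over before the end of the array.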

And to set it up:

use Token::Regex;
my $regex = Token::Regex->new('HTML::Token');
$regex->parse(\@tokens);

We're now ready to match some HTML. Just provide an array ref of tokens to the match() method:

my $tokens = html_tokens(<<END_HTML);
<h1>This is html</h1>
<p name="bar" CLASS="ovid has no class">so what???
<p NAME="bar" class="ovid has no class">so what???
<p name="bar" class="ovid has no class">so what???
<h2>and this is okay</h2>
END_HTML

if ($regex->match($tokens)) {
    print "Yes\n";
}
else {
    print "No\n";
}

$tokens = html_tokens(<<END_HTML);
<p name="bar" CLASS="ovid has no class">so what???
<p NAME="bar" class="ovid has no class">so what???
<p name="bar" class="ovid has no class">so what???
<h2>and this is okay</h2>
END_HTML

if ($regex->match($tokens)) {
    print "Yes\n";
}
else {
    print "No\n";
}


sub html_tokens {
    my $html = shift;
    
    my $parser = HTML::TokeParser::Simple->new(\$html);
    my @tokens;
    while (my $token = $parser->get_tag) {
        push @tokens => HTML::Token->new($token);
    }
    return \@tokens;
}

The first bit of HTML should match, but the second should fail (due to no H1 tag).

As stated previously, this is very fragile code, so using this in production is not a good idea. However, note that it can be used with any tokens that you care to create so long as you create an appropriate identifier() method. Further, backreferences cannot be added to this code and I don't know enough about regular expression engines to rewrite the code to support that. Parentheses are currently only for grouping.
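
To illustrate the "any tokens" point, here is a minimal sketch of a token class that has nothing to do with HTML. Word::Token is a hypothetical name invented for this example; the only thing it takes from the scheme above is that a token must offer an identifier() method. I have not run it through Token::Regex, so treat it as an illustration of the interface rather than a tested recipe.

```perl
use strict;
use warnings;

# Hypothetical token class for plain words: two words count as the same
# token when they differ only in letter case or surrounding punctuation.
package Word::Token;

sub new {
    my ( $class, $word ) = @_;
    return bless { word => $word }, $class;
}

sub identifier {
    my $self = shift;
    # Lowercase the word, then strip leading and trailing non-word characters.
    ( my $id = lc $self->{word} ) =~ s/^\W+|\W+$//g;
    return $id;
}

package main;

my @words = map { Word::Token->new($_) } split ' ', 'Hello, hello WORLD!';
print join( ' | ', map { $_->identifier } @words ), "\n";
```

How much detail goes into the identifier decides how strict a match is: the more you normalize away, the more tokens compare as equal.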

Furthermore, the code is difficult to use and ugly as sin. It's also slow and I don't know that anyone would really have a use for it, but I thought it was a neat hack. Any suggestions for making it easier would be welcome.

Cheers,
Ovid

New address of my CGI Course.

In reply to How to use Regular Expressions with HTML by Ovid
