http://www.perlmonks.org?node_id=286313

Update: Fixed a code typo that Abigail-II pointed out, thus making those comments seem a bit odd if you hadn't seen the typo.

Update 2: You can download a slightly updated version of the module. This version does not emit any warnings and I fixed a bug in the tests for Regexp::Token::HTML.

In my continuing quest to make regular expressions that match tokens, I've created the Regexp::Token module. The above link goes to the download rather than to the CPAN because it's too early even for the CPAN: it throws tons of warnings, the documentation is not complete, and I need to change the interface to allow different matching semantics for different tokens in the same class (if desired).

A short example of how this module is used:

 my $p_token = Regexp::Token::HTML->create_token('<p name="" class="">');
 my $p_tag   = Regexp::Token->create($p_token);

 my $html = <<END_HTML;
 <h1>testing</h1>
 <p name="goo" class="ber">
 <p CLASS=baz name='easy'>
 <h1>end test</h1>
 END_HTML

 # Capture one or more consecutive <p> tokens, each followed by a newline
 my ($result) = $html =~ /((?:$p_tag\n)+)/;
 my $two_tags = qq{<p name="goo" class="ber">\n<p CLASS=baz name='easy'>\n};
 is($result, $two_tags, '... and we should be able to capture token text');

The above code actually works and is included in the tests, though it throws tons of warnings. Feel free to download the module, hack on it and tell me what you think.

POD follows. I'm leery of writing much more documentation until I get the interface stable, but reading the code and the tests, combined with the POD below, should clear things up (though this is some pretty strange code).


NAME

Regexp::Token - Perl extension for matching tokens instead of characters


SYNOPSIS

 my $regex = Regexp::Token->create($token);
 if ( $text =~ /foo(${regex})bar/ ) {
     print $1;
 }


ABSTRACT

This module allows the programmer to create arbitrary tokens and match them using regular expressions. Requires Perl 5.6 or better.


DESCRIPTION

Token Interface

Regexp::Token requires a token to be passed to its create method. This token must have at least three methods that can be called on it:

  • to_string()
    This method must return the exact text used to create the token. This is what will actually be used when matching in the regular expression.

  • identifier()
    This must return a string used to match tokens. Two tokens are considered to match if they return identical strings from their identifier methods.

  • create_token()
    When the token regex is first encountered in a regex, the remaining portion of the regular expression that is to match will be fed to the token, which will then create a new token based upon that string. That token in turn will be matched against the token that created it using the identifier method. If they match, the created token's to_string method will be called and this text will be added to the regular expression.

Really, it's simpler than it looks.
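To make the interface concrete, here is a minimal sketch of a token class that supplies the three methods. My::WordToken is a hypothetical name invented for this example and is not part of the distribution; the sketch also assumes that create_token may be called on an existing token and that it receives the text to tokenize as its only argument. It only shows how two tokens would be compared, not how Regexp::Token drives them.

```perl
use strict;
use warnings;

# Hypothetical token class, for illustration only.
package My::WordToken;

# Build a token from the start of $text.
# Returns nothing when $text does not begin with a word.
sub create_token {
    my ($class, $text) = @_;
    $class = ref $class if ref $class;    # allow calling on an existing token
    return unless defined $text && $text =~ /\A(\w+)/;
    return bless { text => $1 }, $class;
}

# The exact text the token was created from.
sub to_string { $_[0]{text} }

# Tokens with equal identifiers are treated as the same token.
# Here the comparison is simply case-insensitive.
sub identifier { lc $_[0]{text} }

package main;

my $wanted    = My::WordToken->create_token('Perl');
my $candidate = $wanted->create_token('PERL monks');

if ( $candidate && $candidate->identifier eq $wanted->identifier ) {
    print "matched: ", $candidate->to_string, "\n";
}
```

The identifier decides whether two tokens are "the same", while to_string preserves the original text, so a match can be loose about case and still report exactly what was in the input.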

Sample using HTML tokens

Regexp::Token::HTML is bundled with this package. It creates tokens for HTML tags, and these tokens conform to the previously described interface. The identifier() method currently returns a string with the type of HTML tag followed by the attribute names, lower-cased and in sorted order. For example:

 <input type="text" NAME="foobar">

Will return the following identifier:

 $identifier eq 'input name type';

Thus, every HTML "input" tag with type and name attributes (and no other attributes) will be considered identical.
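As a rough illustration of that rule, the sketch below builds the same kind of identifier from a single tag. The function name html_identifier and its regexes are assumptions made for this example; they are not taken from Regexp::Token::HTML, which may well parse tags differently. Attributes without a value (such as a bare "checked") are ignored here for brevity.

```perl
use strict;
use warnings;

# Hypothetical helper: tag name plus sorted, lower-cased attribute names.
sub html_identifier {
    my ($tag) = @_;

    # Split the tag into its name and the raw attribute text.
    my ( $name, $rest ) = $tag =~ /\A<\s*(\w+)([^>]*)>/ or return;

    # Collect attribute names, consuming each value (double-quoted,
    # single-quoted, or bare) so that an '=' inside a value is not
    # mistaken for another attribute.
    my @attrs = map { lc $_ }
        $rest =~ /([\w:-]+)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/g;

    return join ' ', lc $name, sort @attrs;
}

print html_identifier('<input type="text" NAME="foobar">'), "\n";
print html_identifier(q{<p CLASS=baz name='easy'>}), "\n";
```

Both paragraph tags from the test HTML reduce to the identifier 'p class name' under this scheme, which is why they would count as the same token despite different quoting, case, and attribute order.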

Here's a sample that will match a paragraph tag (values are not supplied because they are superfluous):

 my $p_token = Regexp::Token::HTML->create_token('<p name="" class="">');
 my $p_tag   = Regexp::Token->create($p_token);

 my $html = <<END_HTML;
 <h1>testing</h1>
 <p name="goo" class="ber">
 <p CLASS=baz name='easy'>
 <h1>end test</h1>
 END_HTML

 # Capture one or more consecutive <p> tokens, each followed by a newline
 my ($result) = $html =~ /((?:$p_tag\n)+)/;
 my $two_tags = qq{<p name="goo" class="ber">\n<p CLASS=baz name='easy'>\n};
 is($result, $two_tags, '... and we should be able to capture token text');

Currently, it does just that, but it throws plenty of warnings in the process. I'll fix those later.

EXPORT

None by default.


SEE ALSO


AUTHOR

Curtis "Ovid" Poe, <poec@yahoo.com>


COPYRIGHT AND LICENSE

Copyright 2003 by Curtis "Ovid" Poe

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.


BUGS

Lots, I'm sure. Email me.


CAVEATS

This is very alpha software. Its interface will change.

Further, this module uses fork() internally because the tokens supplied to the create() method might use regular expressions themselves. If that happens while the token code is running inside another regex, the inner regexes screw up the outer one, so I fork a new process to ensure that the extra regexes don't clash.
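The general idea of pushing work into a child process can be sketched with Perl's forking open. This is not the module's actual code: run_isolated is a hypothetical helper, and passing the result back as text over a pipe is an assumption made for the example. Whatever regexes the callback runs happen in the child, so the parent's regex state is left alone.

```perl
use strict;
use warnings;

# Hypothetical helper: run $code in a child process and return
# everything it returns, passed back as text through a pipe.
sub run_isolated {
    my ($code) = @_;

    # Forking open: the parent gets a read handle on the child's STDOUT;
    # in the child, open returns 0.
    my $pid = open( my $from_child, '-|' );
    die "fork failed: $!" unless defined $pid;

    if ( $pid == 0 ) {    # child: do the work, print the result, leave
        print $code->();
        exit 0;
    }

    local $/;             # parent: slurp all of the child's output
    my $output = <$from_child>;
    close $from_child;    # also waits for the child to finish
    return $output;
}

# The regex inside the callback runs in the child process only.
my $id = run_isolated( sub {
    my ($tag) = '<P class="x">' =~ /<(\w+)/;
    return lc $tag;
} );
print "$id\n";
```

The cost is a process per call, which is worth keeping in mind if a pattern has to examine many candidate tokens.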

Further, the documentation isn't complete. See the tests for more info.

Cheers,
Ovid

New address of my CGI Course.

Re: Regexp::Token -- Use regular expressions to match tokens
by Abigail-II (Bishop) on Aug 25, 2003 at 08:04 UTC
    my $p_token = Regexp::Token::HTML->create_token('<p name="" class="">');
    my $p_tag = Regexp::Token->create($p_tag);

    Is the last line a typo and should the argument to Regexp::Token::create be $p_token or is there some voodoo magic in Regexp::Token?

    Abigail

      You're right, that's just me making a bit of a typo. It was in the docs and the tests, but I've fixed both and uploaded a new distribution. Thanks for the catch.

      Cheers,
      Ovid
