Update: Fixed a code typo that Abigail-II pointed out, thus making those comments seem a bit odd if you hadn't seen the typo.
Update 2: You can download a slightly updated version of the module. This version does not emit any warnings and I fixed a bug in the tests for Regexp::Token::HTML.
In my continuing quest to make regular expressions that match tokens, I've created the Regexp::Token module. The above link is to the download and not to the CPAN, as it's too early even for the CPAN. It throws tons of warnings, the documentation is not complete, and I need to change the interface to allow different matching semantics for different tokens in the same class (if desired).
A short example of how this module is used:
    my $p_token = Regexp::Token::HTML->create_token('<p name="" class="">');
    my $p_tag   = Regexp::Token->create($p_token);

    my $html = <<END_HTML;
    <h1>testing</h1>
    <p name="goo" class="ber">
    <p CLASS=baz name='easy'>
    <h1>end test</h1>
    END_HTML

    my ($result) = $html =~ /((?:$p_tag )+)/;
    my $two_tags = q{<p name="goo" class="ber">
    <p CLASS=baz name='easy'>
    };
    is($result, $two_tags, '... and we should be able to capture token text');
The above code actually works and is included in the tests, though it throws tons of warnings. Feel free to download the module, hack on it and tell me what you think.
POD follows. I'm leery of writing much more documentation until I get the interface stable, but reading the code and the tests, combined with the POD below, should clear things up (though this is some pretty strange code).
NAME
Regexp::Token - Perl extension for matching tokens instead of characters
SYNOPSIS
    my $regex = Regexp::Token->create($token);
    # $text holds the string to search
    $text =~ /foo(${regex})bar/;
    print $1;
ABSTRACT
This module allows the programmer to create arbitrary tokens and match them using regular expressions. Requires Perl 5.6 or better.
DESCRIPTION
Token Interface
Regexp::Token requires a token to be passed to its create method. This token must have at least three methods that can be called on it:
- to_string()
This method must return the exact text used to create the token. This is
what will actually be used when matching in the regular expression.
- identifier()
This must return a string used to match tokens. Two tokens are considered to
match if they return identical strings from their identifier methods.
- create_token()
When the token regex is first encountered in a regex, the remaining portion of
the regular expression that is to match will be fed to the token, which will
then create a new token based upon that string. That token in turn will be
matched against the token that created it using the identifier method and,
if they match, the created token's to_string method will be called and this
text will be added to the regular expression.
Really, it's simpler than it looks.
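To make the interface concrete, here is a minimal sketch of a class that provides the three methods. The package name My::WordToken, its hash layout, and the case-insensitive comparison are all hypothetical and are not part of Regexp::Token. It also assumes that create_token receives a string and may be called either on the class or on an existing token.

```perl
use strict;
use warnings;

# Hypothetical token class for illustration only; not part of Regexp::Token.
# A token is a run of word characters, compared case-insensitively.
package My::WordToken;

# Build a token from the leading word of $text.
# Returns nothing if $text does not start with a word character.
sub create_token {
    my ($invocant, $text) = @_;
    my $class = ref($invocant) || $invocant;
    my ($word) = $text =~ /\A(\w+)/ or return;
    return bless { text => $word }, $class;
}

# The exact text the token was created from.
sub to_string  { return $_[0]{text} }

# Two tokens match when their identifiers are identical strings.
sub identifier { return lc $_[0]{text} }

package main;

my $wanted = My::WordToken->create_token('Perl');
# Called on the existing token, as a token-aware regex presumably would.
my $found  = $wanted->create_token('PERL rocks');

if ($found && $found->identifier eq $wanted->identifier) {
    print "matched: ", $found->to_string, "\n";
}
```

The design point is that identifier() decides whether two tokens are "the same", while to_string() preserves the original spelling so that the matched text can be captured unchanged.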
Sample using HTML tokens
Regexp::Token::HTML is bundled with this package. It creates tokens for HTML tags, and these tokens conform to the previously described interface. The identifier() method currently returns a string with the type of HTML tag followed by the attribute names, lower-cased and in sorted order. For example:
<input type="text" NAME="foobar">
Will return the following identifier:
$identifier eq 'input name type';
Thus, every HTML "input" tag with type and name attributes (and no other attributes) will be considered identical.
Here's a sample that will match a paragraph tag (values are not supplied because they are superfluous):
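As a rough illustration of that rule, the following sketch derives such an identifier from a single start tag. The function tag_identifier and its regexes are hypothetical and are not taken from Regexp::Token::HTML, which may parse tags quite differently; it only assumes attributes are written as name=value with quoted or bare values.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's actual code: lower-case the tag
# name and the attribute names, sort the attribute names, and join
# everything with single spaces. Attribute values are ignored.
sub tag_identifier {
    my ($tag) = @_;
    my ($name, $rest) = $tag =~ /\A<\s*(\w+)(.*?)\/?>\z/s
        or return;
    # Collect attribute names; values may be double-quoted, single-quoted or bare.
    my @attrs = $rest =~ /([\w-]+)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/g;
    return join ' ', lc($name), sort map { lc $_ } @attrs;
}

my $id = tag_identifier('<input type="text" NAME="foobar">');
print "$id\n";    # prints "input name type"
```

Because values are discarded, a pattern tag such as `<p name="" class="">` and a real tag such as `<p CLASS=baz name='easy'>` both reduce to the same identifier under this sketch.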
    my $p_token = Regexp::Token::HTML->create_token('<p name="" class="">');
    my $p_tag   = Regexp::Token->create($p_token);

    my $html = <<END_HTML;
    <h1>testing</h1>
    <p name="goo" class="ber">
    <p CLASS=baz name='easy'>
    <h1>end test</h1>
    END_HTML

    my ($result) = $html =~ /((?:$p_tag )+)/;
    my $two_tags = q{<p name="goo" class="ber">
    <p CLASS=baz name='easy'>
    };
    is($result, $two_tags, '... and we should be able to capture token text');
Currently, it does just that, but it throws plenty of warnings in the process. I'll fix those later.
EXPORT
None by default.
SEE ALSO
AUTHOR
Curtis "Ovid" Poe, <poec@yahoo.com>
COPYRIGHT AND LICENSE
Copyright 2003 by Curtis "Ovid" Poe
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
BUGS
Lots, I'm sure. Email me.
CAVEATS
This is very alpha software. Its interface will change.
Further, this module uses fork() internally because the tokens supplied to the create() method might use regular expressions themselves. If that happens while the token is embedded in a regex, it will screw up the first regex, so I fork a new process to ensure that the extra regexes don't clash.
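For readers unfamiliar with the technique, here is a minimal sketch of running regex-using code in a child process and reading its result back. The helper name run_isolated and the use of a pipe to return the result are assumptions made for illustration; the module's own fork() handling may look nothing like this.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: run $code in a forked child and return the string
# it produces, so any regexes used inside $code execute in a separate
# process from a match that is in progress in the parent.
sub run_isolated {
    my ($code) = @_;

    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {                       # child process
        close $reader;
        my $result = $code->();
        print {$writer} (defined($result) ? $result : '');
        close $writer;                     # flushes the pipe
        POSIX::_exit(0);                   # skip END blocks and destructors
    }

    close $writer;                         # parent process
    my $output = do { local $/; <$reader> };
    close $reader;
    waitpid $pid, 0;
    return defined($output) ? $output : '';
}

# The regex below runs entirely in the child.
my $name = run_isolated(sub { my ($n) = '<P class=x>' =~ /<(\w+)/; lc $n });
print "$name\n";    # prints "p"
```

The obvious cost is one process per call, which is part of why this belongs under caveats.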
Further, the documentation isn't complete. See the tests for more info.
Cheers,
Ovid
New address of my CGI Course.
Re: Regexp::Token -- Use regular expressions to match tokens
by Abigail-II (Bishop) on Aug 25, 2003 at 08:04 UTC
by Ovid (Cardinal) on Aug 25, 2003 at 13:18 UTC