perlmeditation
tsee
<p>
There's been a meditation called [Regexes on Streams] recently that deals with the evil I've been doing in the [cpan://File::Stream] module and I received much valuable feedback. (Much thanks to those who offered their advice.) I suggest you have a look at the above node first because this is what has happened to the module since.
</p>
<readmore>
<p>
The find() method is used internally by readline(). This is where the weirdness happens.
</p>
<p>
Starting out with getting the function arguments (@terms is the set of strings/regexes/objects to incorporate into out regular expression). "use re 'eval'" is needed for the ${} regex construct which we'll be using to do action-at-a-distance when reaching the end of the current string buffer. $End_Of_String is a global that will be incremented on encountering the end of the buffer in the regex. Lexicals did not work here due to some ${} weirdness.
<code>
sub find {
my $self = shift;
my @terms = @_;
use re 'eval';
$End_Of_String = 0;
</code>
Transforming the input strings/regexes/objects into compiled regexes first (the second map). Then, every regex is deparsed using YAPE::Regex and reconstructed as a string with '(?:\z(?{$End_Of_String++})(?!)|)' after every token. The result is then compiled.
<code>
my @regex_tokens =
map {
my $yp = YAPE::Regex->new($_);
my $str = '';
my $token;
while ($token = $yp->next()) {
$str .= $token->string() .
'(?:\z(?{$End_Of_String++})(?!)|)';
}
qr/$str/;
}
map {
if ( not ref($_) )
{
qr/\Q$_\E/;
}
elsif ( ref($_) eq 'Regexp' ) {
$_;
}
else {
my $string = "$_";
qr/\Q$string\E/;
}
} @terms;
</code>
Some more on that weird piece of regular expression:<br/> '(?:\z(?{$End_Of_String++})(?!)|)'<br/>
We match for \z, the end of the string. If that isn't currently the case, the | way at the end of the regex comes in and matches the empty string. Voila - effectively a no-op unless at the end of the string. If the \z matches, the code in (?{}) is executed (that is, $End_Of_String incremented).
</p>
<p>
Now we construct one final regex with capturing parens around each of the bunch of we munged above. We compile it.
<code>
my $re = '(' . join( ')|(', @regex_tokens ) . ')';
my $compiled = qr/$re/s;
</code>
We match against the buffer. If either the $End_Of_String var was set via the regex or we didn't match the string at all, the global's reset and we append more data to the buffer. Repeat until match.<br/>
Once we have a match, we determine which capturing group matched and remove everything up to after the match from the buffer. Then, the string pre-match and the match itself are returned from find().
<code>
while (1) {
my @matches = $self->{buffer} =~ $compiled;
if ($End_Of_String or not @matches) {
$End_Of_String = 0;
return undef unless $self->fill_buffer();
next;
}
else {
my $index = undef;
for ( 0 .. $#matches ) {
$index = $_, last if defined $matches[$_];
}
die if not defined $index; # sanity check
my $match = $matches[$index];
$self->{buffer} =~ s/^(.*?)\Q$match\E//s or die;
return ( $1, $match );
}
}
}
</code>
</p>
<p>
Can you spot any bugs? (I found one while writing the above.)
</readmore>
<p>
Steffen
</p>