There's been a meditation called Regexes on Streams recently that deals with the evil I've been doing in the File::Stream module and I received much valuable feedback. (Much thanks to those who offered their advice.) I suggest you have a look at the above node first because this is what has happened to the module since.

The find() method is used internally by readline(). This is where the weirdness happens.

Starting out with getting the function arguments (@terms is the set of strings/regexes/objects to incorporate into out regular expression). "use re 'eval'" is needed for the ${} regex construct which we'll be using to do action-at-a-distance when reaching the end of the current string buffer. $End_Of_String is a global that will be incremented on encountering the end of the buffer in the regex. Lexicals did not work here due to some ${} weirdness.

sub find { my $self = shift; my @terms = @_; use re 'eval'; $End_Of_String = 0;
Transforming the input strings/regexes/objects into compiled regexes first (the second map). Then, every regex is deparsed using YAPE::Regex and reconstructed as a string with '(?:\z(?{$End_Of_String++})(?!)|)' after every token. The result is then compiled.
my @regex_tokens = map { my $yp = YAPE::Regex->new($_); my $str = ''; my $token; while ($token = $yp->next()) { $str .= $token->string() . '(?:\z(?{$End_Of_String++})(?!)|)'; } qr/$str/; } map { if ( not ref($_) ) { qr/\Q$_\E/; } elsif ( ref($_) eq 'Regexp' ) { $_; } else { my $string = "$_"; qr/\Q$string\E/; } } @terms;
Some more on that weird piece of regular expression:
We match for \z, the end of the string. If that isn't currently the case, the | way at the end of the regex comes in and matches the empty string. Voila - effectively a no-op unless at the end of the string. If the \z matches, the code in (?{}) is executed (that is, $End_Of_String incremented).

Now we construct one final regex with capturing parens around each of the bunch of we munged above. We compile it.

my $re = '(' . join( ')|(', @regex_tokens ) . ')'; my $compiled = qr/$re/s;
We match against the buffer. If either the $End_Of_String var was set via the regex or we didn't match the string at all, the global's reset and we append more data to the buffer. Repeat until match.
Once we have a match, we determine which capturing group matched and remove everything up to after the match from the buffer. Then, the string pre-match and the match itself are returned from find().
while (1) { my @matches = $self->{buffer} =~ $compiled; if ($End_Of_String or not @matches) { $End_Of_String = 0; return undef unless $self->fill_buffer(); next; } else { my $index = undef; for ( 0 .. $#matches ) { $index = $_, last if defined $matches[$_]; } die if not defined $index; # sanity check my $match = $matches[$index]; $self->{buffer} =~ s/^(.*?)\Q$match\E//s or die; return ( $1, $match ); } } }

Can you spot any bugs? (I found one while writing the above.)


Replies are listed 'Best First'.
Re: Regexes on Streams - Revisited!
by Aristotle (Chancellor) on Oct 14, 2003 at 09:00 UTC
    Be aware that switching from die to incrementing a variable means the engine will continue trying to match (potentially expensive) alternatives after the first premature end of string condition. dieing would short circuit immediately. If you want to avoid recompiling once the stream has run out, you could use something like die unless $end_of_stream where the flag is set by the main loop when the respective condition is met.

    Makeshifts last the longest.

      Yes, the bug I mentioned is related to that.
      Anyhow, die-ing out of the regex caused my perl to dump core (win32 multithread 5.6.1 -- ActivePerl), so that's not an option. Incrementing a lexical variable $end_of_stream worked only in *some* cases. Printing something from the (${}) construct worked okay, but the incrementing action-at-a-distance did not work every time. Thus the use of a package variable.

      Back to that bug. The (?!) construct causes the current branch of the match to fail. That's nice but, as far as I can tell, entirely uneffective as this are because of the | at the end of the inserted end-of-string-tracking regex group.
      Instead, one'd want to modify the inserted regular expression to quit trying to match once the code construct is reached. (Ideally, it'd just fail which would make the whole experimental code construct unnecessary.)

      Unfortunately, I'm currently not able to spend much time on finding such a regex. (Read: almost none, I have to continue studying now.)

        Btw, it is probably cleaner to write
        (?:\z(?{ die })|)
        (?(?=\z)(?{ die }))

        It's too bad your Perl segfaults on die from inside a regex.. that doesn't happen for me (5.8.0 Linux nothreads).

        Your troubles with using a lexical are possibly due to these code blocks inside regexes being closures; were you aware of that?

        Unfortunately, there is currently no way to tell the regex engine to fail the entire match immediately, which is why die is necessary. It will work in Perl6, but then, so will matching on streams.. :)

        The only solution is to do what we did in the days of Pascal to cope with the lack of last and friends: nest conditionals. In terms of the pattern matching, that means an attempt to match

        becomes something like
        (?(?!\z) .*? (?(?!\z) abc (?(?!\z) (def)? | (?{ $PREMATURE_INPUT_END++ }) ) | (?{ $PREMATURE_INPUT_END++ }) ) | (?{ $PREMATURE_INPUT_END++ }) )

        Makeshifts last the longest.

Re: Regexes on Streams - Revisited!
by Hofmator (Curate) on Oct 14, 2003 at 14:27 UTC

    I thought I'd mention the solution (to the general problem) Dominus uses in his upcoming book 'Perl Advanced Techniques Handbook' - for appropriate values of solution, of course, the general caveats apply. For more explanation subscribe to his mailing list, you should then get access to the complete chapter, out of which the following code is taken:

    #!/usr/bin/perl use strict; open my $fh, $ARGV[0] or die "Couldn't open '$ARGV[0]': $!"; my $iter = records( blocks($fh), qr/\s*,\s*/ ); while ( defined(my $rec = $iter->()) ) { print $rec, "----\n"; } sub blocks { my $fh = shift; my $blocksize = shift || 8192; sub { return unless read $fh, my($block), $blocksize; return $block; } } sub records { my $blocks = shift; my $terminator = @_ ? shift : quotemeta($/); my @records; my ($buf, $finished) = (""); sub { while (@records == 0 && ! $finished) { if (defined(my $block = $blocks->())) { $buf .= $block; my @newrecs = split /($terminator)/, $buf; while (@newrecs > 2) { push @records, shift(@newrecs).shift(@newrecs); } $buf = join "", @newrecs; } else { @records = $buf; $finished = 1; } } return shift(@records); } }

    The blocks sub creates an iterator out of a filehandle which returns chunks of a certain blocksize from this filehandle. In this context an iterator is a reference to a function, which when called like this $iter->() returns the next item or undef when the sequence is exhausted. Other methods of creating such a block-creating iterator could be used instead of this sub.

    The records sub creates an iterator that returns single records delimited by a terminator - much like the standard perl input iterator <$filehandle> in combination with $/. The difference is that a regular expression is used as a terminator, no longer a fixed length string.

    -- Hofmator

      I should note that Dominus' solution suffers from exactly the bug that tsee is trying to work around, except worse because he didn't think carefully about what happens if $/ wants to do a greedy match across a block.

      Furthermore Dominus' code should come with a caveat that it won't work entirely as expected if $/ contains a capturing set of parens.

      (I already mentioned these issues to him.)

        Says tilly:
        I should note that Dominus' solution suffers from exactly the bug that tsee is trying to work around, except worse because he didn't think carefully about what happens if $/ wants to do a greedy match across a block.
        Now, now. You don't know whether I thought about it carefully; you were not there, and you cannot see into my brain. The code may be wrong, or broken, or whatever. But in my opinion, I did think carefully about it. And on this matter, my opinion is the only informed one.

        (I already mentioned these issues to him.)

        Yes, but not in a way I could understand. I asked you to clarify several points, but I did not receive any reply from you.

        I think that providing a test case that demonstrated the problem would have been clearer and more helpful than what you did write either in private email or here. Unfortunately, if you have produced one, I have not seen it.

        Update: OK, I've now seen the example you posted later on. Thanks very much for posting it. The example was indeed much clearer and more helpful than either of your earlier messages. Can I suggest that since what you were doing here was essentially the same as reporting a bug, that the usual rules of good practice in bug reporting should apply? The test case was a lot more useful than any amount of additional verbiage would have been.

        Thanks again. I will try to repair this bug before the book is published.

        I should note that Dominus' solution suffers from exactly the bug that tsee is trying to work around, except worse because he didn't think carefully about what happens if $/ wants to do a greedy match across a block.
        I think I'm misunderstanding something here then. I don't see any bug in Dominus' code.

        To clarify, let me restate the problem: Given an input stream and a terminator (regex) pattern, break the stream into chunks, each ending in something the terminator matches. Naturally a problem arises when the terminator pattern matches (or contains) something potentially infinite, like qr/.*/. Then you get in memory problems, etc.

        But where is a bug in that behaviour? The code is doing what it's supposed to do. How can it complete something that is impossible (provided the stream is infinite)? And I don't understand how you could work around that.

        Furthermore Dominus' code should come with a caveat that it won't work entirely as expected if $/ contains a capturing set of parens.
        Yes, it should, but his code is a textbook example which should not have to deal with all corner cases because the focus should stay on the problem at hand. In 'real' code I'd do some kind of inspection of the regex in order to catch that. And, btw, what sense is there in putting capturing parens into a terminator pattern?

        -- Hofmator

      I'm on the list. Unfortunately, I'm facing theoretical electrodynamics right now, so I haven't been able to read Chapter IX (?) yet and I don't want to without the proper dedication to provide useful feedback. Turns out I should've taken the time a while ago, but as tilly mentioned, MJD's code has some important caveats.