Applying regexes to streams: Perl enhancement idea
by tye (Sage) on Jan 07, 2003 at 21:22 UTC
A recent discussion with Limbic~Region in the CB highlighted the possibility for a neat enhancement for Perl regexes. I hope this can make it into Perl 6 regexes, but I'll mostly talk in terms of Perl 5 regexes as I figure nearly everyone can understand that.
It would be nice to be able to efficiently use regexes on streams. You can do it now (though not that efficiently) so long as you set yourself a maximum match size. But that isn't always easy to figure out. This is probably part of why $/ is a fixed string and not a regex.
[ For the rest of the discussion, note that when I say "string", I am thinking of $string in code like this: ( $match )= $string =~ /($pattern)/; ]
For an example of matching against a stream where a maximum match size can be determined, see Re: Shell Script Woes (tye's try). Note how I compute the longest possible match size and ensure that the string being given to the regex is always at least that long (until I get near the end of the stream).
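The bounded approach can be sketched roughly as follows. This is not the code from the linked node, only a minimal illustration. The helper name bounded_stream_matches and its $max_len and $chunk parameters are made up for this sketch, and it assumes a pattern with no anchors or lookaround that never matches the empty string.

```perl
use strict;
use warnings;

# Hypothetical helper: return every match of $re found in the stream $fh,
# assuming no single match is longer than $max_len characters.
sub bounded_stream_matches {
    my ($fh, $re, $max_len, $chunk) = @_;
    $chunk ||= 4096;
    my ($buf, $eof, @found) = ('', 0);
    while (1) {
        # Keep at least $max_len characters buffered until the stream ends.
        while (!$eof && length($buf) < $max_len) {
            my $got = read($fh, $buf, $chunk, length($buf));
            $eof = 1 unless $got;
        }
        if ($buf =~ /$re/) {
            my ($from, $to) = ($-[0], $+[0]);
            if (!$eof && $from + $max_len > length($buf)) {
                # The match might be cut off by the end of the buffer:
                # drop the text before it, then top up and try again.
                substr($buf, 0, $from, '');
                next;
            }
            push @found, substr($buf, $from, $to - $from);
            substr($buf, 0, $to, '');    # discard through the end of the match
        }
        elsif ($eof) {
            last;
        }
        else {
            # No match yet: only the last $max_len - 1 characters could still
            # be the start of a match, so everything before them can go.
            my $keep = $max_len - 1;
            substr($buf, 0, length($buf) - $keep, '') if length($buf) > $keep;
        }
    }
    return @found;
}

my $data = "foo123bar4567baz89";
open my $fh, '<', \$data or die "cannot open in-memory stream: $!";
print join(',', bounded_stream_matches($fh, qr/\d+/, 4, 3)), "\n";   # 123,4567,89
```

The buffer never holds much more than $max_len plus one chunk, but the caller has to know $max_len in advance, which is exactly the limitation discussed here.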
My idea for an enhancement would be a new option that would tell the regex engine that reaching the end of the string would cause the regex to note the current pos() and then fail. I'll do this with /z just to make japhy happy (well, it also makes a bit of sense).
So the regex would fail, returning control to your code. Your code could then decide how much more data from the stream to append to the string and perhaps trim data from the front of the string that we know won't be part of any future match.
So you could then regex against a stream like so:
This allows things to be efficient (the regex engine can start matching at the position where it left off last time) and doesn't require some arbitrary limit on maximum match size to be specified.
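Since /z is only a proposal, current Perl 5 can only approximate that loop. The following is a rough emulation, not the proposed feature. The helper next_stream_match and its arguments are invented for this sketch. It guesses "the engine hit the end of the string" from two signals: the match failed, or the match ran right up to the end of the buffer. It assumes a pattern that never matches the empty string, and it can be fooled by patterns with lookahead or with alternatives that would only succeed once more data arrives.

```perl
use strict;
use warnings;

# Hypothetical helper: return the next match of $re from the stream $fh,
# using $$bufref as the growing buffer; returns undef when the stream is done.
sub next_stream_match {
    my ($fh, $re, $bufref, $chunk) = @_;
    $chunk ||= 4096;
    my $eof = 0;
    while (1) {
        if ($$bufref =~ /$re/) {
            my ($from, $to) = ($-[0], $+[0]);
            # Trust the match only if it stops short of the buffer's end
            # (more data could not extend it) or the stream is exhausted.
            if ($eof || $to < length($$bufref)) {
                my $match = substr($$bufref, $from, $to - $from);
                substr($$bufref, 0, $to, '');   # trim data that is no longer needed
                return $match;
            }
        }
        return if $eof;
        # Stand-in for "the regex hit the end of the string": append more
        # data and rescan the buffer from its start.
        my $got = read($fh, $$bufref, $chunk, length($$bufref));
        $eof = 1 unless $got;
    }
}

my $data = "foo123bar4567baz89";
open my $fh, '<', \$data or die "cannot open in-memory stream: $!";
my $buf = '';
my @found;
while (defined(my $m = next_stream_match($fh, qr/\d+/, \$buf, 4))) {
    push @found, $m;
}
print join(',', @found), "\n";   # 123,4567,89
```

Unlike the /z proposal, this emulation rescans the whole buffer after every read and cannot trim anything until a match is found, so the buffer grows for as long as no match appears.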
Note how //z would mostly be useful when doing //zg in a scalar context so you could recover pos(). Also, if you do //zg and not //zgc, then pos() would be undefined if the regex fails without hitting the end of the string so you could give up on streams that are never going to match no matter how much data you collect (though using such a regex on a stream would be strange).
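The pos() behaviour that this relies on already exists in plain Perl 5, without any /z. A small illustration (variable names are arbitrary):

```perl
use strict;
use warnings;

my $text = "aaa bbb";

my $ok = $text =~ /a+/g;            # scalar-context //g: succeeds, pos() moves to 3
my $after_match = pos($text);

$ok = $text =~ /zzz/gc;             # fails, but /c leaves pos() where it was
my $after_failed_gc = pos($text);

$ok = $text =~ /zzz/g;              # fails; without /c, pos() is reset to undef
my $after_failed_g = pos($text);

printf "after match: %d, after failed //gc: %d, after failed //g: %s\n",
    $after_match, $after_failed_gc,
    defined $after_failed_g ? $after_failed_g : 'undef';
# prints: after match: 3, after failed //gc: 3, after failed //g: undef
```

Under the proposal, a failed //zg that hit the end of the string would leave pos() defined, so the undefined case could be told apart as "this will never match".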
Note that the regex engine does redo some of its work: the work from the last time it incremented pos(). We could avoid this either by saving the entire state of the regex engine (so difficult that it probably requires continuations) or by letting the regex engine call user code to "grow" the string being matched against.
If we go the latter route, I'd still like to be able to tell the regex engine to fail, as implementing features only via callbacks tends to be rather limiting for the coder. For instance, in my example above, providing only the callback solution could require the buffer to grow extremely large when successive matches are very far apart, even when the matches themselves are quite small.
- tye
Updated as described in Re^4: Applying regexes to streams: Perl enhancement idea (bug+fix). Original code inside CODE tags in HTML comments, so the "d/l code" link will fetch both versions of the code.