http://www.perlmonks.org?node_id=298436

I suppose this is a bit of an expert question.
perlvar states that "[...] the value of "$/" is a string, not a regex.". Which is sad. And no longer strictly true.

Doing regex matching on streams is tricky at best and in order to work really well in all cases, it would require a different regular expression engine than perl's.

So I wrote File::Stream. Not implementing a regular expression engine (insert maniac laughter here), but implementing "regexes on streams" by means of progressive buffering -> matching -> buffer expansion -> matching... (see below for some comments on inherent problems with this approach)
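To make the buffering -> matching -> buffer expansion cycle concrete, here is a minimal sketch of such a loop. It is not File::Stream's actual code: `next_record`, its arguments, and the block size are hypothetical names chosen for illustration, and an in-memory filehandle stands in for a real stream.

```perl
use strict;
use warnings;

# Hypothetical helper (not the File::Stream API): return the next record from
# $fh, where records are separated by the regex $sep. $bufref keeps unread
# data between calls; $blocksize is the amount read per buffer extension.
sub next_record {
    my ($fh, $sep, $bufref, $blocksize) = @_;
    $blocksize ||= 1024;
    while (1) {
        # Try the separator against what has been buffered so far.
        if ($$bufref =~ $sep) {
            my $record = substr($$bufref, 0, $-[0]);
            substr($$bufref, 0, $+[0]) = '';    # drop record and separator
            return $record;
        }
        # No match yet: extend the buffer and try again.
        my $got = read($fh, my $chunk, $blocksize);
        if (!$got) {
            # End of stream: whatever is left is the final record.
            return undef unless length $$bufref;
            my $rest = $$bufref;
            $$bufref = '';
            return $rest;
        }
        $$bufref .= $chunk;
    }
}

my $data = "a , b,c ,d";
open my $fh, '<', \$data or die $!;
my $buf = '';
my @records;
while (defined(my $rec = next_record($fh, qr/\s*,\s*/, \$buf, 4))) {
    push @records, $rec;
}
print join('|', @records), "\n";
```

Note that this naive loop accepts the first match it sees, even when the separator touches the end of the buffer, so it is exposed to the "matches less than it would" problem discussed in this thread.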

With the current implementation, you can already do things like this:

    use File::Stream;
    my $stream = File::Stream->new($filehandle);
    $/ = qr/\s*,\s*/;
    print "$_\n" while <$stream>;

It can also do quite a bit more, so consider having a look at the module's synopsis, the pasting of which is considered a waste of screen space here. A few important problems, however, remain.

Most importantly, infinite regexes on streams tend to introduce infinite strings into your memory. Too bad we don't live in the ideal Turing machine world, but this can't be helped.
Furthermore, given that regexes are used on the current buffer, they may match less than they would if the next X bytes were also part of the buffer. Like the former issue, this likely cannot be fixed for good.

  • Is there a robust, pure-Perl way of inspecting regular expressions for possibly infinite constructs? Anything non-extreme but involving XS?
  • A regex might match on the current buffer contents but would match more if it could see further. This might be halfway fixed by reading in another block from the stream and performing the match again, repeating until the match stays the same over n read operations. Weirdness? Heuristic? Or a fix?
  • Is it possible to achieve usage like this:
    use File::Stream::Improved;
    $/ = qr/regex/;
    my @records = <HANDLE>; # where HANDLE might also be $handle
    The significant difference to the currently working code is that $handle/HANDLE needn't be a File::Stream tied handle, but may be just any filehandle.
    Any ideas?
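
The "repeat until the match stays the same over n read operations" idea from the second question could be sketched roughly as follows. This is only an illustration of the heuristic, not the module's code: `stable_match` and its parameters are hypothetical, and the heuristic can still be fooled by a separator longer than n blocks.

```perl
use strict;
use warnings;

# Hypothetical helper: locate $re in the stream, but accept a match only once
# its start and end offsets have stayed identical across $n consecutive
# buffer extensions, or once the stream has ended.
sub stable_match {
    my ($fh, $re, $bufref, $blocksize, $n) = @_;
    my ($prev, $unchanged) = ('', 0);
    while (1) {
        # Record where the regex currently matches ('' means no match yet).
        my $cur = $$bufref =~ $re ? join(',', $-[0], $+[0]) : '';
        $unchanged = ($cur ne '' && $cur eq $prev) ? $unchanged + 1 : 0;
        return split /,/, $cur if $cur ne '' && $unchanged >= $n;

        # Extend the buffer; at end of stream the current answer is final
        # (an empty list if nothing matched).
        my $got = read($fh, my $chunk, $blocksize);
        return split /,/, $cur unless $got;
        $$bufref .= $chunk;
        $prev = $cur;
    }
}

my $data = "a,      b and more text";
open my $fh, '<', \$data or die $!;
my $buf = '';
my ($start, $end) = stable_match($fh, qr/\s*,\s*/, \$buf, 3, 1);
print "separator spans offsets $start to $end\n";
```

With a block size of 3, the first successful match covers only the comma and one space; the sketch keeps reading until the separator's extent stops changing.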

Steffen

Re: Regexes on Streams
by tilly (Archbishop) on Oct 11, 2003 at 17:50 UTC
    It would be far better if Perl had internal support of some kind for this, but look at the following:
    (?=\z(?{ die "end of string" })|)
    This convoluted construct will cause the match to die if you reach the end of string at a given point in the RE. You can trap this in an eval. Take an RE and sprinkle these heavily and you can guarantee that if the RE reached the end of your current text at any point during the RE match, then you will find out about it and know to grab more text. You will need to tokenize the RE carefully to figure out where to insert the end of string tests. (And don't forget to remove them if you know that there is no more input to add!)
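
A small sketch of how the probe might be used, assuming a reasonably modern perl (5.18 or later, where dying inside a regex code block is trapped cleanly by eval). The names `$probe`, `$sep` and `try_match` are made up for this example, and the probe is placed by hand after the trailing `\s*` rather than by tokenizing the regex.

```perl
use strict;
use warnings;
use re 'eval';    # allow code blocks in interpolated patterns on older perls

# The end-of-string probe: dies if the regex engine is standing at the end
# of the buffer when it gets here, and matches the empty string otherwise.
my $probe = qr/(?=\z(?{ die "end of string\n" })|)/;

# A separator with the probe after the part that could keep matching.
my $sep = qr/\s*,\s*$probe/;

# Hypothetical helper: classify a buffer as 'match', 'no match' or 'need more'.
sub try_match {
    my ($buf) = @_;
    my $ok = eval { $buf =~ $sep ? 1 : 0 };
    if (!defined $ok) {
        return 'need more' if $@ eq "end of string\n";
        die $@;    # some other error: pass it on
    }
    return $ok ? 'match' : 'no match';
}

print "$_\n" for try_match("a, "), try_match("a,   b"), try_match("abc");
```

When the result is 'need more', the caller would append another block to the buffer and retry; once the stream is exhausted, it would fall back to the plain regex without the probe.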

    I looked around for a simpler way to do this. I didn't find one. It would be ideal if there was a regexp modifier to set "end of string" behaviour. But nobody seems to have implemented that...

      Genius! Madness!
      This is such a good idea I'm jealous I didn't have it. No, wait. It's so mind-bogglingly hackish I'm glad... Whatever, it's just a very cool hack.
      I'll play with it and see how I can make the buffer extension work with it. The current implementation of the module (not on CPAN yet) features a somewhat simpler approach that requires that a match stay exactly the same before and after a buffer extension. Thus, if users are knowledgeable enough to use regexes that match delimiters shorter than the block size they set for each buffer extension, they're *fairly* safe.
      Anyway, I like the (?{}) approach better even if it's not going to work well. Just for the weirdness of it. :-)

      Steffen
        I didn't have it either. Ilya did. (Or at least had a trivial variation on it.)
Re: Regexes on Streams
by liz (Monsignor) on Oct 11, 2003 at 11:09 UTC
    Some thoughts from your questions:

    Is there a robust, pure-Perl way of inspecting regular expressions for possibly infinite constructs? Anything non-extreme but involving XS?

    Maybe Mark-Jason Dominus' Rx module will allow you to inspect a regular expression enough to find out about infinite constructs.

    ...by reading in another block from the stream and reperforming the match

    I would set a sort of "timeout" value here, I guess "dataout" value would be more appropriate. ;-)
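
One way to read the "dataout" idea is as a cap on how much data may be buffered before giving up. The sketch below assumes that reading; `match_with_dataout` and its parameters are hypothetical and not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: keep extending $$bufref from $fh until $re matches,
# but give up once $max_bytes or more have been buffered without a match.
# Returns 1 on a match, 0 if the stream ended first.
sub match_with_dataout {
    my ($fh, $re, $bufref, $blocksize, $max_bytes) = @_;
    until ($$bufref =~ $re) {
        die "dataout: no match within $max_bytes bytes\n"
            if length($$bufref) >= $max_bytes;
        my $got = read($fh, my $chunk, $blocksize);
        return 0 unless $got;    # stream ended without a match
        $$bufref .= $chunk;
    }
    return 1;
}

my $data = ('x' x 100) . ';';

# With a 32-byte limit the search is abandoned long before the semicolon.
open my $fh1, '<', \$data or die $!;
my $buf1 = '';
my $limited = eval { match_with_dataout($fh1, qr/;/, \$buf1, 8, 32) };
my $error = $@;
print "limited search: ", (defined $limited ? "ok" : $error);

# With a generous limit the same search succeeds.
open my $fh2, '<', \$data or die $!;
my $buf2 = '';
my $found = match_with_dataout($fh2, qr/;/, \$buf2, 8, 1000);
print "generous search: ", ($found ? "found" : "not found"), "\n";
```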

    ...$handle/HANDLE needn't be a File::Stream tied handle, but may be just any filehandle.

    Maybe you could achieve this by pushing a PerlIO handler on the handle?

    Liz

Re: Regexes on Streams
by Abigail-II (Bishop) on Oct 11, 2003 at 02:34 UTC
    Well, this idea isn't new. It has (of course) been discussed on p5p. And it was never implemented because no one could find a solution to the problems you list. What are you going to do with /.*;/, which basically asks you to find the last semi-colon in the stream?

    Abigail

      What are you going to do with /.*;/,...

      When matching a stream, I would let this cause an execution error. In my mind, it's similar to division by 0. And that also causes an execution error.

      Liz

      Nothing. Anybody asking for any *last* thing in a stream is asking for trouble! Streams are of infinite length by concept, which is something one should be aware of before trying to combine streams and regexes.

      So in brief: /.*;/ should yield the whole stream in-memory. Which is exactly what the author of the regex asked for.
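
A tiny illustration of why /.*;/ cannot be answered before the stream ends: the end of the match can move every time more data arrives. The chunks here are made-up stand-ins for blocks read from a stream (without newlines, since `.` does not match a newline unless /s is used).

```perl
use strict;
use warnings;

my @chunks = ('a;b', ';c;', 'd');    # hypothetical blocks from a stream
my ($buf, @ends) = ('');
for my $chunk (@chunks) {
    $buf .= $chunk;
    # Greedy .* runs to the last semicolon seen so far.
    push @ends, $+[0] if $buf =~ /.*;/;
}
print "match ends at: @ends\n";    # prints "match ends at: 2 6 6"
```

Only after the final chunk is it known that offset 6 really was the last semicolon, which is why the whole stream ends up in memory.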

      Steffen
Re: Regexes on Streams (a partial solution?)
by BrowserUk (Patriarch) on Oct 12, 2003 at 02:14 UTC

    Starting with tilly's idea, and attempting to generalise it, I came up with this.

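    #! perl -slw
    use strict;
    use re 'eval';

    sub Re_Stream {
        my( $re_user, $extend ) = @_;
        die "Usage: Re_Stream( regex, coderef )"
            unless defined $re_user and ref $extend eq 'CODE';

        # At the end of the buffer, call the extender: match (emptily) if it
        # added data, fail outright if not. Otherwise try the user's regex.
        return qr[
            (?: \Z (?(?{ $extend->() })|(?!)) )
            |
            $re_user
        ]x;
    }

    my $buf = 'abcdefghijklmnopqrstuvwxyz';
    my $c = 'A';

    # Appends 100 copies of the current letter, then advances it; returns
    # false once $c has rolled over past 'Z' to 'AA'.
    sub extend{ $buf .= ($c++) x 100; return length $c < 2 }

    my $re_stream = Re_Stream( qr[(..)(...)], \&extend );
    print $re_stream;

    my $i = 0;
    print "${ \++$i }: $1|$2" while $buf =~ m[$re_stream]g;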

    The sub Re_Stream() takes a regex and a coderef. The regex can be any regex (in theory:), and the coderef should be a function that will extend the stream beyond its current limit. This function should return true if it has extended the stream, and false if there is no more to come.

    As coded, the while loop running the regex will continue to match against the stream until the extender function returns false. I'm not sure if this is progress. The upside is that you no longer have to inspect the user's regex in order to work out where to insert the code block to extend the buffer. In fact, you don't have to modify the user regex at all. However, there are a couple of problems with it as it stands.

    1. If the match crosses the boundary of the buffer being extended, a null match is returned.
    2. Any attempt I made to shorten or pre-truncate the string, i.e. to discard some part of the front of the string that had already been processed, seemed to "confuse" the regex.
    3. As is, it requires use re 'eval'; which may or may not be a problem.

    I've only made a half-hearted attempt at fixing these so far, but thought that I would throw it open to see if anyone else can take it further, or dismiss it as unworkable.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    Hooray!

Re: Regexes on Streams
by Aristotle (Chancellor) on Oct 13, 2003 at 19:59 UTC
    Does YAPE::Regex do what you need for RE inspection?

    Makeshifts last the longest.

      Yes it does, thank you. It's in the new version on CPAN (1.10).