Re: Regexes on Streams

I thought I'd mention the solution (to the general problem) that Dominus uses in his upcoming book 'Perl Advanced Techniques Handbook'. The usual caveats apply, for appropriate values of 'solution', of course. For more explanation, subscribe to his mailing list; you should then get access to the complete chapter, from which the following code is taken:

#!/usr/bin/perl
use strict;

# Three-argument open, so a filename starting with '>' or '|' is not misread as a mode.
open my $fh, '<', $ARGV[0] or die "Couldn't open '$ARGV[0]': $!";

my $iter = records( blocks($fh), qr/\s*,\s*/ );

while ( defined(my $rec = $iter->()) ) {
    print $rec, "----\n";
}

sub blocks {
    my $fh = shift;
    my $blocksize = shift || 8192;
    sub {
        return unless read $fh, my($block), $blocksize;
        return $block;
    }
}

sub records {
    my $blocks = shift;
    my $terminator = @_ ? shift : quotemeta($/);
    my @records;
    my ($buf, $finished) = ("");
    sub {
        while (@records == 0 && ! $finished) {
            if (defined(my $block = $blocks->())) {
                $buf .= $block;
                my @newrecs = split /($terminator)/, $buf;
                while (@newrecs > 2) {
                    push @records, shift(@newrecs).shift(@newrecs);
                }
                $buf = join "", @newrecs;
            } else {
                @records = $buf;
                $finished = 1;
            }
        }
        return shift(@records);
    }
}

The blocks sub turns a filehandle into an iterator that returns chunks of up to the given block size read from that filehandle. In this context an iterator is a reference to a function which, when called as $iter->(), returns the next item, or undef when the sequence is exhausted. Any other way of creating such a block-producing iterator could be used instead of this sub.
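To make the iterator idea concrete, here is a minimal sketch that repeats blocks from the listing above and drives it by hand. The in-memory filehandle and the block size of 4 are assumptions chosen only to keep the example self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same shape as blocks() in the listing: wrap a filehandle in a closure
# that returns the next chunk, or undef once the handle is exhausted.
sub blocks {
    my $fh = shift;
    my $blocksize = shift || 8192;
    return sub {
        return unless read $fh, my($block), $blocksize;
        return $block;
    };
}

# Assumption for illustration: read from a string instead of a real file.
my $data = "abcdefghij";
open my $fh, '<', \$data or die "Couldn't open in-memory handle: $!";

my $iter = blocks($fh, 4);
my @chunks;
while ( defined(my $chunk = $iter->()) ) {
    push @chunks, $chunk;
}
print join("|", @chunks), "\n";    # prints abcd|efgh|ij
```

The last chunk is shorter than the block size because read simply returns whatever is left.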

The records sub creates an iterator that returns single records, each ending in a terminator, much like the standard Perl input operator <$filehandle> in combination with $/. The difference is that the terminator is a regular expression rather than a fixed string.
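The heart of records is the split with the terminator wrapped in capturing parens, which makes split return the terminators as well as the text between them. A small sketch of that one step, using a made-up buffer, may help:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $terminator = qr/\s*,\s*/;
my $buf = "alpha , beta,gamma";    # hypothetical buffer contents

# With the pattern in capturing parens, split keeps the separators:
# ("alpha", " , ", "beta", ",", "gamma")
my @newrecs = split /($terminator)/, $buf;

# Glue each piece of text to the terminator that follows it. The last
# field has no terminator yet, so it stays behind as the new buffer.
my @records;
while (@newrecs > 2) {
    push @records, shift(@newrecs) . shift(@newrecs);
}
my $leftover = join "", @newrecs;

print "record:   [$_]\n" for @records;    # [alpha , ] and [beta,]
print "leftover: [$leftover]\n";          # [gamma]
```

The leftover is what records carries over into the next block, since its terminator may still be on the way.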

-- Hofmator


Replies are listed 'Best First'.
Re: Re: Regexes on Streams - Revisited! by tilly (Archbishop) on Oct 14, 2003 at 14:50 UTC
I should note that Dominus' solution suffers from exactly the bug that tsee is trying to work around, except worse because he didn't think carefully about what happens if $/ wants to do a greedy match across a block. Furthermore Dominus' code should come with a caveat that it won't work entirely as expected if $/ contains a capturing set of parens. (I already mentioned these issues to him.)
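To illustrate the capturing-parens caveat, here is a minimal sketch. The helper pair_up and the two terminators are made up for the example; pair_up only mimics the pairing loop inside records.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pair each field with the field after it, as the inner loop of records() does.
sub pair_up {
    my @fields = @_;
    my @records;
    while (@fields > 2) {
        push @records, shift(@fields) . shift(@fields);
    }
    return @records;
}

my $buf = "x;y,z";

# Hypothetical terminators: the same alternation, with and without capturing.
my $capturing     = qr/(;|,)/;
my $non_capturing = qr/(?:;|,)/;

# split returns one extra field per capture group, so with $capturing every
# terminator shows up twice and the pairing drifts out of step.
my @bad  = pair_up( split /($capturing)/, $buf );
my @good = pair_up( split /($non_capturing)/, $buf );

print "capturing:     [", join("] [", @bad),  "]\n";    # [x;] [;y] [,,]
print "non-capturing: [", join("] [", @good), "]\n";    # [x;] [y,]
```

Writing the group as (?:...) is enough to avoid the problem in this case.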
Re: Regexes on Streams - Revisited! by Dominus (Parson) on Jan 04, 2004 at 07:09 UTC
Says tilly:

"I should note that Dominus' solution suffers from exactly the bug that tsee is trying to work around, except worse because he didn't think carefully about what happens if $/ wants to do a greedy match across a block."

Now, now. You don't know whether I thought about it carefully; you were not there, and you cannot see into my brain. The code may be wrong, or broken, or whatever. But in my opinion, I did think carefully about it. And on this matter, my opinion is the only informed one.

"(I already mentioned these issues to him.)"

Yes, but not in a way I could understand. I asked you to clarify several points, but I did not receive any reply from you. I think that providing a test case that demonstrated the problem would have been clearer and more helpful than what you did write either in private email or here. Unfortunately, if you have produced one, I have not seen it.

Update: OK, I've now seen the example you posted later on. Thanks very much for posting it. The example was indeed much clearer and more helpful than either of your earlier messages. Can I suggest that since what you were doing here was essentially the same as reporting a bug, that the usual rules of good practice in bug reporting should apply? The test case was a lot more useful than any amount of additional verbiage would have been. Thanks again.

I will try to repair this bug before the book is published.

-- Mark Dominus
Perl Paraphernalia
Re: Re: Regexes on Streams - Revisited! by tilly (Archbishop) on Jan 04, 2004 at 08:36 UTC
If you did not receive a reply from me, it is not because I didn't try to send you one. My "sent messages" folder on Operamail has a copy of the reply that I sent with the exact same code that I posted later here. Operamail has not been entirely reliable though, and I cannot tell whether it reached you.

As for your thinking process, I don't think that I leapt to any wild conclusions. You didn't deal with the bug that tsee was trying to handle, and didn't indicate that you were aware of it. When I pointed out the existence of the bug, you confirmed that you knew of no such issue in your code. Given that, I think it was fair to conclude that you had not carefully thought through this particular boundary case.
Re3: Regexes on Streams - Revisited! by Hofmator (Curate) on Oct 16, 2003 at 07:55 UTC
"I should note that Dominus' solution suffers from exactly the bug that tsee is trying to work around, except worse because he didn't think carefully about what happens if $/ wants to do a greedy match across a block."

I think I'm misunderstanding something here then. I don't see any bug in Dominus' code. To clarify, let me restate the problem: Given an input stream and a terminator (regex) pattern, break the stream into chunks, each ending in something the terminator matches.

Naturally a problem arises when the terminator pattern matches (or contains) something potentially infinite, like `qr/./`. Then you run into memory problems, etc. But where is the bug in that behaviour? The code is doing what it's supposed to do. How can it complete something that is impossible (provided the stream is infinite)? And I don't understand how you could work around that.

"Furthermore Dominus' code should come with a caveat that it won't work entirely as expected if $/ contains a capturing set of parens."

Yes, it should, but his code is a textbook example which should not have to deal with all corner cases because the focus should stay on the problem at hand. In 'real' code I'd do some kind of inspection of the regex in order to catch that. And, btw, what sense is there in putting capturing parens into a terminator pattern?

-- Hofmator
Re: Re3: Regexes on Streams - Revisited! by tilly (Archbishop) on Oct 16, 2003 at 14:38 UTC
You don't see it as buggy that the way that $/ matches a record depends on the layout of the data and your buffer size in some rather complex way, with the only guarantee being that if you suck in the whole stream then it will work as expected?

It isn't enough to say, "The program does what it is coded to." If you are going to promise to allow people to use regexes to match input, then make some attempt to do it consistently, and if you can't then make some attempt to at least fail in an easily explainable way. For instance I can be OK with, "I can produce strange results if $/ is supposed to match more than one block." If you can't deliver it, then don't even seem to promise delivering it.

If you want to attempt to deliver the promise, then one idea is the approach that I suggested at Re: Regexes on Streams, which the code above implements a variation on. If the programmer wants to give a potentially infinite match, you do your best to satisfy them. Perhaps it is `qr/./` and life sucks. Perhaps it is `/[\r\n](?:\s[\r\n]|)/` (ie match end of line and any following blank lines, with either Unix or DOS line endings) and even though it is potentially infinite, with real data it is also pretty sensible and shouldn't be broken too easily. (Note: the $/ example that Dominus used in his chapter was potentially infinite...)

A sample program to play with is the following. Save it and feed it different buffer sizes. The end of record expression is greedy, of size 12. Yet from sizes of 1-17 only twice does it produce the result which Dominus' original description would lead you to expect. And in a longer data example, it would continue to mess up, and there is no simple way to say that it shouldn't.

#! /usr/bin/perl -w
use strict;

sub blocks {
    my $fh = shift;
    my $blocksize = shift || 8192;
    sub {
        return unless read $fh, my($block), $blocksize;
        return $block;
    }
}

sub records {
    my $blocks = shift;
    my $terminator = @_ ? shift : quotemeta($/);
    my @records;
    my ($buf, $finished) = ("");
    sub {
        while (@records == 0 && ! $finished) {
            if (defined(my $block = $blocks->())) {
                $buf .= $block;
                my @newrecs = split /($terminator)/, $buf;
                while (@newrecs > 2) {
                    push @records, shift(@newrecs).shift(@newrecs);
                }
                $buf = join "", @newrecs;
            } else {
                @records = $buf;
                $finished = 1;
            }
        }
        return shift(@records);
    }
}

my $iter_block = blocks(\*DATA, shift || 10);
my $iter_record = records($iter_block, "(?:foo\n)+");
while (my $record = $iter_record->()) {
    print "GOT A RECORD:\n$record\n";
}

__DATA__
hello
foo
foo
foo
world
foo
foo
foo
!

And about using capturing parens into a terminator pattern, there is no real reason for wanting to do so, but it is incredibly easy to do by accident when not thinking about it. If you don't know a fair amount about regexes, you could take a good while to figure out why things look weird and what to do to fix it. (Confession: When writing the above sample code I originally had capturing parens and was somewhat surprised at the results. This is how I became aware of that issue...)
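One conceivable way to make the greedy case less sensitive to block size is sketched below. It is only a sketch, and not necessarily the approach linked above: a terminator match is trusted only once a minimum number of characters follow it in the buffer, or once the input is exhausted. The name records_with_tail and the $min_tail parameter are made up for illustration. The sketch assumes the terminator never matches the empty string and that $min_tail characters of lookahead are enough to reveal any longer match, which is not true for every pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub blocks {
    my $fh = shift;
    my $blocksize = shift || 8192;
    return sub {
        return unless read $fh, my($block), $blocksize;
        return $block;
    };
}

# Hypothetical variant of records(): accept a terminator match only if at
# least $min_tail characters follow it, or if no more input is coming.
sub records_with_tail {
    my ($blocks, $terminator, $min_tail) = @_;
    my ($buf, $finished) = ("", 0);
    return sub {
        while (1) {
            if ($buf =~ /$terminator/
                and ($finished or $+[0] + $min_tail <= length($buf))) {
                # Cut the record, terminator included, off the front of $buf.
                return substr($buf, 0, $+[0], "");
            }
            if ($finished) {
                # No terminator left: hand back the remainder once, then undef.
                return length($buf) ? substr($buf, 0, length($buf), "") : undef;
            }
            if (defined(my $block = $blocks->())) {
                $buf .= $block;
            } else {
                $finished = 1;
            }
        }
    };
}

# Same data and terminator as the sample program, read from a string.
my $data = "hello\nfoo\nfoo\nfoo\nworld\nfoo\nfoo\nfoo\n!\n";
my %seen;
for my $blocksize (1 .. 17) {
    open my $fh, '<', \$data or die "Couldn't open in-memory handle: $!";
    my $iter = records_with_tail(blocks($fh, $blocksize), qr/(?:foo\n)+/, 4);
    my @recs;
    while (defined(my $rec = $iter->())) {
        push @recs, $rec;
    }
    $seen{ join "|", @recs }++;
}
print scalar(keys %seen), " distinct result(s) across block sizes 1-17\n";
```

Here a tail of 4 characters is the length of one "foo\n" unit, so a run of them can no longer be cut short at a block boundary. The price is that the buffer keeps growing for as long as the terminator might still be extended.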
Re5: Regexes on Streams - Revisited! by Hofmator (Curate) on Oct 16, 2003 at 14:49 UTC
Re: Re: Regexes on Streams - Revisited! by tsee (Curate) on Oct 14, 2003 at 17:11 UTC
I'm on the list. Unfortunately, I'm facing theoretical electrodynamics right now, so I haven't been able to read Chapter IX (?) yet and I don't want to without the proper dedication to provide useful feedback. Turns out I should've taken the time a while ago, but as tilly mentioned, MJD's code has some important caveats.

Steffen
