perlmeditation
grinder
<p>I've been looking at improving [id://117450], namely by fetching
each individual snippet to acquire its date of creation, in order to
create an index page of snippets per month or per year, as per
[stefan k]'s suggestion. So I whipped up a quick script using a
[cpan://LWP::UserAgent] object with a callback to send the received
content to be parsed on the fly by an [cpan://HTML::Parser] object.</p>
<p>It became apparent pretty quickly that the information I needed was
in the first returned chunk. For the remaining chunks there was nothing
left to do. (Of course, a future enhancement could be to count the
number of follow-ups, but that's another story). It seemed to me that
this was pretty inefficient, and an unnecessary drag on the sorely
overloaded Monastery server.</p>
<p>So I started pondering how I could interrupt the download once I had
received the information I needed. I wasn't sure that it was possible,
but at least I had the source to hack in a solution if need be. I had
visions of plumbing the depths of socket wizardry with a kluge of a
global variable to take down the connection, and... um...</p>
<readmore>
<p>After spelunking around for a few minutes (by tracing where my
callback was being passed), I came across
[cpan://LWP::Protocol::http] which contains a sub named
<tt>collect</tt> which does the deed of fetching the bytes (at least to
as low a level as I cared about). There I found the following code,
(which I've roughly paraphrased):</p>
<code>
if($cb) {
while ($content = &$collector, length $$content) {
eval {
&$cb($$content, $response, $self);
};
if ($@) {
chomp($@);
$response->header('X-Died' => $@);
last;
}
}
}
</code>
<p>There it was, all I had to was to <i><tt>die</tt></i> in my callback,
and the connection would be cancelled. I hacked up the following code in
about 10 minutes just to prove to myself that this was the case:</p>
<code>
#! /usr/bin/perl -w
# bloat.cgi
use strict;
print <<HEAD;
Content-Type: text/html
<html><head>bloat.cgi -- a humungous web page</head><body bgcolor="#ffffff">
HEAD
print qq{<p class="foobar" align="right" name="$_">$_</p>\n} for( 1 .. 10000 );
print '</body></html>';
__END__
</code>
<p>(Note: I made up that really long <code><p></code> tag to see whether it was broken across chunk boundaries. If it is, [cpan://HTML::Parser] <i>appears</i> to hide that ugliness -- more power to it if it does). And then I read that back with the following (note how I die in a callback)</p>
<code>
#! /usr/bin/perl -w
use strict;
use LWP::UserAgent;
use HTTP::Request;
use HTML::Parser;
my $chunk = 0;
my $p = HTML::Parser->new(
start_h => [ \&begin, 'tagname,attr' ],
default_h => [ \&content, 'text' ],
end_h => [ \&end, 'tagname' ],
);
my $ua = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => 'http://localhost/cgi-bin/bloat.cgi' );
my $res = $ua->request($req, \&cb);
$p->eof;
sub cb {
my $received = shift;
++$chunk;
$p->parse( $received );
}
sub begin {
my $element = shift;
my $r = shift;
print "received <$element";
print qq{ $_="$r->{$_}"} foreach keys %$r;
print "> at chunk $chunk\n";
}
sub content {
my $content = shift;
print "received [$content] at chunk $chunk\n";
###########################
die if $content eq '123'; #
###########################
}
sub end {
my $element = shift;
print "received </$element> at chunk $chunk\n";
}
__END__
</code>
<p>That seems to work pretty well. Checking the web server logs, I see the following lines appear:</p>
<code>
127.0.0.1 - - [15/Oct/2001:10:40:14 +0200] "GET /cgi-bin/bloat.cgi HTTP/1.0" 200 527879 "-" "lwp-request/1.39"
127.0.0.1 - - [15/Oct/2001:10:40:18 +0200] "GET /cgi-bin/bloat.cgi HTTP/1.0" 200 116807 "-" "libwww-perl/5.53"
</code>
<p><i>Quod erat demonstrandum</i>. Don't let them take my Open Source away.</p>
<small>--</small><h5><tt><font color="#cf1178">g</font> <font color="#bd2188">r</font> <font color="#a73598">i</font> <font color="#904da7">n</font> <font color="#7866b5">d</font> <font color="#6080c2">e</font> <font color="#4999ce">r</font>
</tt></h5>