http://www.perlmonks.org?node_id=118835

I've been looking at improving PMSI - Perl Monks Snippets Index, namely by fetching each individual snippet to acquire its date of creation, in order to create an index page of snippets per month or per year, as per stefan k's suggestion. So I whipped up a quick script using a LWP::UserAgent object with a callback to send the received content to be parsed on the fly by an HTML::Parser object.

It became apparent pretty quickly that the information I needed was in the first returned chunk. For the remaining chunks there was nothing left to do. (Of course, a future enhancement could be to count the number of follow-ups, but that's another story). It seemed to me that this was pretty inefficient, and an unnecessary drag on the sorely overloaded Monastery server.

So I started pondering how I could interrupt the download once I had received the information I needed. I wasn't sure that it was possible, but at least I had the source to hack in a solution if need be. I had visions of plumbing the depths of socket wizardry with a kluge of a global variable to take down the connection, and... um...

After spelunking around for a few minutes (by tracing where my callback was being passed), I came across LWP::Protocol::http which contains a sub named collect which does the deed of fetching the bytes (at least to as low a level as I cared about). There I found the following code, (which I've roughly paraphrased):

if($cb) { while ($content = &$collector, length $$content) { eval { &$cb($$content, $response, $self); }; if ($@) { chomp($@); $response->header('X-Died' => $@); last; } } }

There it was, all I had to was to die in my callback, and the connection would be cancelled. I hacked up the following code in about 10 minutes just to prove to myself that this was the case:

#! /usr/bin/perl -w # bloat.cgi use strict; print <<HEAD; Content-Type: text/html <html><head>bloat.cgi -- a humungous web page</head><body bgcolor="#ff +ffff"> HEAD print qq{<p class="foobar" align="right" name="$_">$_</p>\n} for( 1 .. + 10000 ); print '</body></html>'; __END__

(Note: I made up that really long <p> tag to see whether it was broken across chunk boundaries. If it is, HTML::Parser appears to hide that ugliness -- more power to it if it does). And then I read that back with the following (note how I die in a callback)

#! /usr/bin/perl -w use strict; use LWP::UserAgent; use HTTP::Request; use HTML::Parser; my $chunk = 0; my $p = HTML::Parser->new( start_h => [ \&begin, 'tagname,attr' ], default_h => [ \&content, 'text' ], end_h => [ \&end, 'tagname' ], ); my $ua = LWP::UserAgent->new; my $req = HTTP::Request->new(GET => 'http://localhost/cgi-bin/bloat.cg +i' ); my $res = $ua->request($req, \&cb); $p->eof; sub cb { my $received = shift; ++$chunk; $p->parse( $received ); } sub begin { my $element = shift; my $r = shift; print "received <$element"; print qq{ $_="$r->{$_}"} foreach keys %$r; print "> at chunk $chunk\n"; } sub content { my $content = shift; print "received [$content] at chunk $chunk\n"; ########################### die if $content eq '123'; # ########################### } sub end { my $element = shift; print "received </$element> at chunk $chunk\n"; } __END__

That seems to work pretty well. Checking the web server logs, I see the following lines appear:

127.0.0.1 - - [15/Oct/2001:10:40:14 +0200] "GET /cgi-bin/bloat.cgi HTT +P/1.0" 200 527879 "-" "lwp-request/1.39" 127.0.0.1 - - [15/Oct/2001:10:40:18 +0200] "GET /cgi-bin/bloat.cgi HTT +P/1.0" 200 116807 "-" "libwww-perl/5.53"

Quod erat demonstrandum. Don't let them take my Open Source away.

--
g r i n d e r