I've been looking at improving PMSI - Perl Monks Snippets Index, namely by fetching each individual snippet to acquire its date of creation, in order to create an index page of snippets per month or per year, as per stefan k's suggestion. So I whipped up a quick script using an LWP::UserAgent object with a callback that hands the received content to an HTML::Parser object, to be parsed on the fly.

It became apparent pretty quickly that the information I needed was in the first returned chunk. For the remaining chunks there was nothing left to do. (Of course, a future enhancement could be to count the number of follow-ups, but that's another story.) Downloading the rest of the page anyway seemed pretty inefficient to me, and an unnecessary drag on the sorely overloaded Monastery server.

So I started pondering how I could interrupt the download once I had received the information I needed. I wasn't sure that it was possible, but at least I had the source to hack in a solution if need be. I had visions of plumbing the depths of socket wizardry with a kluge of a global variable to take down the connection, and... um...

After spelunking around for a few minutes (by tracing where my callback was being passed), I came across LWP::Protocol::http, which contains a sub named collect that does the deed of fetching the bytes (at least to as low a level as I cared about). There I found the following code (which I've roughly paraphrased):

    if ($cb) {
        while ($content = &$collector, length $$content) {
            eval {
                &$cb($$content, $response, $self);
            };
            if ($@) {
                chomp($@);
                $response->header('X-Died' => $@);
                last;
            }
        }
    }
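To see what the loop paraphrased above does when the callback dies, here is a minimal standalone sketch that needs no network and no LWP. The names make_collector and collect, and the plain hash standing in for the response headers, are made up for this illustration; only the shape of the loop follows the paraphrase.

```perl
use strict;
use warnings;

# Hypothetical stand-in for LWP's collector: each call returns a reference
# to the next chunk, or to an empty string once the data has run out.
sub make_collector {
    my @chunks = @_;
    return sub {
        my $chunk = @chunks ? shift @chunks : '';
        return \$chunk;
    };
}

# Same shape as the paraphrased loop: run the callback inside an eval and
# stop reading as soon as it dies, recording the reason under X-Died.
sub collect {
    my ($collector, $cb) = @_;
    my %header;
    my $content;
    while ($content = $collector->(), length $$content) {
        eval { $cb->($$content) };
        if ($@) {
            chomp(my $err = $@);
            $header{'X-Died'} = $err;
            last;
        }
    }
    return \%header;
}

my @seen;
my $header = collect(
    make_collector('one', 'two', 'three', 'four'),
    sub {
        my $chunk = shift;
        push @seen, $chunk;
        die "got what I needed\n" if $chunk eq 'two';
    },
);

print "chunks seen: @seen\n";
print "X-Died: $header->{'X-Died'}\n";
```

The callback only ever sees the first two chunks; the remaining ones are never requested from the collector, which is the whole point.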

There it was: all I had to do was die in my callback, and the connection would be cancelled. I hacked up the following code in about 10 minutes just to prove to myself that this was the case:

    #! /usr/bin/perl -w
    # bloat.cgi

    use strict;

    print <<HEAD;
    Content-Type: text/html

    <html><head>bloat.cgi -- a humungous web page</head><body bgcolor="#ffffff">
    HEAD

    print qq{<p class="foobar" align="right" name="$_">$_</p>\n} for( 1 .. 10000 );
    print '</body></html>';

    __END__

(Note: I made up that really long <p> tag to see whether it would be broken across chunk boundaries. If it is, HTML::Parser appears to hide that ugliness -- more power to it if it does.) I then read the page back with the following script (note how I die in a callback):

    #! /usr/bin/perl -w

    use strict;
    use LWP::UserAgent;
    use HTTP::Request;
    use HTML::Parser;

    my $chunk = 0;    # number of chunks handed to the callback so far

    my $p = HTML::Parser->new(
        start_h   => [ \&begin,   'tagname,attr' ],
        default_h => [ \&content, 'text' ],
        end_h     => [ \&end,     'tagname' ],
    );

    my $ua  = LWP::UserAgent->new;
    my $req = HTTP::Request->new(GET => 'http://localhost/cgi-bin/bloat.cgi' );
    my $res = $ua->request($req, \&cb);
    $p->eof;

    # called by LWP for each chunk received; feeds it to the parser
    sub cb {
        my $received = shift;
        ++$chunk;
        $p->parse( $received );
    }

    sub begin {
        my $element = shift;
        my $r = shift;
        print "received <$element";
        print qq{ $_="$r->{$_}"} foreach keys %$r;
        print "> at chunk $chunk\n";
    }

    sub content {
        my $content = shift;
        print "received [$content] at chunk $chunk\n";
        ###########################
        die if $content eq '123'; # got what we came for: abort the download
        ###########################
    }

    sub end {
        my $element = shift;
        print "received </$element> at chunk $chunk\n";
    }

    __END__
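On the chunk-boundary question raised in the note above, a quick way to check it without a web server is to feed HTML::Parser a start tag that has been cut in two by hand. This is only a sketch: the split point is arbitrary, and it assumes HTML::Parser holds back an incomplete tag until the rest of it arrives.

```perl
use strict;
use warnings;
use HTML::Parser;

my @tags;    # one entry per start tag reported by the parser

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($tagname, $attr) = @_;
            push @tags, "$tagname name=$attr->{name}";
        },
        'tagname,attr',
    ],
);

# Simulate two network chunks with the <p> tag split in mid-attribute.
$p->parse('<p class="foobar" align="ri');
$p->parse('ght" name="42">42</p>');
$p->eof;

print "start tags reported: ", scalar @tags, "\n";
print "$_\n" for @tags;
```

If the parser reports a single start tag with the name attribute intact, the handlers never need to know where the chunk boundaries fell.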

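That seems to work pretty well. Checking the web server logs, I see the following lines appear (the first is a complete fetch with lwp-request, the second is the script above, which stopped after 116807 of the 527879 bytes):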

    127.0.0.1 - - [15/Oct/2001:10:40:14 +0200] "GET /cgi-bin/bloat.cgi HTTP/1.0" 200 527879 "-" "lwp-request/1.39"
    127.0.0.1 - - [15/Oct/2001:10:40:18 +0200] "GET /cgi-bin/bloat.cgi HTTP/1.0" 200 116807 "-" "libwww-perl/5.53"
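The logs show the early stop from the server's side. On the client's side, the paraphrased loop suggests the reason for the die ends up in the X-Died header of the response, so the script can tell a deliberate stop from a callback that blew up by accident. A sketch under that assumption: the marker text, the helper stopped_on_purpose and the crude regex on the raw chunk are all made up for illustration, and whether your installed LWP still sets X-Died this way is worth checking.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Marker text for a deliberate abort (a name made up for this sketch).
my $STOP = 'enough data';

# True if the response carries an X-Died header holding our marker,
# i.e. the callback ended the transfer on purpose.
sub stopped_on_purpose {
    my $res  = shift;
    my $died = $res->header('X-Died');
    return defined $died && $died eq $STOP;
}

my $ua  = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => 'http://localhost/cgi-bin/bloat.cgi');
my $res = $ua->request($req, sub {
    my $received = shift;
    # Crude check on the raw chunk; it would miss a match that happens
    # to straddle a chunk boundary.
    die "$STOP\n" if $received =~ />123</;
});

if (stopped_on_purpose($res)) {
    print "stopped early, as intended\n";
}
elsif (defined $res->header('X-Died')) {
    print "callback failed: ", $res->header('X-Died'), "\n";
}
else {
    print "no abort recorded: ", $res->status_line, "\n";
}
```

Ending the die message with a newline keeps Perl from appending the file name and line number, so the header can be compared against the marker exactly.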

Quod erat demonstrandum. Don't let them take my Open Source away.

--
g r i n d e r

In reply to LWP::UserAgent and HTML::Parser and the joys of Open Source by grinder
