PerlMonks  


I've been looking at improving PMSI - Perl Monks Snippets Index, namely by fetching each individual snippet to acquire its date of creation, in order to create an index page of snippets per month or per year, as per stefan k's suggestion. So I whipped up a quick script using an LWP::UserAgent object with a callback that hands the received content to an HTML::Parser object to be parsed on the fly.

It became apparent pretty quickly that the information I needed was in the first returned chunk. For the remaining chunks there was nothing left to do. (Of course, a future enhancement could be to count the number of follow-ups, but that's another story.) Downloading the rest of the page anyway seemed pretty inefficient to me, and an unnecessary drag on the sorely overloaded Monastery server.

So I started pondering how I could interrupt the download once I had received the information I needed. I wasn't sure that it was possible, but at least I had the source to hack in a solution if need be. I had visions of plumbing the depths of socket wizardry with a kluge of a global variable to take down the connection, and... um...

After spelunking around for a few minutes (by tracing where my callback was being passed), I came across LWP::Protocol::http, which contains a sub named collect that does the deed of fetching the bytes (at least to as low a level as I cared about). There I found the following code (which I've roughly paraphrased):

if ($cb) {
    while ($content = &$collector, length $$content) {
        eval {
            &$cb($$content, $response, $self);
        };
        if ($@) {
            chomp($@);
            $response->header('X-Died' => $@);
            last;
        }
    }
}
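To see that control flow in isolation, here is a minimal sketch with no network involved. The chunk list, the `%header` hash and the callback are hypothetical stand-ins of mine for LWP's internals, not code from the library; only the shape of the loop follows the paraphrase above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins for LWP's internals: a list of chunks and a
# collector that hands them out one at a time as scalar references.
# An empty string signals the end of the data.
my @chunks    = ('first chunk', 'second chunk', 'third chunk');
my $collector = sub { my $c = @chunks ? shift @chunks : ''; return \$c };

my %header;   # plays the part of $response->header(...)
my @seen;     # chunks the callback actually received

# The callback dies as soon as it has what it wants.
my $cb = sub {
    my ($data) = @_;
    push @seen, $data;
    die "got what I needed\n" if $data =~ /second/;
};

# Same shape as the paraphrased loop: run the callback inside eval,
# record the reason in X-Died and stop collecting if it died.
my $content;
while ($content = $collector->(), length $$content) {
    eval { $cb->($$content); };
    if ($@) {
        chomp(my $err = $@);
        $header{'X-Died'} = $err;
        last;
    }
}

print "chunks delivered: ", scalar(@seen), "\n";
print "X-Died: $header{'X-Died'}\n";
```

The third chunk is never requested from the collector, which is the whole point: once the callback dies, nothing more is read.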

There it was: all I had to do was die in my callback, and the connection would be cancelled. I hacked up the following code in about 10 minutes just to prove to myself that this was the case:

#! /usr/bin/perl -w
# bloat.cgi

use strict;

print <<HEAD;
Content-Type: text/html

<html><head>bloat.cgi -- a humungous web page</head><body bgcolor="#ffffff">
HEAD

print qq{<p class="foobar" align="right" name="$_">$_</p>\n} for( 1 .. 10000 );
print '</body></html>';

__END__

(Note: I made up that really long <p> tag to see whether it was broken across chunk boundaries. If it is, HTML::Parser appears to hide that ugliness -- more power to it if it does.) And then I read that back with the following (note how I die in a callback):

#! /usr/bin/perl -w

use strict;
use LWP::UserAgent;
use HTTP::Request;
use HTML::Parser;

my $chunk = 0;

my $p = HTML::Parser->new(
    start_h   => [ \&begin,   'tagname,attr' ],
    default_h => [ \&content, 'text' ],
    end_h     => [ \&end,     'tagname' ],
);

my $ua  = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => 'http://localhost/cgi-bin/bloat.cgi' );
my $res = $ua->request($req, \&cb);
$p->eof;

sub cb {
    my $received = shift;
    ++$chunk;
    $p->parse( $received );
}

sub begin {
    my $element = shift;
    my $r = shift;
    print "received <$element";
    print qq{ $_="$r->{$_}"} foreach keys %$r;
    print "> at chunk $chunk\n";
}

sub content {
    my $content = shift;
    print "received [$content] at chunk $chunk\n";
    ###########################
    die if $content eq '123'; #
    ###########################
}

sub end {
    my $element = shift;
    print "received </$element> at chunk $chunk\n";
}

__END__

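That seems to work pretty well. Checking the web server logs, I see the following lines appear. Judging by the user agents, the first is a complete fetch made with lwp-request and the second is the script above, which was cut off after 116807 of the 527879 bytes: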

127.0.0.1 - - [15/Oct/2001:10:40:14 +0200] "GET /cgi-bin/bloat.cgi HTTP/1.0" 200 527879 "-" "lwp-request/1.39"
127.0.0.1 - - [15/Oct/2001:10:40:18 +0200] "GET /cgi-bin/bloat.cgi HTTP/1.0" 200 116807 "-" "libwww-perl/5.53"
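The abort can also be confirmed from the client side by looking at the X-Died header on the response. The sketch below is one way to try that without a web server. It assumes that LWP's file: scheme handler feeds content through the same collect routine as HTTP, which is my reading of LWP rather than something established above. The temporary file, the die message and the 4096-byte read size are all made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use LWP::UserAgent;

# Build a throwaway local file so that no web server is needed.
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} "line $_\n" for 1 .. 20_000;
close $fh;
my $size = -s $path;

my ($chunks, $bytes) = (0, 0);

my $ua  = LWP::UserAgent->new;
my $res = $ua->get(
    "file://$path",
    ':read_size_hint' => 4096,       # ask for small chunks
    ':content_cb'     => sub {
        my ($data) = @_;
        ++$chunks;
        $bytes += length $data;
        die "got what I needed\n";   # bail out on the very first chunk
    },
);

# If the callback died, LWP should have recorded why in X-Died.
my $died = $res->header('X-Died');
print "read $bytes of $size bytes in $chunks chunk(s)\n";
print "X-Died: ", (defined $died ? $died : '(not set)'), "\n";
```

Checking X-Died after the request is also a cheap way to tell a deliberate early exit apart from a response that simply happened to be short.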

Quod erat demonstrandum. Don't let them take my Open Source away.
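As for the earlier question of whether a tag split across chunk boundaries causes trouble, here is a small sketch that feeds HTML::Parser one start tag in two pieces. The split point is a hypothetical one chosen by hand, not one observed in a real download.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

my @starts;   # one entry per start tag the parser reports

my $p = HTML::Parser->new(
    start_h => [
        sub {
            my ($tag, $attr) = @_;
            # Sort the attribute names so the result is stable.
            push @starts, join ' ', $tag,
                map { qq{$_="$attr->{$_}"} } sort keys %$attr;
        },
        'tagname,attr',
    ],
);

# Hypothetical chunk boundary landing in the middle of the align attribute.
$p->parse('<p class="foobar" ali');
$p->parse('gn="right" name="42">42</p>');
$p->eof;

print "start tags seen: ", scalar(@starts), "\n";
print "<$_>\n" for @starts;
```

The parser holds on to the incomplete tag until the rest arrives, so the start handler fires once with all three attributes intact.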

--
g r i n d e r

In reply to LWP::UserAgent and HTML::Parser and the joys of Open Source by grinder
