LWP::UserAgent and HTML::Parser and the joys of Open Source
by grinder (Bishop) on Oct 15, 2001 at 13:31 UTC
I've been looking at improving PMSI - Perl Monks Snippets Index, namely by fetching each individual snippet to acquire its date of creation, in order to create an index page of snippets per month or per year, as per stefan k's suggestion. So I whipped up a quick script using an LWP::UserAgent object with a callback that sends the received content to an HTML::Parser object to be parsed on the fly.
It became apparent pretty quickly that the information I needed was in the first returned chunk. For the remaining chunks there was nothing left to do. (Of course, a future enhancement could be to count the number of follow-ups, but that's another story). It seemed to me that this was pretty inefficient, and an unnecessary drag on the sorely overloaded Monastery server.
So I started pondering how I could interrupt the download once I had received the information I needed. I wasn't sure that it was possible, but at least I had the source to hack in a solution if need be. I had visions of plumbing the depths of socket wizardry with a kluge of a global variable to take down the connection, and... um...
After spelunking around for a few minutes (by tracing where my callback was being passed), I came across LWP::Protocol::http, which contains a sub named collect that does the deed of fetching the bytes (at least to as low a level as I cared about). There I found the following code (which I've roughly paraphrased):
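As a rough illustration of the pattern described here (and emphatically not LWP's actual source), the idea is that the fetching loop runs the caller's callback inside an eval and stops reading as soon as the callback dies. The sub name collect_chunks, the result hash, and the canned list of chunks are all made up for this sketch:

```perl
use strict;
use warnings;

# Hypothetical stand-in for a chunked download loop. This is NOT the code
# from LWP::Protocol; it only mimics the "eval the callback, stop on die" idea.
sub collect_chunks {
    my ($next_chunk, $callback) = @_;
    my %result = (chunks => 0, died => undef);

    while (defined(my $chunk = $next_chunk->())) {
        $result{chunks}++;

        # Run the caller's code inside eval so that a die() is trapped here
        # instead of killing the whole program.
        my $ok = eval { $callback->($chunk); 1 };
        unless ($ok) {
            chomp(my $why = $@);
            $result{died} = $why;   # remember why the callback bailed out
            last;                   # and stop asking for more data
        }
    }
    return \%result;
}

# Canned "network" data: the interesting part is in the first chunk.
my @pending = ('<title>wanted</title>', 'filler 1', 'filler 2');

my $result = collect_chunks(
    sub { shift @pending },                              # chunk source
    sub { die "seen enough\n" if $_[0] =~ /<title>/ },   # caller's callback
);

printf "chunks read: %d, reason: %s, left unread: %d\n",
    $result->{chunks}, $result->{died} // 'none', scalar @pending;
```

With the canned data above, the loop stops after the first chunk and the two filler chunks are never pulled from the source.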
There it was: all I had to do was die in my callback, and the connection would be cancelled. I hacked up the following code in about 10 minutes just to prove to myself that this was the case:
(Note: I made up that really long <p> tag to see whether it was broken across chunk boundaries. If it is, HTML::Parser appears to hide that ugliness -- more power to it if it does.) And then I read that back with the following (note how I die in a callback):
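A minimal sketch of the client side, under some loud assumptions: the test page, the class="created" marker, and the die message are invented for illustration, and to keep the example runnable without a web server it reads a temporary file through a file:// URL (an http URL can be swapped in). The read size hint of 64 bytes is deliberately tiny so that the interesting tag should straddle a chunk boundary.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::Parser;
use HTTP::Request;
use LWP::UserAgent;
use URI::file;

# Hypothetical test page: the wanted data sits near the top, followed by
# plenty of filler that we would rather not download.
my ($fh, $path) = tempfile(SUFFIX => '.html');
print {$fh} qq{<html><head><title>demo</title></head><body>\n},
            qq{<p class="created" title="2001-10-15">snippet</p>\n},
            ("<p>filler</p>\n" x 1000),
            qq{</body></html>\n};
close $fh or die "close: $!";

my $created;
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($tag, $attr) = @_;
            return unless $tag eq 'p' && ($attr->{class} // '') eq 'created';
            $created = $attr->{title};
            # This die unwinds through parse() and the content callback
            # below, and is caught by LWP, which then stops fetching.
            die "got what I came for\n";
        },
        'tagname, attr',
    ],
);

my $chunks   = 0;
my $ua       = LWP::UserAgent->new;
my $request  = HTTP::Request->new(GET => URI::file->new_abs($path));
my $response = $ua->request(
    $request,
    sub { my ($chunk) = @_; $chunks++; $parser->parse($chunk) },  # content callback
    64,                                                           # read size hint
);

print "created: ", $created // 'not found', "\n";
print "chunks seen: $chunks\n";
print "X-Died: ", $response->header('X-Died') // 'not set', "\n";
unlink $path;
```

LWP records the die message in the X-Died header of the response it returns, so the caller can tell an early exit apart from a complete download.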
That seems to work pretty well. Checking the web server logs, I see the following lines appear:
Quod erat demonstrandum. Don't let them take my Open Source away.
--
g r i n d e r