Re: Fast fetching of HTML response code

by Corion (Patriarch)
on May 30, 2012 at 08:47 UTC ( [id://973224] )


in reply to Fast fetching of HTML response code

I would look at AnyEvent::HTTP, which allows you to run lots (and lots) of HTTP requests in parallel. To reduce bandwidth on errors, you could first test with a HEAD request and follow up with a GET request on success. If you expect most URLs to be successful and want the body anyway, I would just use a GET request.

For grouping all your requests, have a look at the ->begin method of CondVars in AnyEvent.

A rough, untested program could look something like this:

#!perl -w
use strict;
use AnyEvent;
use AnyEvent::HTTP;

my $done = AnyEvent->condvar;

while (my $url = <DATA>) {
    $url =~ s!\s+$!!;    # strip trailing whitespace, including the newline
    # One begin per request; the callback fires once every end has been called
    $done->begin( sub { $_[0]->send } );
    http_get $url, sub {
        my ($body, $headers) = @_;
        print "Retrieved $url ($headers->{Status})\n";
        $done->end
    };
};

print "All requests sent. Waiting for responses.\n";
$done->recv;
print "Done.\n";

__DATA__
http://localhost
http://example.com
http://google.in

Update: Remove whitespace at end of the URLs
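
The HEAD-then-GET idea from the first paragraph could be sketched roughly as follows. This is an untested sketch, not a definitive implementation: `is_success` and `fetch_if_ok` are hypothetical helper names, the two URLs are placeholders, and treating every 2xx status as "success" is an assumption.

```perl
#!perl -w
use strict;
use AnyEvent;
use AnyEvent::HTTP;

# Hypothetical helper: treat any 2xx status as success (an assumption)
sub is_success {
    my ($status) = @_;
    return $status =~ /^2\d\d$/ ? 1 : 0;
}

# Hypothetical helper: probe with HEAD, fetch the body only on success
sub fetch_if_ok {
    my ($url, $done) = @_;
    $done->begin;
    http_head $url, sub {
        my (undef, $headers) = @_;
        if (is_success($headers->{Status})) {
            http_get $url, sub {
                my ($body, $headers) = @_;
                printf "%s: %s, %d bytes\n",
                    $url, $headers->{Status}, length($body // '');
                $done->end;
                return;
            };
        } else {
            print "$url: skipped ($headers->{Status} $headers->{Reason})\n";
            $done->end;
        }
        # Explicit empty return, so the guard object that http_get returns
        # in non-void context is never created and then dropped
        return;
    };
    return;
}

my $done = AnyEvent->condvar;
$done->begin;    # guard: keeps the counter above zero while requests are queued
fetch_if_ok($_, $done) for qw( http://localhost http://example.com );
$done->end;
$done->recv;
print "Done.\n";
```

Note that a successful URL now costs two round trips, so this only pays off when many URLs are expected to fail or the bodies are large.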

Re^2: Fast fetching of HTML response code
by mrguy123 (Hermit) on May 30, 2012 at 09:06 UTC
    Hi Corion, thanks for the tip.
    I am currently using Parallel::ForkManager to make the process faster, but I will also take a look at AnyEvent::HTTP.
    The HEAD option, however, does not seem to reduce the response time, so I am looking for new ideas.
    Guy

      You likely can't reduce the time per URL, but you can reduce the total time by making more requests in parallel. Or you can try to find out whether you are bandwidth-limited or CPU-limited, or whether the remote side is slow to respond. My guess is that the remote side is slow to respond.
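
One way to check where the time goes is to time each request individually and compare that with the total wall-clock time. The following is a minimal, untested sketch: the URL list is a placeholder, and the values 16 for `$AnyEvent::HTTP::MAX_PER_HOST` and 10 seconds for `timeout` are arbitrary illustrative choices.

```perl
#!perl -w
use strict;
use AnyEvent;
use AnyEvent::HTTP;
use Time::HiRes qw(time);

# AnyEvent::HTTP limits simultaneous connections per host;
# raise the limit if many URLs point at the same server (16 is arbitrary)
$AnyEvent::HTTP::MAX_PER_HOST = 16;

my @urls = qw( http://localhost http://example.com );    # placeholder URLs
my $done = AnyEvent->condvar;
my $t0   = time;

$done->begin;    # guard: keeps the counter above zero while requests are queued
for my $url (@urls) {
    my $started = time;    # includes any time spent waiting in the per-host queue
    $done->begin;
    http_head $url, timeout => 10, sub {
        my (undef, $headers) = @_;
        printf "%-30s %s %.3fs\n", $url, $headers->{Status}, time - $started;
        $done->end;
        return;
    };
}
$done->end;
$done->recv;

printf "Total: %.3fs for %d URLs\n", time - $t0, scalar @urls;
```

If the individual times are long while the total stays close to the slowest single request, the requests are overlapping well and the remote side is the likely bottleneck. If the total approaches the sum of the individual times, the requests are effectively running one after another.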
