PerlMonks  

Fast fetching of HTML response code

by mrguy123 (Hermit)
on May 30, 2012 at 08:28 UTC (#973220=perlquestion)
mrguy123 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks
I am writing a tool that checks whether links are working. The idea is to fetch the link, check the status code, and if the code is OK, run a few more tests on the HTML response page.
I was wondering, is there a way to quickly fetch only the response code without the rest of the HTML data? That way, if the code is bad (e.g. 401) I can move on to the next link, but if it is OK I can fetch the rest of the data for further testing.
For example, if I run the code below:
use strict;
use warnings;
use LWP::UserAgent;
use Time::HiRes;

{
    my $ua = LWP::UserAgent->new();
    my $search_address = "http://ejournals.ebsco.com/direct.asp?JournalID=101503";
    my $req = HTTP::Request->new( 'GET', $search_address );
    my $start = [ Time::HiRes::gettimeofday( ) ];
    ##Get the response object
    my $res = $ua->request($req);
    ##Get the response time and return code
    my $diff = Time::HiRes::tv_interval( $start );
    my $code = $res->code();
    print "Code $code fetched in $diff seconds\n";
}
It takes me about 1.5 seconds to get the response code (403). If I can somehow get it faster, it will make a big difference when I am testing thousands of links.
So, do you think this is even possible, or is it just wishful thinking on my part?

Thanks, Mister Guy
Note: I am using LWP::UserAgent for the testing but can also use other modules if necessary
UPDATE: Used HEAD instead of GET for the HTTP request, but the response time didn't improve.
UPDATE 2: Compared HEAD with GET on 50 different links, and for some of them the HEAD request was indeed faster. So the way to go is to use HEAD and, of course, to run the requests in parallel if you want faster link checking. Thanks for the help.

Hardware: The parts of a computer system that can be kicked.

Re: Fast fetching of HTML response code
by Anonymous Monk on May 30, 2012 at 08:41 UTC
Re: Fast fetching of HTML response code
by BrowserUk (Pope) on May 30, 2012 at 08:42 UTC

    Use a HEAD request instead of a GET request. From the HTTP spec (RFC 2616, section 9.4):

    The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request. This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification.
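
    As a rough sketch of how that could look with LWP::UserAgent: probe with HEAD first, and fetch the body with GET only when the link looks alive. The sub name check_link and the 10-second timeout in the usage comment are made up for illustration. The retry on 405/501 is an assumption about servers that refuse HEAD; it is not part of the spec text quoted above.

    ```perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    # Probe a link with HEAD; fetch the body with GET only when it looks alive.
    # Returns ( $status_code, $body_or_undef ).
    # check_link is a hypothetical helper name, not part of LWP.
    sub check_link {
        my ( $ua, $url ) = @_;

        my $res = $ua->head( $url );

        # Assumption: a server that refuses HEAD answers 405 or 501,
        # so those two codes are retried with GET instead of being reported.
        return ( $res->code, undef )
            unless $res->is_success || $res->code == 405 || $res->code == 501;

        $res = $ua->get( $url );
        return ( $res->code, $res->is_success ? $res->decoded_content : undef );
    }

    # Usage (timeout value is arbitrary):
    #   my $ua = LWP::UserAgent->new( timeout => 10 );
    #   my ( $code, $body ) = check_link( $ua, 'http://example.com/' );
    #   print defined $body ? "OK ($code)\n" : "Skipping, got $code\n";
    ```

    Note that a good link costs two round trips this way, so it only pays off when a fair share of the links are expected to fail.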

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      Thanks for the advice, I didn't know about the HEAD request.
      Unfortunately, the time it takes to get the HEAD is about the same time it takes for the regular GET request, so I'm assuming this isn't the bottleneck.
      Any other ideas?
        the time it takes to get the HEAD is about the same time it takes for the regular GET request,

        That suggests that the time taken isn't the time required to transmit the page from the server to you; but rather the time it takes the server to prepare the page.

        It's a fact of life that with the preponderance of dynamically generated content being served these days, the difference between HEAD and GET requests is minimal. For the most part, servers treat HEAD requests as GET requests but then throw away the generated page and only return the headers.
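
        One way to check whether this holds for a particular set of servers is to time both methods side by side. The following is only a sketch: the sub names time_request and compare_methods are invented here, and a single measurement per URL is noisy (the first request to a host may also pay for DNS lookup and connection setup), so treat the numbers as a hint rather than proof.

        ```perl
        use strict;
        use warnings;
        use LWP::UserAgent;
        use HTTP::Request;
        use Time::HiRes qw( gettimeofday tv_interval );

        # Time one request with the given method ('HEAD' or 'GET').
        # Returns ( $status_code, $elapsed_seconds ).
        sub time_request {
            my ( $ua, $method, $url ) = @_;
            my $start = [ gettimeofday ];
            my $res   = $ua->request( HTTP::Request->new( $method => $url ) );
            return ( $res->code, tv_interval( $start ) );
        }

        # Run HEAD and then GET against each URL; returns one hash ref per URL.
        sub compare_methods {
            my ( $ua, @urls ) = @_;
            my @rows;
            for my $url ( @urls ) {
                my ( $head_code, $head_time ) = time_request( $ua, 'HEAD', $url );
                my ( $get_code,  $get_time  ) = time_request( $ua, 'GET',  $url );
                push @rows, {
                    url       => $url,
                    head_code => $head_code, head_time => $head_time,
                    get_code  => $get_code,  get_time  => $get_time,
                };
            }
            return @rows;
        }

        # Usage:
        #   my $ua = LWP::UserAgent->new( timeout => 10 );
        #   printf "%s HEAD %.3fs GET %.3fs\n", @{$_}{qw( url head_time get_time )}
        #       for compare_methods( $ua, 'http://example.com/' );
        ```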

        That means there is no way to do what you asked for -- speed up the acquisition of the status code -- for individual pages.

        As Corion pointed out elsewhere, your best bet for reducing the overall runtime is to issue multiple concurrent GETs and so overlap the server and transmission times of those multiple GETs with your local processing of the responses.

        There are several ways of doing that. Corion suggested (one flavour of) the event-driven state machine method.

        Personally,

        • If I just needed to HEAD/GET a big list of urls, I'd use LWP::Parallel::UserAgent like so:
          #! perl -slw
          use strict;
          use Time::HiRes qw[ time ];
          use LWP::Parallel::UserAgent;
          use HTTP::Request;

          my $start = time;

          my $pua = LWP::Parallel::UserAgent->new();
          $pua->timeout( 10 );

          # Register one HEAD request per input line (lines hold urls without the scheme).
          while( my $url = <> ) {
              chomp $url;
              $pua->register( HTTP::Request->new( 'HEAD', "http://$url" ) );
          }

          # Run all registered requests in parallel and wait for them to finish.
          my $entries = $pua->wait;

          printf "Took %.6f seconds\n", time - $start;
          __END__
          c:\test>pua-head-urls urls.list
          Took 1333.616000 seconds
        • But if I needed to do further processing on the responses, and especially if I needed to aggregate information from the responses, then I'd use a pool-of-threads approach, as I find it easier to reason about and easier to combine the results, and it scales better (and automatically) on modern, multicore hardware.

          Here's a thread-pool implementation for reference:

          #! perl -slw
          use threads stack_size => 4096;
          use threads::shared;
          use Thread::Queue;
          $|++;

          our $THREADS //= 10;

          my $count :shared = 0;
          my %log :shared;

          my $Q = new Thread::Queue;

          my @threads = map async( sub {
              our $ua;
              require 'LWP/Simple.pm'; LWP::Simple->import( '$ua', 'head' );
              $ua->timeout( 10 );

              while( my $url = $Q->dequeue() ) {
                  my $start = time;
                  my @info = head( 'http://' . $url );
                  my $stop = time;

                  lock %log;
                  $log{ $url } = $stop - $start;

                  lock $count;
                  ++$count;
              }
          } ), 1 .. $THREADS;

          require 'Time/HiRes.pm'; Time::HiRes->import( qw[ time ] );

          my $start = time;

          while( <> ) {
              chomp;
              Win32::Sleep 100 if $Q->pending > $THREADS;
              $Q->enqueue( $_ );
              printf STDERR "\rProcessed $count urls";
          }

          $Q->enqueue( (undef) x $THREADS );
          printf STDERR "\rProcessed $count urls" while $Q->pending and Win32::Sleep 100;

          printf STDERR "\nTook %.6f with $THREADS threads\n", time() - $start;

          $_->join for @threads;

          my( @times, $url, $time );
          push @times, [ $url, $time ] while ( $url, $time ) = each %log;
          @times = sort{ $b->[1] <=> $a->[1] } @times;
          print join ' ', @$_ for @times[ 0 .. 9 ];
          __END__
          c:\test>t-head-urls -THREADS=30 urls.list
          Processed 2596
          Took 43.670000 with 30 threads

          More complex, but once you've reduced the overall runtime by overlapping the requests to the point where you saturate your connection bandwidth, the time spent processing the responses locally starts to dominate.

          Then the threads solution starts to come into its own because it efficiently and automatically utilises however many cores and CPU cycles are available, dynamically and transparently adjusting itself to fluctuations in the availability of those resources.

          No other solution scales so easily, nor so effectively.

        But you'll have to make up your own mind which approach suits your application and environment best.



Re: Fast fetching of HTML response code
by Corion (Pope) on May 30, 2012 at 08:47 UTC

    I would look at AnyEvent::HTTP, which allows you to run lots (and lots) of HTTP requests in parallel. To reduce bandwidth on errors, you could first test with a HEAD request and follow up with a GET request on success. If you expect most URLs to be successful and want the body in that case, I would just use a GET request.

    For grouping all your requests, have a look at the ->begin method of CondVars in AnyEvent.

    A rough, untested program could look something like this:

    #!perl -w
    use strict;
    use AnyEvent;
    use AnyEvent::HTTP;

    my $done = AnyEvent->condvar;

    while (my $url = <DATA>) {
        $url =~ s!\s+$!!;
        $done->begin( sub { $_[0]->send } );
        http_get $url, sub {
            my ($body, $headers) = @_;
            print "Retrieved $url ($headers->{Status})\n";
            $done->end
        };
    };

    print "All requests sent. Waiting for responses.\n";
    $done->recv;
    print "Done.\n";

    __DATA__
    http://localhost
    http://example.com
    http://google.in

    Update: Remove whitespace at end of the URLs
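
    The HEAD-first variant mentioned above could be sketched along the same lines. This is equally rough: the sub name probe_urls, the shape of the result hash and the 10-second timeout are assumptions made for illustration. It relies on http_head and http_get from AnyEvent::HTTP and on the begin/end counting of the condvar.

    ```perl
    use strict;
    use warnings;
    use AnyEvent;
    use AnyEvent::HTTP;

    # Probe every URL with HEAD; issue a GET only for those answering 2xx.
    # Returns a hash ref: url => { status => ..., body => ... (GET only) }.
    # probe_urls is a hypothetical helper name.
    sub probe_urls {
        my ( @urls ) = @_;
        my %result;
        my $done = AnyEvent->condvar;

        $done->begin;                   # guard, so an empty list still finishes
        for my $url ( @urls ) {
            $done->begin;
            http_head $url, timeout => 10, sub {
                my ( undef, $headers ) = @_;
                $result{ $url } = { status => $headers->{Status} };

                if ( $headers->{Status} =~ /^2/ ) {
                    $done->begin;       # one more outstanding request
                    http_get $url, timeout => 10, sub {
                        my ( $body, $get_headers ) = @_;
                        $result{ $url } = {
                            status => $get_headers->{Status},
                            body   => $body,
                        };
                        $done->end;
                    };
                }
                $done->end;
            };
        }
        $done->end;                     # release the guard

        $done->recv;                    # run the event loop until the counter hits zero
        return \%result;
    }

    # Usage:
    #   my $r = probe_urls( 'http://localhost', 'http://example.com' );
    #   print "$_: $r->{$_}{status}\n" for sort keys %$r;
    ```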

      Hi Corion, thanks for the tip
      I am currently using Parallel::ForkManager to make the process faster but will also take a look at AnyEvent::HTTP
      The HEAD option, however, does not seem to reduce the response time, so I am looking for new ideas.
      Guy
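
      For reference, the Parallel::ForkManager approach mentioned here might look roughly like the sketch below. It is not the actual tool: the sub name check_links_forked, the one-child-per-URL layout and the timeout are assumptions, and passing a data structure back through finish needs a reasonably recent Parallel::ForkManager.

      ```perl
      use strict;
      use warnings;
      use LWP::UserAgent;
      use Parallel::ForkManager;

      # Check URLs in up to $max_procs child processes at a time.
      # Returns a hash ref: url => HTTP status code.
      # check_links_forked is a hypothetical helper name.
      sub check_links_forked {
          my ( $max_procs, @urls ) = @_;
          my %status;
          my $pm = Parallel::ForkManager->new( $max_procs );

          # Runs in the parent whenever a child exits; collects the child's result.
          $pm->run_on_finish( sub {
              my ( $pid, $exit_code, $ident, $signal, $core_dump, $data ) = @_;
              $status{ $data->{url} } = $data->{code} if $data;
          } );

          for my $url ( @urls ) {
              $pm->start and next;      # parent: go on to the next URL

              # Child: one HEAD request, then hand the result to the parent.
              my $ua  = LWP::UserAgent->new( timeout => 10 );
              my $res = $ua->head( $url );
              $pm->finish( 0, { url => $url, code => $res->code } );
          }
          $pm->wait_all_children;
          return \%status;
      }

      # Usage:
      #   my $status = check_links_forked( 20, @urls );
      #   print "$_ => $status->{$_}\n" for sort keys %$status;
      ```

      Forking once per URL is simple but not cheap; handing each child a batch of URLs would be the obvious refinement.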

        You likely can't reduce the time per URL, but you can reduce the total time by making more requests in parallel. Or you can try to find out whether you are bandwidth-limited, or CPU-limited or whether the remote side is slow to respond. My guess is that the remote side is slow to respond.
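
        To get a first idea of where the time goes, one could time the TCP connect separately from the wait for the first line of the response. The sketch below uses only core modules; the sub name time_phases is made up, the request is a bare-bones HTTP/1.0 HEAD over plain HTTP, and the read has no timeout of its own. If the connect is fast and the first line is slow, that would point at the remote side rather than at the network.

        ```perl
        use strict;
        use warnings;
        use IO::Socket::INET;
        use Time::HiRes qw( gettimeofday tv_interval );

        # Returns ( $connect_seconds, $first_line_seconds, $status_line ),
        # or an empty list if the connection cannot be established.
        # time_phases is a hypothetical helper name.
        sub time_phases {
            my ( $host, $port, $path ) = @_;

            # Phase 1: name lookup (if $host is a name) plus TCP connect.
            my $t0   = [ gettimeofday ];
            my $sock = IO::Socket::INET->new(
                PeerAddr => $host,
                PeerPort => $port,
                Proto    => 'tcp',
                Timeout  => 10,         # applies to the connect only
            ) or return;
            my $connect = tv_interval( $t0 );

            # Phase 2: send a minimal HEAD request and wait for the status line.
            my $t1 = [ gettimeofday ];
            print {$sock} "HEAD $path HTTP/1.0\r\nHost: $host\r\n\r\n";
            my $status_line = <$sock>;  # blocks until the server answers
            my $first_line  = tv_interval( $t1 );
            close $sock;

            return ( $connect, $first_line, $status_line );
        }

        # Usage:
        #   my ( $connect, $wait, $line ) = time_phases( 'example.com', 80, '/' );
        #   printf "connect %.3fs, server wait %.3fs\n", $connect, $wait;
        ```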
