Re^2: Fast fetching of HTML response code

by mrguy123 (Hermit)
on May 30, 2012 at 08:59 UTC


in reply to Re: Fast fetching of HTML response code
in thread Fast fetching of HTML response code

Thanks for the advice, I didn't know about the HEAD request.
Unfortunately, the time it takes to get the HEAD is about the same as for the regular GET request, so I'm assuming this isn't the bottleneck.
Any other ideas?


Re^3: Fast fetching of HTML response code
by BrowserUk (Pope) on May 30, 2012 at 10:16 UTC
    the time it takes to get the HEAD is about the same time it takes for the regular GET request,

    That suggests that the time taken isn't the time required to transmit the page from the server to you; but rather the time it takes the server to prepare the page.

    It's a fact of life that with the preponderance of dynamically generated content being served these days, the difference between HEAD and GET requests is minimal. For the most part, servers treat HEAD requests as GET requests but then throw away the generated page and only return the headers.
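
    (An illustrative sketch, not from the original post: you can check this for yourself by timing a HEAD against a GET for one of your own URLs with a stock LWP::UserAgent. The example.com URL below is only a placeholder.)

      #! perl -slw
      use strict;
      use LWP::UserAgent;
      use Time::HiRes qw[ time ];

      my $url = shift // 'http://www.example.com/';
      my $ua  = LWP::UserAgent->new( timeout => 10 );

      # Time a headers-only request...
      my $start = time;
      my $res   = $ua->head( $url );
      printf "HEAD %s  %.3f seconds\n", $res->code, time - $start;

      # ...and a full GET of the same page, for comparison.
      $start = time;
      $res   = $ua->get( $url );
      printf "GET  %s  %.3f seconds\n", $res->code, time - $start;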

    That means there is no way to do what you asked for -- speed up the acquisition of the status code -- for individual pages.

    As Corion pointed out elsewhere, your best bet for reducing the overall runtime is to issue multiple concurrent GETs and so overlap the server and transmission times of those multiple GETs with your local processing of the responses.

    There are several ways of doing that. Corion suggested (one flavour of) the event-driven state machine method.
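
    For illustration only (this isn't Corion's code; AnyEvent::HTTP is just one of several event-driven toolkits you could use), that approach might look roughly like this, reading host names from a file given on the command line:

      #! perl -slw
      use strict;
      use AnyEvent;
      use AnyEvent::HTTP;

      my $cv = AE::cv;                      # counts requests still in flight

      while ( my $url = <> ) {
          chomp $url;
          $cv->begin;                       # one more request outstanding
          http_head "http://$url",
              timeout => 10,
              sub {
                  my ( undef, $hdr ) = @_;  # HEAD: headers only, no body
                  print "$url => $hdr->{Status}";
                  $cv->end;                 # this request is done
              };
      }

      $cv->recv;                            # run the event loop until every request has completed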

    Personally,

    • If I just needed to HEAD/GET a big list of urls, I'd use LWP::Parallel::UserAgent like so (see the note after this list for pulling the status codes back out of $entries):
      #! perl -slw
      use strict;
      use Time::HiRes qw[ time ];
      use LWP::Parallel::UserAgent;
      use HTTP::Request;

      my $start = time;

      my $pua = LWP::Parallel::UserAgent->new();
      $pua->timeout( 10 );

      $pua->register( HTTP::Request->new( 'HEAD', "http://$_" ) ) while <>;
      my $entries = $pua->wait;

      printf "Took %.6f seconds\n", time - $start;
      __END__
      c:\test>pua-head-urls urls.list
      Took 1333.616000 seconds
    • But if I needed to do further processing on the responses, and especially if I needed to aggregate information from the responses together, then I'd use a pool-of-threads approach, as I find it easier to reason about, easier to combine the results from, and it scales better (and automatically) to modern, multicore hardware.

      Here's a thread-pool implementation for reference:

      #! perl -slw
      use threads stack_size => 4096;
      use threads::shared;
      use Thread::Queue;
      $|++;

      our $THREADS //= 10;

      my $count :shared = 0;
      my %log :shared;
      my $Q = new Thread::Queue;

      my @threads = map async( sub {
          our $ua;
          require 'LWP/Simple.pm';
          LWP::Simple->import( '$ua', 'head' );
          $ua->timeout( 10 );

          while( my $url = $Q->dequeue() ) {
              my $start = time;
              my @info = head( 'http://' . $url );
              my $stop = time;
              lock %log;
              $log{ $url } = $stop - $start;
              lock $count;
              ++$count;
          }
      } ), 1 .. $THREADS;

      require 'Time/HiRes.pm';
      Time::HiRes->import( qw[ time ] );

      my $start = time;
      while( <> ) {
          chomp;
          Win32::Sleep 100 if $Q->pending > $THREADS;
          $Q->enqueue( $_ );
          printf STDERR "\rProcessed $count urls";
      }
      $Q->enqueue( (undef) x $THREADS );

      printf STDERR "\rProcessed $count urls" while $Q->pending and Win32::Sleep 100;
      printf STDERR "\nTook %.6f with $THREADS threads\n", time() - $start;

      $_->join for @threads;

      my( @times, $url, $time );
      push @times, [ $url, $time ] while ( $url, $time ) = each %log;
      @times = sort{ $b->[1] <=> $a->[1] } @times;
      print join ' ', @$_ for @times[ 0 .. 9 ];
      __END__
      c:\test>t-head-urls -THREADS=30 urls.list
      Processed 2596 Took 43.670000 with 30 threads

      More complex, but once you've reduced the overall runtime by overlapping the requests to the point where you saturate your connection bandwidth, the time spent processing the responses locally starts to dominate.

      Then the threads solution starts to come into its own because it efficiently and automatically utilises however many cores and CPU cycles are available, dynamically and transparently adjusting itself to fluctuations in the availability of those resources.

      No other solution scales so easily, nor so effectively.
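
    A note for completeness (not from the original post): with the LWP::Parallel::UserAgent version above, the status codes the OP is after can be read back out of the hash of entries that wait() returns, roughly along these lines:

      # Each entry carries the HTTP::Response for its request.
      for my $key ( keys %$entries ) {
          my $res = $entries->{ $key }->response;
          printf "%s => %s\n", $res->request->url, $res->code;
      }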

    But you'll have to make up your own mind which approach suits your application and environment best.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?
