http://www.perlmonks.org?node_id=1215779

As I get sucked deeper and deeper into web scrapers -- the Cosmo Kramers of our era -- always with my faithful companion, LWP::UserAgent, the need arose, primarily out of courtesy to the hosts, to count the number of requests (hits) I make over a certain time interval and to hold the scraper back by sleep()ing for some time.

Eventually, I decided I wanted to know the ratio of active hitting sessions to sleep times, and also to control and tweak the hit rate, and the resulting burden on the host, for particular traffic situations (late night or noon) with just a few parameters, mainly the sleep() durations between the various phases of scraping and form filling. The latter can be imagined as a complex state machine which leads you down deterministic -- most of the time -- but highly complex paths.

And so I have devised two tools to assist me in my endeavours: one is a hit counter for LWP::UserAgent and the other is a counter of sleep() seconds which works across all sleep() calls, even in far and foreign modules.

I will proceed now to lay out a module-based implementation of so-called UserAgent-with-Stats, including a test script.

The basic idea is to subclass LWP::UserAgent in order to add a handler (via add_handler), when requested by the user, to the "request_send" phase of LWP's request(). The purpose of this handler is to increment our internal hit counter every time a request is sent by LWP (GET/POST/etc.).

Additionally, there are two time counters to help calculate the time interval between when the counter was turned on and either the last hit or the moment the counter was turned off. The aim is to know the number of hits that occurred within a time interval. Thinking about it, a time-first-hit to time-last-hit interval might make more sense.
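As a rough illustration of the first-hit to last-hit idea, here is a minimal sketch that computes a hits/hour figure from a list of hit timestamps. The helper name hits_per_hour is made up for this example and is not part of the module below; like the module, it treats a zero-length interval as one second to avoid dividing by zero.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of UserAgentWithStats.
# Takes hit timestamps (unix epoch seconds) and returns hits/hour
# measured over the first-hit to last-hit interval.
sub hits_per_hour {
    my @times = sort { $a <=> $b } @_;
    return 0 if @times < 2;             # need at least two hits to span an interval
    my $interval = $times[-1] - $times[0];
    $interval = 1 if $interval == 0;    # all hits within the same second
    return 3600 * scalar(@times) / $interval;
}

# three hits spread over ten seconds
printf "%.2f hits/hour\n", hits_per_hour(1000, 1002, 1010);   # prints "1080.00 hits/hour"
```

Whether the count should be the number of hits or the number of gaps between hits is a judgment call; the sketch uses the number of hits, as the module's own statistics do.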

Now, one may ask why there is a need to subclass instead of creating a new class which takes an LWP::UserAgent object, adds a handler to it and keeps the counters. Indeed, that is another possibility.
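For comparison, a minimal sketch of that non-subclassing alternative could look like the following. The package name HitCounter and its methods are invented for illustration; the only LWP::UserAgent calls it relies on are add_handler and remove_handler, applied to whatever user agent object is passed in.

```perl
use strict;
use warnings;

{
    package HitCounter;    # hypothetical name, for illustration only

    sub new {
        my ($class, $ua) = @_;
        return bless { ua => $ua, hits => 0, on => 0 }, $class;
    }

    sub on {
        my $self = shift;
        return if $self->{on};             # do not install the handler twice
        my $count = \$self->{hits};        # close over the counter only, not $self
        $self->{ua}->add_handler(
            request_send => sub { $$count++; return },  # returning nothing lets LWP send the request
            owner        => __PACKAGE__,   # id used to remove the handler later
        );
        $self->{on} = 1;
        return;
    }

    sub off {
        my $self = shift;
        $self->{ua}->remove_handler('request_send', owner => __PACKAGE__);
        $self->{on} = 0;
        return;
    }

    sub hits { return $_[0]{hits} }
}

# Usage (assumes LWP::UserAgent is installed; needs network access):
#   use LWP::UserAgent;
#   my $ua      = LWP::UserAgent->new;
#   my $counter = HitCounter->new($ua);
#   $counter->on;
#   $ua->get('http://www.python.org');
#   print $counter->hits, " hits\n";
```

The trade-off is the usual one: the wrapper leaves the user agent's class untouched and can be attached to an agent someone else constructed, while the subclass keeps everything behind a single object.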

In any event, that's the basic idea. I would like to ask for your comments, corrections and recommendations. I will do the same for the sleep-count module in my next post.

UserAgentWithStats.pm

package UserAgentWithStats;

=pod

####
## author: bliako
## date: 03/06/2018
## A subclass of LWP::UserAgent which adds a handler
## (if requested) to the "request_send" phase of LWP's request()
## the purpose of which is to increment our internal counter
## every time a request is sent by LWP.
## There are three counters here, one counts the hits and the
## other two record the time started (unix epoch) recording and time last-hit occurred.
## The aim is to know the amount of hits and the time interval
## they occurred in. Therefore each time a hit happens, the hit
## counter is incremented and "hit-count-time-last-hit" variable is
## updated with the current time(). There is also another time-keeping
## variable which is "hit-count-time-stopped". It records the time
## when the counter is turned off. So 2 time intervals for the same number
## of hits from turn-on to 1) time-last-hit and 2) time-stopped (or now if not stopped)
####

    use UserAgentWithStats;
    my $ua = UserAgentWithStats->new();
    my $urlstr = "http://www.python.org";
    $ua->hit_counter_on();
    print "$0 : hit count is now: ".$ua->hit_counter_statistics_toString()."\n";
    print "$0 : hitting site '$urlstr' ...\n";
    my $aresponse = $ua->get($urlstr);
    if ($aresponse->is_success) { print "$0 : success hitting $urlstr\n"; }
    else { die "$urlstr : ".$aresponse->status_line; }
    print "$0 : hit count is now: ".$ua->hit_counter_statistics_toString()."\n";
    my ($T1, $T2, $numhits) = @{$ua->hit_counter_statistics()};
    print "$0 : resetting hit counter ...\n";
    $ua->hit_counter_reset();
    print "$0 : hit count is now: ".$ua->hit_counter_statistics_toString()."\n";

=cut

use strict;
use warnings;

use parent 'LWP::UserAgent';

our $VERSION = '1.0';

sub new {
    my ($class, @params) = @_;

    # call parent constructor, passing any LWP::UserAgent options through;
    # the parent already blesses the object into $class
    my $self = $class->SUPER::new(@params);

    # extra attributes in class
    $self->{'ua-stats'} = {
        # UA can have callbacks defined so when a request is made
        # a counter is hit. When doing hit counts we record the
        # time hit count was turned on too
        'hit-count-time-started'  => -1, # unix epoch
        'hit-count-time-stopped'  => -1, # ditto
        'hit-count-time-last-hit' => -1, # ditto
        'ua-hit-count'            => 0,  # same key the accessors below use
    };
    return $self;
}

# increment a count each time a request is made, can also record the url etc.
sub hit_counter_on {
    my $self = $_[0];
    # remove a handler left over from a previous call so hits are not counted twice
    $self->remove_handler('request_send', ('owner' => 'hit_counter_on'));
    $self->add_handler(
        'request_send',
        sub {
            my ($request, $ua, $h) = @_;
            # use the $ua that LWP passes in instead of closing over $self,
            # which would create a circular reference
            $ua->register_a_hit();
            return undef; # we need this: a returned response would stop the request being sent
        },
        ('owner' => 'hit_counter_on') # use this id for when removing it
    );
    # reset previous counter and set the time when recording started
    $self->hit_counter_reset();
}

# increments the hit counter by 1
sub increment_hit_count { $_[0]->{'ua-stats'}->{'ua-hit-count'} += 1 }

# registers a hit meaning that hit counter is incremented by 1 and
# last time a hit happens becomes current time.
sub register_a_hit {
    my $self = $_[0];
    $self->increment_hit_count();
    $self->{'ua-stats'}->{'hit-count-time-last-hit'} = time;
}

sub hit_counter_reset {
    my $self = $_[0];
    $self->{'ua-stats'}->{'ua-hit-count'} = 0;
    $self->{'ua-stats'}->{'hit-count-time-started'} = time;
    $self->{'ua-stats'}->{'hit-count-time-stopped'} = -1;
    $self->{'ua-stats'}->{'hit-count-time-last-hit'} = -1;
}

sub hit_counter_off {
    my $self = $_[0];
    $self->remove_handler(
        'request_send',               # phase we set it in
        ('owner' => 'hit_counter_on') # our id to remove
    );
    $self->{'ua-stats'}->{'hit-count-time-stopped'} = time;
}

sub hit_count { return $_[0]->{'ua-stats'}->{'ua-hit-count'} }

sub time_interval_to_last_hit {
    my $self = $_[0];
    my $stats = $self->{'ua-stats'};
    return $stats->{'hit-count-time-last-hit'} == -1
        ? 0 # no hits yet
        : $stats->{'hit-count-time-last-hit'} - $stats->{'hit-count-time-started'}; # hits recorded
}

sub time_interval_to_now_or_when_stopped {
    my $self = $_[0];
    my $stats = $self->{'ua-stats'};
    return $stats->{'hit-count-time-stopped'} == -1
        # if hit-counting is still on, then time interval is up to now
        ? time - $stats->{'hit-count-time-started'}
        # else hit-counting was turned off, so give last time interval
        : $stats->{'hit-count-time-stopped'} - $stats->{'hit-count-time-started'};
}

# returns an arrayref of [interval-to-last-hit, interval-to-now-or-stopped, hits]
# see below for a string equivalent of this
sub hit_counter_statistics {
    my $self = $_[0];
    return [
        $self->time_interval_to_last_hit(),
        $self->time_interval_to_now_or_when_stopped(),
        $self->hit_count()
    ]
}

sub hit_counter_statistics_toString {
    my ($T1, $T2, $H) = @{$_[0]->hit_counter_statistics()};
    # a zero interval is treated as 1 second to avoid division by zero
    return "$H hits over $T1 s (to last hit) or over $T2 s (to now/when stopped) ("
        .sprintf("%.2f", 3600*$H/($T1==0?($T1+1):$T1))
        ." or ".sprintf("%.2f", 3600*$H/($T2==0?($T2+1):$T2))
        ." hits/hour)"
    ;
}

1;
__END__

And here is a test script:

#!/usr/bin/env perl

use strict;
use warnings;

use UserAgentWithStats;
use Test::More;

my $ua = UserAgentWithStats->new();
my $urlstr = "http://www.python.org";
my $num_tests = 0;

print "$0 : turning hit counter on ...\n";
$ua->hit_counter_on();
sleep(1);
print "$0 : hit count is now: ".$ua->hit_counter_statistics_toString()."\n";
my ($intervalT1, $intervalT2, $numhits) = @{$ua->hit_counter_statistics()};
ok(($intervalT1==0), "time intervals (up-to-last-hit) must be zero on turn-on ($intervalT1)"); $num_tests++;
ok(($intervalT2>0), "time intervals (up-to-stopped) records since turn-on irrespective of hits ($intervalT2)"); $num_tests++;
is($numhits, 0, "0 hits at turn-on"); $num_tests++;

print "$0 : hitting site with GET '$urlstr' ...\n";
my $aresponse = $ua->get($urlstr);
if ($aresponse->is_success) { print "$0 : success hitting $urlstr\n"; }
else { die "$urlstr : ".$aresponse->status_line; } # method calls do not interpolate inside strings
print "$0 : hit count is now: ".$ua->hit_counter_statistics_toString()."\n";
($intervalT1, $intervalT2, $numhits) = @{$ua->hit_counter_statistics()};
ok(($intervalT1>0)&&($intervalT1<20), "time interval (up-to-stopped) must be positive (let's say 1-20 seconds)."); $num_tests++;
ok(($intervalT2>0)&&($intervalT2<20), "time interval (up-to-last-hit) must be positive (let's say 1-20 seconds)."); $num_tests++;
is($numhits, 2, "2 hits because of a redirect"); $num_tests++;

print "$0 : turning hit counter off ...\n";
$ua->hit_counter_off();
print "$0 : hit count is now: ".$ua->hit_counter_statistics_toString()."\n";

print "$0 : hitting site with GET '$urlstr' ...\n";
$aresponse = $ua->get($urlstr);
if ($aresponse->is_success) { print "$0 : success hitting $urlstr\n"; }
else { die "$urlstr : ".$aresponse->status_line; }
print "$0 : hit count is now: ".$ua->hit_counter_statistics_toString()."\n";
my ($interval2T1, $interval2T2, $numhits2) = @{$ua->hit_counter_statistics()};
is($interval2T1, $intervalT1, "no change in time interval because counter is off"); $num_tests++;
is($interval2T2, $intervalT2, "no change in time interval because counter is off"); $num_tests++;
is($numhits2, $numhits, "no change in hits because counter is off"); $num_tests++;

print "$0 : turning hit counter back on again ...\n";
$ua->hit_counter_on();
print "$0 : hit count is now: ".$ua->hit_counter_statistics_toString()."\n";
is($ua->hit_count(), 0, "hit count after starting"); $num_tests++;
sleep(1);
ok($ua->time_interval_to_now_or_when_stopped()>0, "time interval to-now after starting must be positive integer (after slept for 1)"); $num_tests++;
is($ua->time_interval_to_last_hit(), 0, "time interval since last hit must be zero, no hits yet"); $num_tests++;

print "$0 : hitting site with GET '$urlstr' ...\n";
$aresponse = $ua->get($urlstr);
if ($aresponse->is_success) { print "$0 : success hitting $urlstr\n"; }
else { die "$urlstr : ".$aresponse->status_line; }
print "$0 : hit count is now: ".$ua->hit_counter_statistics_toString()."\n";
($intervalT1, $intervalT2, $numhits) = @{$ua->hit_counter_statistics()};
ok(($intervalT1>0)&&($intervalT1<20), "time interval must be positive (let's say 1-20 seconds)."); $num_tests++;
ok(($intervalT2>0)&&($intervalT2<20), "time interval must be positive (let's say 1-20 seconds)."); $num_tests++;
is($numhits, 2, "2 hits because of a redirect"); $num_tests++;

print "$0 : hitting site with GET '$urlstr' again ...\n";
$aresponse = $ua->get($urlstr);
if ($aresponse->is_success) { print "$0 : success hitting $urlstr\n"; }
else { die "$urlstr : ".$aresponse->status_line; }
print "$0 : hit count is now: ".$ua->hit_counter_statistics_toString()."\n";
($intervalT1, $intervalT2, $numhits) = @{$ua->hit_counter_statistics()};
ok(($intervalT1>0)&&($intervalT1<20), "time interval must be positive (let's say 1-20 seconds)."); $num_tests++;
ok(($intervalT2>0)&&($intervalT2<20), "time interval must be positive (let's say 1-20 seconds)."); $num_tests++;
is($numhits, 4, "2 more hits because of a redirect"); $num_tests++;

print "$0 : resetting hit counter ...\n";
$ua->hit_counter_reset();
print "$0 : hit count is now: ".$ua->hit_counter_statistics_toString()."\n";
($intervalT1, $intervalT2, $numhits) = @{$ua->hit_counter_statistics()};
is($intervalT1, 0, "time interval must be zero after reset"); $num_tests++;
is($intervalT2, 0, "time interval must be zero after reset"); $num_tests++;
is($numhits, 0, "number of hits must be zero after reset"); $num_tests++;

sleep(1);
$urlstr = 'https://www.w3schools.com/action_page.php';
my $form = {'fname' => 'abc', 'lname' => 'fool on indenting'};
print "$0 : hitting site with a POST '$urlstr' ...\n";
$aresponse = $ua->post($urlstr, $form);
if ($aresponse->is_success) { print "$0 : success hitting $urlstr\n"; }
else { die "$urlstr : ".$aresponse->status_line; }
print "$0 : hit count is now: ".$ua->hit_counter_statistics_toString()."\n";
($intervalT1, $intervalT2, $numhits) = @{$ua->hit_counter_statistics()};
ok(($intervalT1>0)&&($intervalT1<20), "time interval must be positive (let's say 1-20 seconds)."); $num_tests++;
ok(($intervalT2>0)&&($intervalT2<20), "time interval must be positive (let's say 1-20 seconds)."); $num_tests++;
is($numhits, 1, "1 hit this time..."); $num_tests++;

done_testing($num_tests);
print "$0 : done.\n";

I will detail the sleep-count in my next post.

Thanks, bliako

Re: RFC: LWP::UserAgent hit counter
by Anonymous Monk on Jun 04, 2018 at 02:37 UTC

      Hi, this is interesting, thanks!

      That said, let me clarify my situation a bit more: I prefer to handle the throttling myself. For example, sometimes I get a server timeout, in which case I repeat my hit, but only after sleeping for a longish time (because I know they are probably doing a backup, as it occurs at more or less the same time each day). Normally, though, I sleep for shorter times in a loop. Some pages I access less often, and I loop over them with a very small sleep value; other pages I access more frequently, and the sleep time must be longer.

      Most importantly, I need my sleeps to be variable, seemingly random. Right now, they come out from a random distribution with a mean and a standard deviation which I control.

      From the source code of the package you mentioned, it looks like it overrides the send_request() method so that it sleeps for a FIXED amount of time and then makes the request. The throttle value (sleep seconds) could be replaced by a throttle function which returns a random number of seconds to sleep, drawn from a statistical distribution. That can be useful. However, my need for different throttles in different situations (i.e. GET/POST requests to the same site and not just different websites) still exists.
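To make the idea of per-situation random sleeps concrete, here is a minimal sketch that draws a sleep duration from a normal distribution with a chosen mean and standard deviation (Box-Muller transform), clamped to a minimum. The function name, the situation names and all the numbers are assumptions made up for this example; they are not taken from any existing module.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a sleep duration in seconds drawn from
# a normal distribution N(mean, stddev), never less than $min.
sub random_sleep_seconds {
    my ($mean, $stddev, $min) = @_;
    $min = 0 unless defined $min;
    my $u1 = 1 - rand();    # in (0, 1], so log() is always defined
    my $u2 = rand();
    my $pi = 4 * atan2(1, 1);
    my $z  = sqrt(-2 * log($u1)) * cos(2 * $pi * $u2);   # standard normal deviate
    my $secs = $mean + $stddev * $z;
    return $secs < $min ? $min : $secs;
}

# made-up (mean, stddev) pairs for different situations
my %throttle = (
    GET     => [   2,  0.5 ],
    POST    => [   5,  1.5 ],
    timeout => [ 300, 60   ],
);

for my $situation (qw(GET POST timeout)) {
    my ($mean, $sd) = @{ $throttle{$situation} };
    printf "%-7s would sleep %.2f s\n", $situation,
        random_sleep_seconds($mean, $sd, 0.1);
}
```

The built-in sleep() only honours whole seconds, so fractional values like these would need the sleep from Time::HiRes, or rounding.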

Re: RFC: LWP::UserAgent hit counter
by Aldebaran (Curate) on Jul 06, 2018 at 19:37 UTC

    How typical is this output? I got it to be different from this once.

    $ ./1.ua.pl
    ./1.ua.pl : turning hit counter on ...
    ./1.ua.pl : hit count is now: 0 hits over 0 s (to last hit) or over 1 s (to now/when stopped) (0.00 or 0.00 hits/hour)
    ok 1 - time intervals (up-to-last-hit) must be zero on turn-on (0)
    ok 2 - time intervals (up-to-stopped) records since turn-on irrespective of hits (1)
    ok 3 - 0 hits at turn-on
    ./1.ua.pl : hitting site with GET 'http://www.python.org' ...
    ./1.ua.pl : success hitting http://www.python.org
    ./1.ua.pl : hit count is now: 2 hits over 2 s (to last hit) or over 3 s (to now/when stopped) (3600.00 or 2400.00 hits/hour)
    ok 4 - time interval (up-to-stopped) must be positive (let's say 1-20 seconds).
    ok 5 - time interval (up-to-last-hit) must be positive (let's say 1-20 seconds).
    ok 6 - 2 hits because of a redirect
    ./1.ua.pl : turning hit counter off ...
    ./1.ua.pl : hit count is now: 2 hits over 2 s (to last hit) or over 3 s (to now/when stopped) (3600.00 or 2400.00 hits/hour)
    ./1.ua.pl : hitting site with GET 'http://www.python.org' ...
    ./1.ua.pl : success hitting http://www.python.org
    ./1.ua.pl : hit count is now: 2 hits over 2 s (to last hit) or over 3 s (to now/when stopped) (3600.00 or 2400.00 hits/hour)
    ok 7 - no change in time interval because counter is off
    ok 8 - no change in time interval because counter is off
    ok 9 - no change in hits because counter is off
    ./1.ua.pl : turning hit counter back on again ...
    ./1.ua.pl : hit count is now: 0 hits over 0 s (to last hit) or over 0 s (to now/when stopped) (0.00 or 0.00 hits/hour)
    ok 10 - hit count after starting
    ok 11 - time interval to-now after starting must be positive integer (after slept for 1)
    ok 12 - time interval since last hit must be zero, no hits yet
    ./1.ua.pl : hitting site with GET 'http://www.python.org' ...
    ./1.ua.pl : success hitting http://www.python.org
    ./1.ua.pl : hit count is now: 2 hits over 2 s (to last hit) or over 3 s (to now/when stopped) (3600.00 or 2400.00 hits/hour)
    ok 13 - time interval must be positive (let's say 1-20 seconds).
    ok 14 - time interval must be positive (let's say 1-20 seconds).
    ok 15 - 2 hits because of a redirect
    ./1.ua.pl : hitting site with GET 'http://www.python.org' again ...
    ./1.ua.pl : success hitting http://www.python.org
    ./1.ua.pl : hit count is now: 4 hits over 8 s (to last hit) or over 9 s (to now/when stopped) (1800.00 or 1600.00 hits/hour)
    ok 16 - time interval must be positive (let's say 1-20 seconds).
    ok 17 - time interval must be positive (let's say 1-20 seconds).
    ok 18 - 2 more hits because of a redirect
    ./1.ua.pl : resetting hit counter ...
    ./1.ua.pl : hit count is now: 0 hits over 0 s (to last hit) or over 0 s (to now/when stopped) (0.00 or 0.00 hits/hour)
    ok 19 - time interval must be zero after reset
    ok 20 - time interval must be zero after reset
    ok 21 - number of hits must be zero after reset
    ./1.ua.pl : hitting site with a POST 'https://www.w3schools.com/action_page.php' ...
    ./1.ua.pl : success hitting https://www.w3schools.com/action_page.php
    ./1.ua.pl : hit count is now: 1 hits over 1 s (to last hit) or over 1 s (to now/when stopped) (3600.00 or 3600.00 hits/hour)
    ok 22 - time interval must be positive (let's say 1-20 seconds).
    ok 23 - time interval must be positive (let's say 1-20 seconds).
    ok 24 - 1 hit this time...
    1..24
    ./1.ua.pl : done.
    $

    I wonder how this output might change with different sites and different forms.

      ok 16 - time interval must be positive (let's say 1-20 seconds).
      

      The test script allows for fairly long response times: it does not test for an exact number of seconds, but rather for the exact number of hits within some reasonable time interval.