conditional testing for error 500 webpages before following them?

by fraizerangus (Sexton)
on Oct 15, 2011 at 13:09 UTC
fraizerangus has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks

The website I am trying to scrape has some links which can't be followed because of a server issue; when I iterate through the links on the page, the program crashes because of these 'down links'. What is the best way of testing these links before I follow them and extract the URL?

An example of such a down link is as follows: http://www.molmovdb.org/cgi-bin/motion.cgi?ID=ppar

Would I need Test::WWW::Mechanize to test each link before following it? Also, is it possible to iterate get($url) in a loop over the links? Everything I've tried so far doesn't allow me to do so; it wants an absolute URL.

Using the following code:

#!/usr/bin/perl
use strict;
use WWW::Mechanize;
use Storable;

my $mech_cgi = WWW::Mechanize->new;
$mech_cgi->get( 'http://www.molmovdb.org/cgi-bin/browse.cgi' );

my @cgi_links = $mech_cgi->find_all_links( url_regex => qr/motion.cgi/ );

for ( my $i = 0; $i < @cgi_links; $i++ ) {
    print "following link: ", $cgi_links[$i]->url, "\n";
    $mech_cgi->follow_link( url => $cgi_links[$i]->url )
        or die "Error following link ", $cgi_links[$i]->url;
    $mech_cgi->back;
}

many thanks and best wishes

Dan

Re: conditional testing for error 500 webpages before following them?
by roboticus (Canon) on Oct 15, 2011 at 14:09 UTC

    fraizerangus:

    Crashes? You've got a die statement in there. Perhaps you'd be better served by using print instead!

    Of course, this is only a guess, since you didn't give the error message you receive. But you definitely don't want to die if you want to continue after errors.
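    For instance, with autocheck => 0 (see the next reply for why that matters), one hedged way to report the failure and carry on instead of dying:

        my $res = $mech_cgi->follow_link( url => $cgi_links[$i]->url );
        # follow_link() returns undef if the link isn't found, else an
        # HTTP::Response; check both cases, report, and keep looping
        print "Error following link ", $cgi_links[$i]->url, "\n"
            unless $res && $res->is_success;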

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: conditional testing for error 500 webpages before following them?
by Perlbotics (Abbot) on Oct 15, 2011 at 14:19 UTC

    From WWW::Mechanize:

    autocheck => [0|1]

    Checks each request made to see if it was successful. This saves you the trouble of manually checking yourself. Any errors found are errors, not warnings.

    The default value is ON, unless it's being subclassed, in which case it is OFF. This means that standalone WWW::Mechanize instances have autocheck turned on, which is protective for the vast majority of Mech users who don't bother checking the return value of get() and post() and can't figure out why their code fails. However, if WWW::Mechanize is subclassed, such as for Test::WWW::Mechanize or Test::WWW::Mechanize::Catalyst, this may not be an appropriate default, so it's off.
    Here, errors means die(). Since your program uses autocheck => 1 by default, it dies when a problem occurs while calling $mech_cgi->follow_link(...); it never reaches your own call to die - which is not what you want anyway, as already observed by roboticus. (Updated: paragraph)

    Now you have at least two options:

    • Wrap the calls that can fail in an eval block and check for exceptions ($@) - a sketch follows below - or
    • create the WWW::Mechanize object using ...new( autocheck => 0 ) and check the results (see HTTP::Response) of the $mech_cgi calls for problems.
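    Example (1st alternative) - a minimal sketch that keeps the default autocheck => 1 and traps the die() in an eval block (the error text lands in $@):

    use strict;
    use WWW::Mechanize;

    my $mech_cgi = WWW::Mechanize->new;   # autocheck => 1 is the default
    $mech_cgi->get( 'http://www.molmovdb.org/cgi-bin/browse.cgi' );

    for my $link ( $mech_cgi->find_all_links( url_regex => qr/motion.cgi/ ) ) {
        print "following link: ", $link->url, "\n";
        eval { $mech_cgi->follow_link( url => $link->url ) };
        if ( $@ ) {
            print "ERR: $@";    # autocheck's die() message ends up in $@
            next;               # carry on with the next link
        }
        print "OK : Processing result ...\n";
        $mech_cgi->back;
    }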

    Example (2nd alternative):

    use strict;
    use WWW::Mechanize;
    use Storable;

    my $mech_cgi = WWW::Mechanize->new( autocheck => 0 );
    $mech_cgi->get( 'http://www.molmovdb.org/cgi-bin/browse.cgi' );

    my @cgi_links = $mech_cgi->find_all_links( url_regex => qr/motion.cgi/ );

    for my $link ( @cgi_links ) {   # no C-style loop...
        print "following link: ", $link->url, "\n";
        my $res = $mech_cgi->follow_link( url => $link->url );
        # $res is a HTTP::Response object
        if ( $res->is_success ) {
            print "OK : Processing result ...\n";
        }
        else {
            print "ERR: Failed to retrieve page: ", $res->status_line, "\n";
        }
        $mech_cgi->back;
        sleep 5;   # anti-aggressive scraping
    }

    Result:

    ...
    following link: http://www.molmovdb.org/cgi-bin/motion.cgi?ID=ntrc
    OK : Processing result ...
    following link: http://www.molmovdb.org/cgi-bin/motion.cgi?ID=ppar
    ERR: Failed to retrieve page: 500 Internal Server Error
    following link: http://www.molmovdb.org/cgi-bin/motion.cgi?ID=rhorbp
    OK : Processing result ...
    ...

    Please also check whether you have permission to scrape this site.
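    One mechanical way to check at least the robots.txt side of that is WWW::RobotRules - a sketch with a made-up agent name (it says nothing about the site's terms of use):

    use LWP::Simple qw(get);
    use WWW::RobotRules;

    my $rules = WWW::RobotRules->new('MyScraperBot/0.1');   # hypothetical agent name
    my $robots_url = 'http://www.molmovdb.org/robots.txt';
    $rules->parse( $robots_url, get($robots_url) // '' );   # treat a missing file as empty

    my $target = 'http://www.molmovdb.org/cgi-bin/motion.cgi?ID=ppar';
    print $rules->allowed($target) ? "allowed\n" : "disallowed by robots.txt\n";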

Re: conditional testing for error 500 webpages before following them?
by Marshall (Prior) on Oct 15, 2011 at 16:18 UTC
    One thing to consider with scraping web pages, whether you use Mechanize or not, is that many servers just "barf" on some requests. That's the way it is - I try to design my servers so that they don't do that - but not everybody does.

    When you are the human, you just click again, and that one failure out of 2,000 requests doesn't even register in your brain. But if I have to run 8,000 requests, then it matters...

    Here is some code that you can adapt:

    You should do a "retry" before deciding that a link is "dead". I show one way below. This server barfs with error 500 or the like on about 1 in 2,000 requests.

    The redo RETRY skips re-evaluating the (while) condition and continues on with a new GET. I don't bother to "skip around" the "clean-up" code before the GET, because it runs really fast and, again, this only happens about 1 in 2,000 times.

    Hope this idea helps you. This is real-world stuff that does happen. I sleep a little bit to "be nice". This code works against a paid subscription and I am not as "nice" as I would be if this were a free interface, but even so I am a little nice when the server "barfs".

    The main point here is the use of RETRY: (which is my label) and redo (which is the Perl keyword).

    RETRY: while (my $n_attempt = 0, my $callsign = <>)
    {
        $callsign = uc($callsign);   # uppercase
        $callsign =~ s/^\s*//;       # no leading spaces
        $callsign =~ s/\s*$//;       # no trailing spaces, does chomp() also..
        next if $callsign eq "";     # skip NULL (blank lines)!

        my $callsign = (split(/,/, $callsign))[0]; # allow histogram format:
                                                   # w6oat,234 or just w6oat
        next if ($callsign =~ /^[a-zA-Z]\d{1}[a-zA-Z]$/); # like N7A
            # NO PROCESSING OF 1X1 US CALLSIGNS!!!

        print STDERR "working on $callsign\n" if DEBUG;

        my $req = GET "http://www.qrz.com/xml?s=$key;callsign=$callsign";
        my $res = $ua->request($req);

        unless ($res->is_success)
        {
            $n_attempt++;
            print STDERR "$callsign ERROR: Try# $n_attempt of ".MAX_RETRY.
                         " err:". $res->status_line ."\n";
            sleep(1);
            redo RETRY if $n_attempt <= MAX_RETRY;

            print STDERR "$callsign ERROR: Try# $n_attempt of ".MAX_RETRY.
                         " FAILED: ". $res->status_line . "\n";
            next;   # skip this callsign and go to the next one.
                    # This ain't gonna happen unless the QRZ server is
                    # down. "if ($res->is_success)" means we got some kind
                    # of response from the server. The QRZ server will
                    # barf on about 1/2000 requests, hence the retries.
        }
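    For the OP's Mechanize loop, the same redo idea can be condensed like this - a sketch, assuming the autocheck => 0 object from Perlbotics' reply above ($link->url_abs supplies the absolute URL that get() wants, which also sidesteps the relative-URL complaint):

    use constant MAX_RETRY => 3;

    LINK: for my $link ( @cgi_links ) {
        my $attempt = 0;
        {   # a bare block is a loop that runs once, so redo re-enters it
            my $res = $mech_cgi->get( $link->url_abs );  # fresh absolute GET; no page state to restore
            if ( !$res->is_success ) {
                if ( ++$attempt <= MAX_RETRY ) {
                    sleep 1;    # be a little nice before retrying
                    redo;       # issue a fresh GET for the same link
                }
                print "giving up on ", $link->url_abs, ": ", $res->status_line, "\n";
                next LINK;      # treat it as a dead link and move on
            }
            print "OK: ", $link->url_abs, "\n";
        }
    }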
