Verifying external web links

nu2perl has asked for the wisdom of the Perl Monks concerning the following question:

Re: Verifying external web links by Masem (Monsignor) on Dec 05, 2001 at 18:58 UTC
As suggested already, LWP is your solution. However, I will point out that any solution should not be 'try once and fail', but should instead be along the lines of '3 strikes and then fail'. That is, with the connectivity of the internet today, while most major commercial sites are up 99.9+% of the time, many offbeat sites will sometimes be inaccessible because of a lower-grade ISP (e.g. one serving residential broadband). These sites might not be up at the time you try them, but maybe 2 minutes, 2 hours, or 2 days later they will be. The best way to do link checking is to test a site; if it is not there, try it again the next day, then the next week, and then possibly the week after that, ideally at sufficiently different times of day (midnight, 6 a.m., noon, 6 p.m.). This should cover things like DNS resolution issues, network outages, and equipment replacements that might occur. If a site fails all 3 or 4 times, then it's probably gone.
-----------------------------------------------------
Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain "I can see my house from here!" It's not what you know, but knowing how to find it if you don't know that's important
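The '3 strikes' idea can be sketched as a small piece of bookkeeping. This is only an illustration, not a full checker: the name `record_check`, the `$MAX_STRIKES` threshold, and the rule that one success clears earlier failures are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical policy: a link is dropped only after this many failed checks.
my $MAX_STRIKES = 3;

# Record one check result for $url in %$state and return the link's status:
# 'ok', 'retry' (failed, but strikes remain) or 'dead' (out of strikes).
sub record_check {
    my ($state, $url, $succeeded) = @_;
    if ($succeeded) {
        $state->{$url} = 0;    # assumption: any success clears earlier strikes
        return 'ok';
    }
    my $strikes = ++$state->{$url};
    return $strikes >= $MAX_STRIKES ? 'dead' : 'retry';
}

# Example: the same link fails on three separate runs.
my %strikes;
for my $run (1 .. 3) {
    my $status = record_check(\%strikes, 'http://example.com/page', 0);
    print "run $run: $status\n";
}
```

Since the retries are meant to be days apart, the strike counts would have to be saved between runs (for instance in a file); that part is left out here.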
Re: Re: Verifying external web links by Fastolfe (Vicar) on Dec 06, 2001 at 01:03 UTC
Whether or not you retry should depend on the nature of the failure. If you get back a 400- or 500-series response, you should generally stop there, since the server has pretty much stated, "No way, no how." Possible exceptions are a 408 (request timeout) response and, arguably, a 500, since the error may be temporary. If, on the other hand, the request fails because of a connection problem (connection refused, timed out, no route to host), as you discuss, I might wait a bit (hours? days?) and try again.
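One way to write that rule down is a small decision function. This is a sketch under stated assumptions: the name `next_action` and its return values are made up for the example, and treating 2xx and 3xx as success is a choice not taken from the post. With LWP, note that a failed connection is reported as a locally generated 500 response carrying a `Client-Warning: Internal response` header, so the caller has to tell that case apart from a real 500 before calling something like this.

```perl
use strict;
use warnings;

# Decide what to do with a link given the outcome of one request.
# $code is the HTTP status; $no_response is true when no HTTP response was
# received at all (connection refused, timeout, no route to host).
sub next_action {
    my ($code, $no_response) = @_;
    return 'retry_later' if $no_response;                    # network trouble, not a verdict
    return 'ok'          if $code >= 200 && $code < 400;     # assumption: 2xx/3xx count as fine
    return 'retry_later' if $code == 408 || $code == 500;    # possibly temporary
    return 'give_up'     if $code >= 400 && $code < 600;     # the server said no
    return 'retry_later';                                    # anything unexpected
}

# Try a few outcomes: [status code, no_response flag].
print "$_->[0]: ", next_action(@$_), "\n"
    for [200, 0], [404, 0], [408, 0], [500, 0], [503, 0], [0, 1];
```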
Re: Re: Re: Verifying external web links by Masem (Monsignor) on Dec 06, 2001 at 01:24 UTC
I'd argue that 404 should be rechecked too, though most likely any site that starts off with a 404 error will end up off the list, more so than sites with 408s, 500s, or connection problems. Sometimes, if you've linked 'deep' into a site (anywhere off the front page, or in a user's account), the server's storage might be switched around, and for a short time you might get 404s, but outside that window the page would be accessible normally. There are other reasons I can think of as well, uncommon but not unlikely, for which I'd check pages repeatedly regardless of the error. That said, it certainly would not be too hard for such a tool to report in a log file why links were removed, allowing the person to chase down those that might be recoverable (404s commonly), as opposed to those that are probably lost for good (no connection over several attempts).
(RhetTbull) Re: Verifying external web links by RhetTbull (Curate) on Dec 05, 2001 at 19:38 UTC
Didn't merlyn already write an article or two or three on that?
Re: Re: Verifying external web links by nu2perl (Initiate) on Dec 05, 2001 at 21:08 UTC
First off, thank you (and everyone else) for taking the time to respond to my question. Yes, indeed you are right. Merlyn did write an article about it (Web Techniques, Oct 1996). What confused/intimidated me was a line that said "Line 3 extends the built in library search path to include the location of my locally downloaded CPAN items". Please do not flame me for my ignorance. Installing LWP seemed to have several dependencies itself. If only the required libraries were listed, it would have been easier. I'm not looking to be spoonfed the answer, but these sorts of ambiguities make it easy for a task like this to slide to the back burner.
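For what it's worth, a line like the one quoted from the article is most likely a `use lib` statement. A minimal sketch, assuming the modules were unpacked into the directory shown (the path is hypothetical, and the article's actual line 3 is not reproduced here):

```perl
use strict;
use warnings;

# Hypothetical directory where locally downloaded CPAN modules were unpacked.
use lib '/home/nu2perl/perl-lib';

# "use lib" adds the directory to the front of @INC, the list of places
# Perl searches when it meets "use Some::Module".
print "Perl now also searches: $_\n"
    for grep { $_ eq '/home/nu2perl/perl-lib' } @INC;
```

If the modules are installed in Perl's standard location, the line is not needed at all. As for the dependencies, the CPAN shell (`perl -MCPAN -e shell`) can usually fetch a module's prerequisites for you.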
Re: Verifying external web links by CubicSpline (Friar) on Dec 05, 2001 at 18:39 UTC
I do the same sort of thing and I've just used the LWP::UserAgent and HTTP::Request::Common modules to request the external resource and check the response status for a 404. Would something like that work for you? ~CS
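A rough sketch of that approach follows. It is not CubicSpline's actual script: the name `check_links` and the 15-second timeout are assumptions, and it calls LWP::UserAgent's `head` method, a shortcut for sending a HEAD request rather than building the request by hand with HTTP::Request::Common. The user agent is passed in as a parameter so the checking logic can be exercised without touching the network.

```perl
use strict;
use warnings;

# Check each URL with a HEAD request and return { url => status code }.
# $ua is any object that provides LWP::UserAgent's head() method.
sub check_links {
    my ($ua, @urls) = @_;
    my %status;
    for my $url (@urls) {
        my $response = $ua->head($url);
        $status{$url} = $response->code;
    }
    return \%status;
}

# Run as: perl checklinks.pl http://example.com/ http://example.org/page
if (@ARGV) {
    require LWP::UserAgent;    # not a core module; install it from CPAN first
    my $ua     = LWP::UserAgent->new(timeout => 15);
    my $status = check_links($ua, @ARGV);
    for my $url (sort keys %$status) {
        my $note = $status->{$url} == 404 ? '  <- not found' : '';
        print "$status->{$url} $url$note\n";
    }
}
```

Some servers answer HEAD differently from GET, so a link that fails this check may deserve a second look with a full GET before it is dropped.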
Re: Verifying external web links by Lucky (Scribe) on Dec 05, 2001 at 19:23 UTC
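Simple solution. Not comprehensive, but it works without LWP.
`use strict;
use warnings;
use IO::Socket::INET;

for my $url (@ARGV) {
    # Split "http://host/path" into host and path.
    my ($host, $path) = $url =~ m|^(?:http://)?([^/]+)(.*)$| or next;
    my $s = IO::Socket::INET->new(PeerAddr => $host, PeerPort => 80,
                                  Proto => 'tcp', Type => SOCK_STREAM)
        or next;    # could not connect, so nothing to validate
    # HTTP wants CRLF line endings; an empty path means "/".
    print $s "GET " . ($path || '/') . " HTTP/1.0\r\nHost: $host\r\n\r\n";
    my $status = <$s>;    # the first line of the reply is the status line
    print "Link $url is validated\n"
        if defined $status && $status =~ m|^HTTP/\S+ 200\b|;
    close $s;
}`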
Re: Verifying external web links by strat (Canon) on Dec 05, 2001 at 19:42 UTC
For extracting all links from an HTML page, I often use HTML::LinkExtor. The Perl Cookbook says something like:
`#!perl -w
use strict;
use HTML::LinkExtor;
use LWP::Simple;

my $baseUrl = $ARGV[0] || die "Usage: $0 url\n";
my $html    = get($baseUrl);
die "Could not fetch $baseUrl\n" unless defined $html;

# Passing the base URL makes the parser return absolute links.
my $parser = HTML::LinkExtor->new(undef, $baseUrl);
$parser->parse($html)->eof;

foreach ($parser->links) {
    # Each entry is [tag, attribute => url, attribute => url, ...].
    my ($eltType, @elements) = @$_;
    while (@elements) {
        my ($attrName, $attrValue) = splice(@elements, 0, 2);
        print "$eltType: $attrName, $attrValue\n";
    } # while
} # foreach`

