PerlMonks  

Verifying external web links

by nu2perl (Initiate)
on Dec 05, 2001 at 18:34 UTC ( [id://129607]=perlquestion )

nu2perl has asked for the wisdom of the Perl Monks concerning the following question:

I maintain several web sites. It would be handy for me to be able to run a script and report back on what external references no longer exist. I do not want to retrieve the actual page, just the status. I realize this does not guarantee that the content has not changed since I linked to it, but knowing it is still there would be a good start. Thanks in advance.

Replies are listed 'Best First'.
Re: Verifying external web links
by Masem (Monsignor) on Dec 05, 2001 at 18:58 UTC
    As already suggested, LWP is your solution.

    However, I will point out that any solution should not be 'try once and fail', but should instead be along the lines of '3 strikes and then fail'. That is, with the connectivity of the internet today, while most major commercial sites are up 99.9+% of the time, many offbeat sites will sometimes be inaccessible because of a lower-grade ISP (e.g., one serving residential broadband). These sites might not be up at the time you try them, but maybe 2 minutes, 2 hours, or 2 days later they will be. The best way to do link checking is to test a site; if it is not there, try it again the next day, then the next week, and then possibly the week after that, ideally at sufficiently different times of the day (midnight, 6 a.m., noon, 6 p.m.). This should cover things like DNS resolution issues, network outages, and equipment replacements that might occur. If a site fails all 3 or 4 times, then it's probably gone.
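The "3 strikes" idea can be sketched as a small piece of bookkeeping, shown here in Perl. This is only an illustration under stated assumptions: the sub name record_result, the %state hash, and the threshold of 3 are hypothetical choices, and the actual link check is left to the caller.

```perl
use strict;
use warnings;

my $MAX_STRIKES = 3;    # assumed threshold; the post suggests 3 or 4 attempts

# Update the strike count for $url in %$state and report where the link stands.
# $ok is true when this run's check succeeded.
sub record_result {
    my ($state, $url, $ok) = @_;
    if ($ok) {
        delete $state->{$url};    # any success clears earlier strikes
        return 'ok';
    }
    my $strikes = ++$state->{$url};
    return $strikes >= $MAX_STRIKES ? 'dead' : 'pending';
}
```

For this to span days or weeks, %state would have to be saved between runs (the core Storable module is one option) and the script scheduled at varying times of day; both are left out of the sketch.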

    -----------------------------------------------------
    Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
    "I can see my house from here!"
    It's not what you know, but knowing how to find it if you don't know that's important

      Whether or not you retry should depend on the nature of the failure. If you get back a 400- or 500-series response, you should generally stop there, since the server has pretty much stated, "No way, no how." A possible exception to this would be a 408 (timeout) response and arguably 500, since it's possible the error is temporary.

      If, on the other hand, the request fails because of a connection problem (connection refused, timed out, no route to host), as you discuss, I might wait a bit (hours? days?) and try again.
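One way to express that rule in Perl is a small classifier, sketched below. The sub name classify and its return values are hypothetical, and treating 2xx and 3xx alike as success is an assumption. With LWP, a failure that happened before any server answered can usually be recognized by the "Client-Warning: Internal response" header that LWP::UserAgent puts on the error responses it generates itself.

```perl
use strict;
use warnings;

# Decide what to do with a link, given the HTTP status code and a flag that is
# true when the failure happened before any server replied.
# Returns 'ok', 'retry' or 'dead'.
sub classify {
    my ($code, $no_server_reply) = @_;
    return 'retry' if $no_server_reply;                # refused, timed out, no route
    return 'ok'    if $code >= 200 && $code < 400;     # assumption: 3xx counts as alive
    return 'retry' if $code == 408 || $code == 500;    # possibly temporary
    return 'dead'  if $code >= 400 && $code < 600;     # server said no
    return 'retry';                                    # anything unexpected
}
```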

        I'd argue that 404 should be rechecked too, though most likely any site that starts off with a 404 error will end up off the list, more so than 408s, 500s, or connection problems. Sometimes, if you've linked 'deep' into a site (anywhere off the front page, or in a user's account), the server's storage might be switched around, and for a short time you might get 404s, while outside that window the page would be accessible normally. There are other reasons I can think of as well, uncommon but not unlikely, for which I'd check pages repeatedly regardless of the error.

        That said, it certainly would not be too hard with such a tool to report in a log file why links were removed, allowing the person to chase down those that might be recoverable (404s commonly), as opposed to those that are probably lost for good (no connection over several attempts).
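Such a log could be as simple as one tab-separated line per removed link, as in the Perl sketch below. The sub names, the field order, and the file layout are assumptions made for illustration, not part of any existing tool.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Build one tab-separated log line: UTC timestamp, URL, outcome, reason.
# $time is optional and defaults to the current time.
sub format_log_entry {
    my ($url, $outcome, $reason, $time) = @_;
    $time = time unless defined $time;
    my $stamp = strftime('%Y-%m-%d %H:%M:%S', gmtime $time);
    return join("\t", $stamp, $url, $outcome, $reason) . "\n";
}

# Append an entry to the log file named by $path.
sub log_removal {
    my ($path, @entry) = @_;
    open my $fh, '>>', $path or die "Cannot append to $path: $!";
    print {$fh} format_log_entry(@entry);
    close $fh or die "Cannot close $path: $!";
}
```

Filtering the log on the reason column then separates the 404s worth chasing from the links that never connected.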


(RhetTbull) Re: Verifying external web links
by RhetTbull (Curate) on Dec 05, 2001 at 19:38 UTC
      First off, thank you (and everyone else) for taking the time to respond to my question. Yes, indeed, you are right: merlyn did write an article about it (Web Techniques, Oct 1996). What confused and intimidated me was a line that said "Line 3 extends the built in library search path to include the location of my locally downloaded CPAN items". Please do not flame me for my ignorance. Installing LWP seemed to have several dependencies of its own. If only the required libraries were listed, it would have been easier. I'm not looking to be spoonfed the answer, but these sorts of ambiguities make it easy for a task like this to slide to the back burner.
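The line quoted from the article most likely describes a `use lib` statement, which adds a directory to the list Perl searches when loading modules. A minimal sketch, in which the directory is a made-up example and not the path used in the article:

```perl
use strict;
use warnings;

# Hypothetical location of privately installed modules; adjust it to wherever
# your own copies of LWP and its prerequisites live.
use lib '/home/you/perl/lib';

# Show the resulting search path. "use lib" adds its argument at the front,
# so modules found there win over system-wide copies.
print "$_\n" for @INC;
```

If the modules are installed in the system-wide Perl library directories, no such line is needed at all.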
Re: Verifying external web links
by CubicSpline (Friar) on Dec 05, 2001 at 18:39 UTC
    I do the same sort of thing and I've just used the LWP::UserAgent and HTTP::Request::Common modules to request the external resource and check the HTTP headers for 404. Would something like that work for you?
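A minimal sketch of that approach, assuming LWP is installed: a HEAD request asks the server for headers only, so the page body is never downloaded. The sub name link_status and the 15-second timeout are illustrative choices, not something taken from the posts here.

```perl
use strict;
use warnings;

# Return the numeric HTTP status for $url using a HEAD request, so that only
# headers are transferred. $ua is any object with an LWP::UserAgent-style
# head() method.
sub link_status {
    my ($ua, $url) = @_;
    my $response = $ua->head($url);
    return $response->code;
}

# Only touch the network when URLs are given on the command line.
if (@ARGV) {
    require LWP::UserAgent;
    my $ua = LWP::UserAgent->new(timeout => 15);
    for my $url (@ARGV) {
        my $code = link_status($ua, $url);
        print "$code $url\n";
    }
}
```

Some servers answer HEAD with 405 or 501 even though GET works, so a fallback to GET for those codes may be worth adding.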

    ~CS

Re: Verifying external web links
by Lucky (Scribe) on Dec 05, 2001 at 19:23 UTC
    Simple solution. Not comprehensive, but it works without LWP.
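    use strict;
    use IO::Socket::INET;

    for my $url (@ARGV) {
        # Split the URL into host and path (plain http on port 80 only).
        my ($host, $path) = $url =~ m|^(?:http://)?([^/]+)(.*)| or next;
        my $s = IO::Socket::INET->new(PeerAddr => $host, PeerPort => 80,
                                      Proto => 'tcp', Type => SOCK_STREAM)
            or next;    # skip hosts we cannot connect to
        print $s "GET " . ($path || '/') . " HTTP/1.0\r\nHost: $host\r\n\r\n";
        my $status = <$s>;    # the first response line carries the status code
        print "Link $url is validated\n"
            if defined $status && $status =~ /^HTTP\S* 200\b/;
        close $s;
    }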
Re: Verifying external web links
by strat (Canon) on Dec 05, 2001 at 19:42 UTC
    For extracting all links from an HTML page, I often use HTML::LinkExtor.

    The Perl Cookbook has something like:

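    #!perl -w
    use strict;
    use HTML::LinkExtor;
    use LWP::Simple;

    my $baseUrl = $ARGV[0] or die "Usage: $0 url\n";
    my $html = get($baseUrl);
    die "Could not fetch $baseUrl\n" unless defined $html;

    # Passing the base URL makes the parser return absolute links.
    my $parser = HTML::LinkExtor->new(undef, $baseUrl);
    $parser->parse($html)->eof;

    foreach ($parser->links) {
        my ($eltType, @elements) = @$_;
        # @elements is a flat list of attribute name/value pairs.
        while (@elements) {
            my ($attrName, $attrValue) = splice(@elements, 0, 2);
            print "$eltType: $attrName, $attrValue\n";
        } # while
    } # foreach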
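Since the original question is about external references, the extracted links could then be narrowed to those pointing at a different host. The sketch below does this with a plain regex rather than the URI module; the sub names are hypothetical, and URLs with a user:password@ part or other unusual forms are not handled.

```perl
use strict;
use warnings;

# Extract the lowercased host from an absolute http/https URL, or undef.
sub url_host {
    my ($url) = @_;
    return $url =~ m{^https?://([^/:?#]+)}i ? lc $1 : undef;
}

# Keep only the URLs whose host differs from that of $base_url.
sub external_links {
    my ($base_url, @urls) = @_;
    my $base_host = url_host($base_url);
    return grep {
        my $host = url_host($_);
        defined $host && (!defined $base_host || $host ne $base_host);
    } @urls;
}
```

Feeding it the attribute values printed by the loop above would leave only the off-site URLs to check.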
