http://www.perlmonks.org?node_id=166008

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm a newbie to Perl. I have a site with various external URLs in a flat-file database. If I have the URL in a variable in my program, what would I have to do to check the URL to see if it's dead or not? I read through other similar questions. I didn't get it. One, I'm not sure if I have LWP. Two, I just didn't get it. The best I found was this:
    use strict;
    use IO::Socket::INET;
    for (@ARGV){
        s|http://||;
        m|([^/]+)(.*)|;
        my $s=IO::Socket::INET->new(PeerAddr=>$1,PeerPort=>80,Proto=>'tcp',Type=>SOCK_STREAM);
        print $s "GET ".($2||'/')." HTTP/1.0\nHost: $1 \n\n";
        print "Link $_ is validated\n" if <$s>=~/200 OK/;
        close $s;
    }
However, I'm clueless as to why it's a for. Also, it looks like it's reading from a default variable and I don't know how to set that. Could anyone explain either what this code is doing (line by line) or give me a better snippet of code? I don't need something that checks many links, just one.

Replies are listed 'Best First'.
Re: Can You Explain How to Check a Link for Deadness
by choocroot (Friar) on May 12, 2002 at 19:39 UTC
    # The script takes the list of URLs on the command line,
    # so you should call it like this:
    #   perl myscript.pl http://site/f1.html http://site/f2.html ...
    # Command-line arguments are stored in the @ARGV array.

    # For each URL in @ARGV, put the current URL in $_
    # ($_ is the "default" variable in Perl).
    for (@ARGV){
        # Remove the leading "http://" part of $_
        s|http://||;
        # Extract the server name into $1 and the file name into $2
        # from $_ (see the "perlre" documentation for this)
        m|([^/]+)(.*)|;
        # Open a TCP socket connection to the server $1 on port 80
        my $s=IO::Socket::INET->new( PeerAddr=>$1, PeerPort=>80, Proto=>'tcp', Type=>SOCK_STREAM );
        # Send a simple HTTP GET request to the server for
        # file $2, or for "/" if $2 is empty.
        print $s "GET ".($2||'/')." HTTP/1.0\nHost: $1 \n\n";
        # Read the first line of the answer (with <$s>) from the
        # server and print "Link xxx is validated" if the server
        # answered positively to the request (the status line
        # contains "200 OK" when the file is present)
        print "Link $_ is validated\n" if <$s>=~/200 OK/;
        # Close the connection
        close $s;
        # and move on to the next URL
    }
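    Since the question is about one URL held in a variable, here is a minimal sketch of just the URL-splitting step, written with a named variable instead of $_. The name split_url is made up for illustration, and the sketch assumes plain http:// URLs on the default port.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): split a plain
# http:// URL into the host name and the path to request.
sub split_url {
    my ($url) = @_;
    (my $rest = $url) =~ s|^http://||;           # drop the scheme
    my ($host, $path) = $rest =~ m|^([^/]+)(.*)|
        or return;                               # nothing that looks like a host
    return ($host, $path eq '' ? '/' : $path);   # an empty path means "/"
}

my $url = 'http://www.perlmonks.org/index.pl';
my ($host, $path) = split_url($url);
print "host=$host path=$path\n";
```

    The two returned values play the roles of $1 and $2 in the loop above, so they could be passed to IO::Socket::INET->new and to the GET line in the same way.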

    You can use the LWP package (run perl -e 'use LWP' to check whether LWP is installed: it prints nothing if it is, and dies with "Can't locate LWP.pm" if it is not).
    With LWP this could be rewritten like this:

    use strict;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;

    foreach my $url (@ARGV) {
        my $request = HTTP::Request->new( GET => "$url" );
        my $response = $ua->request( $request );
        if( $response->is_success ) {
            print "link $url is ok\n"
        }
    }

    LWP provides a higher level of abstraction: you don't need to handle the "low level" socket creation and communication yourself.
    Read the documentation for LWP and HTTP::Request for further details.

    Good luck :)

      A pretty thorough explanation, ++. A tip: perl -MLWP -e1 is shorter and will do the same thing.

      --
      perl -pew "s/\b;([mnst])/'$1/g"

Re: Can You Explain How to Check a Link for Deadness
by DigitalKitty (Parson) on May 12, 2002 at 20:19 UTC
    I don't need something that checks many links, just one.

    Hi.

    One 'quick and dirty' solution is to use the LWP::Simple module and check the value that get returns for the URL that was entered.

    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;

    my $url;
    my $site;

    print "URL to check: ";
    chomp($url = <STDIN>);
    $site = get($url);

    if($site) {
        print "$url is good.\n";
    } else {
        print "$url appears to be broken.\n";
    }

    Sample run with output:

    C:\perl>perl linkcheck.pl
    URL to check: http://www.perlmonks.org
    http://www.perlmonks.org is good.

    C:\perl>perl linkcheck.pl
    URL to check: http://www.google.com
    http://www.google.com is good.

    C:\perl>perl linkcheck.pl
    URL to check: http://www.blahblahblah.com
    http://www.blahblahblah.com appears to be broken.

    C:\perl>


    Hope this helps,

    -DigitalKitty
      If you replace the call to LWP::Simple's get function with a call to head, you can save a good amount of time, because head only checks for the presence of the page instead of downloading all of it.
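      A minimal sketch of that substitution, assuming LWP::Simple is installed. The name link_is_alive is made up for illustration. One caveat: a server that refuses HEAD requests would be reported as broken even though a GET might succeed.

```perl
use strict;
use warnings;
use LWP::Simple qw(head);

# Hypothetical wrapper: true if a HEAD request for $url succeeds.
sub link_is_alive {
    my ($url) = @_;
    return head($url) ? 1 : 0;   # head() returns a false value on failure
}

# Check the single URL given on the command line, if there is one.
if (defined(my $url = shift @ARGV)) {
    print "$url ", (link_is_alive($url) ? "is good" : "appears to be broken"), ".\n";
}
```

      In a program that already has the URL in a variable, only the sub is needed: call link_is_alive($url) and branch on the result.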


      $|=$_="1g2i1u1l2i4e2n0k",map{print"\7",chop;select$,,$,,$,,$_/7}m{..}g

Re: Can You Explain How to Check a Link for Deadness
by tachyon (Chancellor) on May 13, 2002 at 01:18 UTC
    use LWP::Simple;

    my $page = 'http://www.perlmonks.org/';
    $headers = head($page);
    print $headers->{'_msg'}," ", $headers->{'_rc'}, "\n\n";

    # have a look at the info we get back for interest sake
    use Data::Dumper;
    print Dumper $headers;

    __DATA__
    OK 200

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Can You Explain How to Check a Link for Deadness
by alien_life_form (Pilgrim) on May 12, 2002 at 20:51 UTC
    Greetings,

    In addition to all the good things that have been said about LWP, there are things that LWP gets right that the sample code does not: authentication and proxies, for instance (though I am not sure that LWP::Simple fits the bill completely).

    As for checking whether you have LWP:
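    To illustrate the proxy point, here is a rough sketch using LWP::UserAgent directly. The helper names make_agent and url_responds, the 10-second timeout, and the placeholder credentials are all assumptions made for the example, not anything from the posts above.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: build a user agent that picks up proxy settings
# from the environment (http_proxy, no_proxy, ...) and gives up after
# 10 seconds.
sub make_agent {
    my $ua = LWP::UserAgent->new(timeout => 10);
    $ua->env_proxy;
    # For a password-protected page something like this would be needed
    # (host:port, realm, user and password are placeholders):
    # $ua->credentials('www.example.com:80', 'Some Realm', 'user', 'secret');
    return $ua;
}

# Hypothetical helper: true if a HEAD request for $url succeeds.
sub url_responds {
    my ($ua, $url) = @_;
    return $ua->head($url)->is_success ? 1 : 0;
}

# Check the single URL given on the command line, if there is one.
if (defined(my $url = shift @ARGV)) {
    my $ua = make_agent();
    print "$url ", (url_responds($ua, $url) ? "is ok" : "looks dead"), "\n";
}
```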

    perl -MLWP -e 'print "Hello\n"'

    Cheers,
    alf
    You can't have everything: where would you put it?
Re: Can You Explain How to Check a Link for Deadness
by CharlesClarkson (Curate) on May 13, 2002 at 03:40 UTC

    Don't throw away a link that fails. Stick it in another file to be checked again later. Servers go down and sometimes dead links aren't dead.
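    One way this could look, as a sketch only: append each failing URL to a plain text file and read that file back on a later run. The helper names and the one-URL-per-line file format are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: append a URL that failed its check to a "retry"
# file so it can be tested again later instead of being discarded.
sub record_for_retry {
    my ($file, $url) = @_;
    open my $fh, '>>', $file or die "Cannot append to $file: $!";
    print {$fh} "$url\n";
    close $fh or die "Cannot close $file: $!";
    return;
}

# Hypothetical helper: read back the URLs queued for another check.
sub urls_to_retry {
    my ($file) = @_;
    open my $fh, '<', $file or return;   # no file yet means nothing to retry
    chomp(my @urls = <$fh>);
    close $fh;
    return @urls;
}
```

    A link would then only be removed from the database after it has failed on several separate occasions.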


    HTH,
    Charles K. Clarkson
    Clarkson Energy Homes, Inc.