Checking "incomplete" URLs

by nop (Hermit)
on Feb 18, 2002 at 22:37 UTC (#146275)
nop has asked for the wisdom of the Perl Monks concerning the following question:

I have a simple program to check the validity of links before submitting them to search engines (code below.)

This code reports that "incomplete" URLs are no good, when in fact they work just fine in a browser (and when submitted to search engines). The "problem" with these URLs, I think, is that the full path all the way down to a file isn't explicitly specified.

My question is how do I get LWP useragent to act like a browser and find the default page in a directory?



package MyUA;
use base qw(LWP::UserAgent);
use strict;
use CGI qw/:standard/;

sub redirect_ok {1};

sub new {
    my $class = shift;
    my $self = new LWP::UserAgent;
    bless($self, $class);
    return $self;
}

sub validURL {
    my ($self, $url) = @_;
    my $req = new HTTP::Request POST => $url;
    my $res = $self->request($req);
    my $content = $res->content;
    return 0 unless $res->is_success;
    return 0 if $content =~ /the page you have requested cannot be found/i;
    return 1;
}

1;

Re: Checking "incomplete" URLs
by rob_au (Abbot) on Feb 18, 2002 at 23:54 UTC
    This is fairly straightforward to fix. Try changing your validURL subroutine to read thus:

    sub validURL {
        my ($self, $url) = @_;
        my $req = new HTTP::Request HEAD => $url;
        my $res = $self->request($req);
        my $content = $res->content;
        return 0 unless $res->is_success;
        return 0 if $content =~ /the page you have requested cannot be found/i;
        return 1;
    }

    Note that I have changed the request method from POST to HEAD. The POST method will not be allowed for most URLs (thereby generating your false-negative results), and while this could be changed to a GET request, the HEAD request method will be more successful for all "valid" URLs, irrespective of the preferred request method.


Re: Checking "incomplete" URLs
by BlueLines (Hermit) on Feb 19, 2002 at 02:53 UTC

    My question is how do I get LWP useragent to act like a browser and find the default page in a directory?

    It has nothing to do with your browser, and everything to do with your web server. I tested your example on a site I had control of (running Apache). Here's what happened:
    [jon@valium jon]$ telnet 80
    Trying
    Connected to
    Escape character is '^]'.
    GET /decss HTTP/1.0

    HTTP/1.1 301 Moved Permanently
    Date: Tue, 19 Feb 2002 02:47:50 GMT
    Server: Apache/1.3.22 (Unix) (Red-Hat/Linux) mod_ssl/2.8.5 OpenSSL/0.9.6b mod_perl/1.24_01
    Location:
    Connection: close
    Content-Type: text/html; charset=iso-8859-1

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <HTML><HEAD>
    <TITLE>301 Moved Permanently</TITLE>
    </HEAD><BODY>
    <H1>Moved Permanently</H1>
    The document has moved <A HREF="">here</A>.<P>
    <HR>
    <ADDRESS>Apache/1.3.22 Server at Port 80</ADDRESS>
    </BODY></HTML>
    Connection closed by foreign host.
    The web server sent me a 301 since /decss wasn't an actual file but a directory. My web browser followed that redirect automatically, which is what browsers are supposed to do when the HTTP method used is GET or HEAD. I suspect your troubles arise because you are using the POST method, for which a client is explicitly forbidden to follow a redirect without confirmation from the user.
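To see the same redirect from Perl rather than telnet, a minimal sketch along these lines may help. It assumes a libwww-perl recent enough to provide HTTP::Response's redirects method; the helper name describe_chain and the 10-second timeout are made up for illustration. URLs are taken from the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: return one line per redirect hop (oldest first),
# followed by a line for the final response.
sub describe_chain {
    my ($res) = @_;
    my @lines;
    for my $hop ($res->redirects) {
        my $location = $hop->header('Location');
        push @lines, sprintf('%s -> %s', $hop->status_line,
            defined($location) ? $location : '(no Location header)');
    }
    push @lines, sprintf('final: %s', $res->status_line);
    return @lines;
}

# URLs come from the command line; nothing is fetched when none are given.
my $ua = LWP::UserAgent->new(timeout => 10);   # timeout is an arbitrary choice
for my $url (@ARGV) {
    print "$url\n";
    print "  $_\n" for describe_chain($ua->get($url));
}
```

Run against a directory URL without its trailing slash, this should list the 301 hop before the final response, because the request is a GET and LWP follows the redirect on its own.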


      By default, LWP::UserAgent automatically follows redirects for any request except a POST. The redirect_ok() method controls this behavior:
      $ua->redirect_ok

      This method is called by request() before it tries to do any redirects. It should return a true value if a redirect is allowed to be performed. Subclasses might want to override this. The default implementation will return FALSE for POST request and TRUE for all others.
      Recently I had to write a script which posted a form on a remote site, and then checked the text of the resulting page to make sure the post succeeded. Unfortunately, there was a redirect to that page.

      First I tried making a subclass with a new redirect_ok() that always returned 1. Unfortunately, LWP::UserAgent used a POST request for the redirect; the remote server returned a 405 error. I ended up writing a redirect_ok() which replaced the POST request object in @_ with a new one that did a GET instead. Ugly, but it worked!

        You could upgrade to the latest libwww and just use the requests_redirectable method from LWP::UserAgent:
        $ua->requests_redirectable( );            # to read
        $ua->requests_redirectable( \@requests ); # to set

        This reads or sets the object's list of request names that "$ua->redirect_ok(...)" will allow redirection for. By default, this is "['GET', 'HEAD']", as per RFC 2068. To change to include 'POST', consider:

        push @{ $ua->requests_redirectable }, 'POST';

        Ilya Martynov
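As a rough sketch of that suggestion, the snippet below wraps the documented push in a small constructor helper. The name make_ua is hypothetical, and the code assumes an LWP::UserAgent new enough to have requests_redirectable. It makes no network request; it only prints the resulting list.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: build a user agent that also follows redirects
# returned in response to POST requests.
sub make_ua {
    my $ua = LWP::UserAgent->new;
    # Add POST only if it is not already in the list.
    push @{ $ua->requests_redirectable }, 'POST'
        unless grep { $_ eq 'POST' } @{ $ua->requests_redirectable };
    return $ua;
}

my $ua = make_ua();
print join(', ', @{ $ua->requests_redirectable }), "\n";
```

This avoids subclassing just to override redirect_ok(), although whether the redirected request should still be a POST depends on the remote site.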

      Hurrah! GET (vs. POST) solved it -- Many thanks, BlueLines! ++
      sub validURL {
          my ($self, $url) = @_;
          my $req = new HTTP::Request GET => $url;
          my $res = $self->request($req);
          my $content = $res->content;
          return 0 if $content =~ /the page you have requested cannot be found/i;
          return 0 unless $content =~ /\S/i;
          return 1;
      }
Re: Checking "incomplete" URLs
by Anonymous Monk on Feb 19, 2002 at 04:30 UTC
    Finding the default page is done by the server, not the browser. The browser (the client) sends whatever it wants, and the server decides what to send back.
Re: Checking "incomplete" URLs
by erikharrison (Deacon) on Feb 20, 2002 at 05:03 UTC

    When you request a directory on a webserver, the server gets to decide what to send you: usually index.html, but not necessarily. This is straight out of lwpcook.pod:

    "If you just want to check if a document is present (i.e. the URL is valid) try to run code that looks like this:

    use LWP::Simple;

    if (head($url)) {
        # ok document exists
    }

    . . ."

    . . . which is the "canonical" way to make sure a url is valid.
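Pulling the thread's suggestions together, one possible checker tries HEAD first and falls back to GET when HEAD does not succeed. This is only a sketch: the fallback is a design choice made here, not something the posts above prescribe, and the name url_ok and the 10-second timeout are assumptions. Redirects are followed for both methods by default, so a directory URL without its trailing slash should pass.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: true if the URL answers HEAD successfully or,
# failing that, GET. Neither method is POST, so redirects are followed.
sub url_ok {
    my ($ua, $url) = @_;
    my $res = $ua->head($url);
    $res = $ua->get($url) unless $res->is_success;
    return $res->is_success ? 1 : 0;
}

# URLs come from the command line; nothing is fetched when none are given.
my $ua = LWP::UserAgent->new(timeout => 10);   # timeout is an arbitrary choice
for my $url (@ARGV) {
    printf "%-4s %s\n", (url_ok($ua, $url) ? 'ok' : 'bad'), $url;
}
```

The extra GET costs a second request for URLs that really are dead, which seems an acceptable price for not rejecting servers that mishandle HEAD.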


